Available models
Nebius AI Studio supports the following model types:
- Text-to-text
- Embedding (feature extraction)
- Text-to-image
- Safety guardrails
- Vision (image-text-to-text)
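For example, embedding (feature extraction) models can typically be called through the same OpenAI-compatible API as the other model types. A minimal sketch, assuming the OpenAI Python SDK, the base URL shown below, an API key exported as NEBIUS_API_KEY, and a placeholder model id (substitute a real embedding model from the models list):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",  # assumed Studio endpoint; check your account
    api_key=os.environ["NEBIUS_API_KEY"],         # assumed environment variable for your key
)

# Placeholder model id; pick an embedding model from the models list.
response = client.embeddings.create(
    model="example-org/example-embedding-model",
    input="Nebius AI Studio supports embedding models.",
)
print(len(response.data[0].embedding))  # dimensionality of the returned vector
```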
Model flavors
Inference performance is determined by how the endpoint is configured, balancing:
- Batch size
- GPU type (L40, H100, H200, B200)
- GPU count
- Inference-time optimizations (e.g., speculative decoding)
Each model is offered in up to two flavors:
- Base: the default endpoint configuration.
- Fast: a speed-optimized configuration, selected by appending -fast to the model name in the API (see the example below).
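For instance, switching from the Base to the Fast flavor only changes the model id in an otherwise identical request. A minimal sketch, assuming the OpenAI Python SDK, the base URL shown, an API key in NEBIUS_API_KEY, and a hypothetical model name:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",  # assumed Studio endpoint; check your account
    api_key=os.environ["NEBIUS_API_KEY"],         # assumed environment variable for your key
)

# Hypothetical model id; the Fast flavor differs only by the "-fast" suffix.
completion = client.chat.completions.create(
    model="example-org/example-model-fast",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```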
These flavors are curated based on infrastructure and model-level expertise. For more specific needs—such as strict latency budgets, cost-efficiency targets, or high-throughput applications—endpoint configurations can be tailored to the customer use case through direct collaboration with the sales and solutions engineering team.
Inference optimizations
Our LLM inference service employs a range of optimization techniques to increase throughput while maintaining model quality. These techniques include:
- KV cache: Stores the attention keys and values computed for previous tokens, so they do not have to be recomputed at every generation step (see the sketch after this list).
- Paged attention: Manages the KV cache in fixed-size blocks (pages) rather than one contiguous buffer per sequence, reducing memory fragmentation and allowing more requests to fit in GPU memory.
- Flash attention: An exact attention implementation that tiles the computation and avoids materializing the full attention matrix in GPU memory, reducing memory traffic and speeding up attention.
- Quantization: A technique that reduces the precision of model weights and activations, decreasing memory usage and computation.
- Continuous batching: A scheduling technique that admits new requests into the running batch as soon as capacity frees up, instead of waiting for the whole batch to finish, keeping the GPU busy and increasing throughput.
- Context caching: Caches the processed context (for example, a shared prompt prefix) so that repeated context across requests does not have to be recomputed.
- Speculative decoding: An advanced technique that uses auxiliary models to pre-generate likely next tokens. The main model verifies these predictions, reducing the number of forward passes required and improving throughput without degrading output quality.
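To illustrate the first item, here is a self-contained NumPy sketch of KV caching (a toy single-head example with random weights, not the actual service implementation): keys and values for earlier tokens are cached, so each decoding step only computes the projections for the new token.

```python
import numpy as np

d = 8  # toy head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))  # random projection weights

k_cache, v_cache = [], []  # grows by one entry per generated token


def attend(x_t):
    """Attention output for one new token embedding x_t, reusing cached K/V."""
    q = x_t @ Wq
    k_cache.append(x_t @ Wk)  # compute K and V only for the new token
    v_cache.append(x_t @ Wv)
    K = np.stack(k_cache)     # (t, d) -- all cached keys so far
    V = np.stack(v_cache)     # (t, d) -- all cached values so far
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()  # softmax over all tokens seen so far
    return weights @ V


for step in range(5):
    out = attend(rng.standard_normal(d))  # feed a new (random) token embedding
print("attention output for the latest token:", out)
```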
Impact on model quality
Our optimization techniques are designed to minimize the impact on model quality. Through extensive testing and evaluation, we have found that our optimized models maintain approximately 99% of the original model’s quality. This means that the optimized models produce nearly identical results to the original models, with only minor differences in output.

The quality impact of each optimization technique is carefully evaluated and monitored to ensure that the cumulative effect of all techniques does not compromise the overall quality of the model. Our goal is to provide a high-throughput inference service that delivers accurate and reliable results, while minimizing the computational resources required.

Generation parameters
You can tune the model’s generation parameters to adapt outputs for your use case and optimize performance or cost. Nebius AI Studio supports most OpenAI-compatible generation parameters. Supported parameters vary by interface:
- Playground: supports the most commonly used parameters.
- API: supports the full set of vLLM parameters.
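A minimal sketch of setting generation parameters through the API, assuming the official OpenAI Python SDK, the base URL shown, an API key in NEBIUS_API_KEY, and a placeholder model id. vLLM parameters that are not part of the OpenAI spec (for example, top_k) can usually be passed via extra_body:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",  # assumed Studio endpoint; check your account
    api_key=os.environ["NEBIUS_API_KEY"],         # assumed environment variable for your key
)

completion = client.chat.completions.create(
    model="example-org/example-model",  # placeholder; use a model id from the models list
    messages=[{"role": "user", "content": "Summarize KV caching in one sentence."}],
    temperature=0.6,
    top_p=0.9,
    max_tokens=128,
    # Parameters outside the OpenAI spec (vLLM-specific) can typically be sent via extra_body.
    extra_body={"top_k": 40},
)
print(completion.choices[0].message.content)
```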
How to view more model parameters
To get an extended list of model parameters, send a request that returns a list of models and specify the verbose=true query parameter in the request.
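For example, a sketch using the requests library, assuming the models endpoint lives at /v1/models under the Studio base URL and returns an OpenAI-style list with a data field:

```python
import os
import requests

# Assumed endpoint and auth header; adjust to your deployment.
response = requests.get(
    "https://api.studio.nebius.ai/v1/models",
    params={"verbose": "true"},  # request the extended parameter list
    headers={"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"},
    timeout=30,
)
response.raise_for_status()
for model in response.json()["data"]:
    print(model["id"])
```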