With Nebius AI Studio, you can test and deploy open-source AI models for generative language processing tasks, such as building virtual assistants and chatbots.

Available models

Nebius AI Studio supports the following model types:
  • Text-to-text
  • Embedding (feature extraction)
  • Text-to-image
  • Safety guardrails
  • Vision (image-text-to-text)
See the full list of available models in Nebius AI Studio.

Model flavors

Inference performance is determined by how the endpoint is configured, balancing the following factors:
  • Batch size
  • GPU type (L40, H100, H200, B200)
  • GPU count
  • Inference-time optimizations (e.g., speculative decoding)
To cover common usage patterns, two model flavors are available:

  • Base
  • Fast

Both deliver identical model outputs. Differences lie in token pricing, latency, and the level of applied optimizations. The Fast flavor uses smaller batch sizes, increased compute allocation, and techniques such as speculative decoding to reduce latency and improve responsiveness. To use the Fast flavor, append -fast to the model name in the API. These flavors are curated based on infrastructure and model-level expertise. For more specific needs—such as strict latency budgets, cost-efficiency targets, or high-throughput applications—endpoint configurations can be tailored to the customer use case through direct collaboration with the sales and solutions engineering team.
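For example, with an OpenAI-compatible client, switching between flavors is only a change of model name. The sketch below is illustrative: the base URL, API key placeholder, and model name are assumptions, so substitute the endpoint and model you actually use.

```python
from openai import OpenAI

# Base URL, API key, and model name below are illustrative assumptions;
# replace them with the values from your Nebius AI Studio account.
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key="YOUR_NEBIUS_API_KEY",
)

prompt = [{"role": "user", "content": "Summarize paged attention in one sentence."}]

# Base flavor
base = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=prompt,
)

# Fast flavor: the same model with "-fast" appended to its name
fast = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-fast",
    messages=prompt,
)

print(base.choices[0].message.content)
print(fast.choices[0].message.content)
```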

Inference optimizations

Our LLM inference service employs a range of optimization techniques to increase throughput while maintaining model quality. These techniques include:
  1. KV cache: Stores the attention key and value tensors computed for previously processed tokens, so they do not have to be recomputed at every decoding step.
  2. Paged attention: Stores the KV cache in fixed-size blocks (pages), reducing memory fragmentation and allowing more sequences to share the same GPU memory.
  3. Flash attention: An exact, memory-efficient attention implementation that tiles the computation instead of materializing the full attention matrix, reducing the memory traffic required for attention calculations.
  4. Quantization: A technique that reduces the precision of model weights and activations, decreasing memory usage and computation.
  5. Continuous batching: Adds incoming requests to the running batch and removes completed ones at each decoding step rather than waiting for a full batch, keeping the GPU busy and increasing throughput.
  6. Context caching: Reuses the cached key-value state of previously processed context (for example, a shared prompt prefix), so repeated context does not have to be recomputed for each request.
  7. Speculative decoding: An advanced technique that uses a smaller auxiliary model to propose likely next tokens, which the main model then verifies. This reduces the number of sequential forward passes required and improves throughput without degrading output quality; a simplified sketch of the idea follows this list.
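To illustrate the idea behind speculative decoding, here is a toy Python sketch of the greedy variant. The two model stubs are hypothetical placeholders rather than real models, and a production implementation verifies all drafted positions in a single batched forward pass of the main model.

```python
# Toy sketch of greedy speculative decoding; not a production implementation.

def draft_model(prefix, k):
    """Cheap draft model: propose k candidate next tokens (hypothetical stub)."""
    return [(prefix[-1] + i + 1) % 100 for i in range(k)]

def target_model_next(prefix):
    """Expensive main model: the token it would generate next (hypothetical stub)."""
    return (sum(prefix) * 31 + 7) % 100

def speculative_step(prefix, k=4):
    """Draft k tokens, then let the main model verify them.

    The longest prefix of draft tokens matching what the main model would
    have produced is accepted; the first mismatch is replaced by the main
    model's own token. Accepting several tokens per verification step is
    what reduces the number of sequential forward passes.
    """
    draft = draft_model(prefix, k)
    accepted = []
    for token in draft:
        expected = target_model_next(prefix + accepted)
        if token == expected:
            accepted.append(token)      # draft token verified, keep it
        else:
            accepted.append(expected)   # mismatch: fall back to the main model
            break
    return accepted

if __name__ == "__main__":
    sequence = [1, 2, 3]
    print("accepted tokens:", speculative_step(sequence))
```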

Impact on model quality

Our optimization techniques are designed to minimize the impact on model quality. Through extensive testing and evaluation, we have found that our optimized models maintain approximately 99% of the original model’s quality; in other words, the optimized models produce nearly identical results to the original models, with only minor differences in output. The quality impact of each optimization technique is carefully evaluated and monitored to ensure that the cumulative effect of all techniques does not compromise the overall quality of the model. Our goal is to provide a high-throughput inference service that delivers accurate and reliable results while minimizing the computational resources required.

Generation parameters

You can tune the model’s generation parameters to adapt outputs to your use case and to optimize performance or cost. Nebius AI Studio supports most OpenAI-compatible generation parameters. Supported parameters vary by interface:
  • Playground — Supports the most commonly used parameters.
  • API — Supports the full set of vLLM parameters; see the example after this list.
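As an illustration, standard OpenAI parameters such as temperature, top_p, and max_tokens can be set directly in the request, while vLLM-style parameters outside the OpenAI schema can be passed through the request body (with the OpenAI Python SDK, via extra_body). The sketch below makes several assumptions: the base URL, the model name, and the extra parameter names are examples rather than a guaranteed list; use the extended model parameter list described in the next section to see what a given model actually accepts.

```python
from openai import OpenAI

# Base URL, API key, and model name are illustrative assumptions.
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key="YOUR_NEBIUS_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    # Common OpenAI-compatible generation parameters
    temperature=0.7,
    top_p=0.9,
    max_tokens=128,
    # vLLM-style parameters outside the OpenAI schema go into the request
    # body; the names below are examples, not a guaranteed set.
    extra_body={"top_k": 40, "repetition_penalty": 1.05},
)

print(response.choices[0].message.content)
```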

How to view more model parameters

To get an extended list of model parameters, send a request that returns the list of models and add the verbose=true query parameter to the request URL.
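A minimal sketch of such a request, assuming the endpoint layout of an OpenAI-compatible API (the base URL is an assumption; use the one from your account):

```python
import requests

# Base URL and API key placeholder are illustrative assumptions.
BASE_URL = "https://api.studio.nebius.ai/v1"
API_KEY = "YOUR_NEBIUS_API_KEY"

# verbose=true asks the models endpoint to return extended parameter details
resp = requests.get(
    f"{BASE_URL}/models",
    params={"verbose": "true"},
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()

for model in resp.json().get("data", []):
    print(model.get("id"))
```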