Nebius AI Inference is elastic by design. When you gradually increase your workload, we scale capacity with you. Most teams on the Startup Tier never need to think about rate limits, but once you start pushing tens of thousands of requests per minute, it helps to know how the platform behaves. Sudden bursts of traffic may be throttled. This page explains the dynamic rate‑limiting system, how it auto‑scales, and the knobs you can turn when you need more headroom.
Need unlimited throughput?
If you have sustained workloads that dwarf the standard limits, reach out to us to upgrade to the Enterprise Tier. Enterprise removes the soft caps described below and gives you dedicated capacity plus an SLA.

How it works

Key idea – Limits are dynamic. If your app stays close to the ceiling, we automatically raise it; if traffic falls off, we scale it back. Every user starts with a default rate limit cap, and that limit grows automatically when you consistently operate close to it.
You can find the defaults in the Rate Limits section of Nebius AI Studio.
We evaluate usage in rolling 15‑minute buckets:
  • Scale‑up rule – When average usage in a 15‑minute window ≥ 80 % of the current limit, the limit for the next window increases by 20 % (× 1.2).
  • Scale‑down rule – When average usage in a 15‑minute window ≤ 50 %, the limit for the next window decreases by one‑third (÷ 1.5).
  • Hard ceiling – The limit can grow to 20 × your base allocation. Beyond that we require an Enterprise plan.
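In code terms, the per‑window adjustment works roughly like the sketch below. The 80 %/50 % thresholds, the ×1.2 and ÷1.5 factors, and the 20× ceiling come from the rules above; the function itself is illustrative, and since this page does not say whether the limit can fall below your base allocation, the sketch assumes it cannot.

```python
def next_window_limit(current_limit: float, avg_usage: float, base_limit: float) -> float:
    """Illustrative sketch of the per-window limit adjustment (RPM or TPM)."""
    ceiling = 20 * base_limit                 # hard ceiling: 20x your base allocation
    utilisation = avg_usage / current_limit   # average usage over the last 15-minute window

    if utilisation >= 0.80:                   # scale-up rule: >= 80% of the current limit
        return min(current_limit * 1.2, ceiling)
    if utilisation <= 0.50:                   # scale-down rule: <= 50% of the current limit
        return max(current_limit / 1.5, base_limit)   # assumed floor at the base allocation
    return current_limit                      # otherwise the limit carries over unchanged


# Example: 60 RPM base limit, ~90% utilisation in the window that just ended
print(next_window_limit(current_limit=60, avg_usage=54, base_limit=60))  # -> 72.0
```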
Visual example, assuming you continue to drive ≥ 80 % utilisation each window:
| Window | Scale factor | RPM limit (requests/min) | TPM limit (tokens/min) |
|---|---|---|---|
| Baseline | 1.00× | 60 | 400,000 |
| + 15 min | 1.20× | 72 | 480,000 |
| + 30 min | 1.44× | 86 | 576,000 |
| + 1 h | 2.07× | 124 | 828,000 |
| + 2 h | 4.30× | 258 | 1,720,000 |
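Each row is simply the 1.2× scale‑up compounded over consecutive 15‑minute windows. The short loop below reproduces the factors; note that the table rounds the factor before multiplying, so the larger TPM figures can differ slightly in the last digits.

```python
base_rpm, base_tpm = 60, 400_000

for windows in (0, 1, 2, 4, 8):    # baseline, +15 min, +30 min, +1 h, +2 h
    factor = 1.2 ** windows        # 20% growth per sustained 15-minute window
    print(f"{windows * 15:>3} min  {factor:.2f}x  "
          f"{base_rpm * factor:,.0f} RPM  {base_tpm * factor:,.0f} TPM")
```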

Monitoring your headroom and handling 429s

Your current rate limits are always visible in the API response headers, which are detailed further down. If you exceed the active limit, you will receive HTTP 429. To avoid this, track your consumption against your limits using the x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens response headers.
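A minimal sketch of how you might watch those headers and back off on 429s; the endpoint URL, authorization header, and retry policy are placeholders rather than a prescribed setup.

```python
import time
import requests

API_URL = "https://api.studio.nebius.ai/v1/chat/completions"  # placeholder endpoint
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}           # placeholder credentials


def post_with_backoff(payload: dict, max_retries: int = 5) -> requests.Response:
    """Send a request, log remaining quota, and retry with backoff on HTTP 429."""
    for attempt in range(max_retries):
        resp = requests.post(API_URL, headers=HEADERS, json=payload)

        # Headroom reported by the platform for the active limit
        remaining_requests = resp.headers.get("x-ratelimit-remaining-requests")
        remaining_tokens = resp.headers.get("x-ratelimit-remaining-tokens")
        print(f"remaining requests={remaining_requests}, tokens={remaining_tokens}")

        if resp.status_code != 429:
            return resp

        time.sleep(2 ** attempt)  # exponential backoff before the next attempt

    raise RuntimeError("Still rate-limited after retries")
```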
Use the Batch API for async workloads – it has significantly higher limits. Read more

Over-the-Limit Processing

In some cases, requests that exceed your rate limit may still be processed. This occurs when the system has spare capacity available, and your request can be handled with a lower priority without affecting other users. These responses include:
x-ratelimit-over-limit: yes
This indicates that your request was successful but you are over your nominal limit. Treat it as an early warning: if you continue to exceed your quota, subsequent requests may be throttled.
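For example, you might check that header after every call and pace yourself proactively; in this sketch, resp is a requests.Response such as the one returned by the snippet above.

```python
import requests


def processed_over_limit(resp: requests.Response) -> bool:
    """True when the platform served this request above your nominal limit."""
    return resp.headers.get("x-ratelimit-over-limit") == "yes"


# Usage: slow your send rate before hard 429s start appearing
# if processed_over_limit(resp):
#     time.sleep(1)  # or apply your own pacing logic
```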

Rate Limit Response Headers

Here is a simple explanation of the response headers you can use to track your limit quotas:
  • x-ratelimit-remaining-requests – how many requests you have left under your active request limit.
  • x-ratelimit-remaining-tokens – how many tokens you have left under your active token limit.
  • x-ratelimit-over-limit – yes when a request above your nominal limit was still processed using spare capacity.

High‑throughput async workloads? Use the Batch API

Asynchronous (batch) inference is optimized for bulk jobs, utilizes spare capacity, and comes with significantly higher base limits and a 50% price discount. Read more.