Need unlimited throughput?
If you have sustained workloads that dwarf the standard limits, reach out to us to upgrade to the Enterprise Tier. Enterprise removes the soft caps described below and gives you dedicated capacity plus an SLA.
How it works
Key idea – Limits are dynamic. Every user starts with a default rate limit cap; you can find the defaults in the Rate Limits section of Nebius AI Studio. If your app consistently operates close to its current limit, the limit grows automatically. If traffic falls off, we scale it back down.
- Scale‑up rule – When average usage in a 15‑minute window is ≥ 80% of the current limit, the limit for the next window increases by 20% (× 1.2).
- Scale‑down rule – When average usage in a 15‑minute window is ≤ 50%, the limit for the next window decreases by one‑third (÷ 1.5).
- Hard ceiling – The limit can grow to 20× your base allocation. Beyond that we require an Enterprise plan.
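The three rules above can be sketched as a single update function. This is an illustrative model, not SDK code: the function name and the assumption that utilization is measured as a fraction of the current limit over the 15‑minute window are ours.

```python
def next_limit(current_limit, avg_usage, base_limit):
    """Sketch of the dynamic rate-limit rules (illustrative, not official).

    avg_usage is the average consumption over the last 15-minute window,
    in the same unit as current_limit (RPM or TPM).
    """
    utilisation = avg_usage / current_limit
    if utilisation >= 0.80:
        new_limit = current_limit * 1.2   # scale up by 20%
    elif utilisation <= 0.50:
        new_limit = current_limit / 1.5   # scale down by one-third
    else:
        new_limit = current_limit         # usage in the middle band: no change
    # Hard ceiling: never exceed 20x the base allocation.
    return min(new_limit, 20 * base_limit)
```

For example, a 60 RPM limit with 50 RPM of average usage (≈ 83% utilization) would grow to 72 RPM in the next window.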
| Window | Scale factor | RPM limit | TPM limit |
|---|---|---|---|
| Baseline | 1.00× | 60 | 400,000 |
| +15 min | 1.20× | 72 | 480,000 |
| +30 min | 1.44× | 86 | 576,000 |
| +1 h | 2.07× | 124 | 828,000 |
| +2 h | 4.30× | 258 | 1,720,000 |
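The scale factors in the table compound per 15‑minute window (1.2 raised to the number of consecutive high‑usage windows). The short script below reproduces the table; rounding the factor to two decimals before applying it is our assumption, chosen to match the rounded numbers shown.

```python
# Reproduce the example table above: a 60 RPM / 400,000 TPM baseline
# scaled up every 15 minutes. Factors are rounded to two decimals
# before being applied (an assumption that matches the table).
base_rpm, base_tpm = 60, 400_000

rows = []
for n in (0, 1, 2, 4, 8):  # baseline, +15 min, +30 min, +1 h, +2 h
    factor = round(1.2 ** n, 2)
    rows.append((factor, round(base_rpm * factor), round(base_tpm * factor)))

for factor, rpm, tpm in rows:
    print(f"{factor:.2f}x  RPM={rpm}  TPM={tpm:,}")
```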
Monitoring your headroom and handling 429s
Your current rate limits are always visible in the API response headers, which are detailed further down. If you exceed the active limit, you will receive HTTP 429. To avoid this, track your consumption against your limits using the x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens response headers.
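A simple way to use those headers is a small policy function that decides how long to pause before the next call: back off exponentially on 429, and optionally pause when the remaining-requests headroom hits zero. This is a sketch; only the two header names come from the documentation above.

```python
def throttle_delay(status_code, headers, attempt):
    """Return seconds to wait before the next request (illustrative sketch).

    headers: response headers dict; attempt: 0-based retry counter.
    """
    if status_code != 429:
        remaining = int(headers.get("x-ratelimit-remaining-requests", 1))
        # Pre-emptive pause when request headroom is exhausted (our choice).
        return 1.0 if remaining == 0 else 0.0
    return float(2 ** attempt)  # exponential backoff: 1 s, 2 s, 4 s, ...
```

Wrap your client calls in a loop that sleeps for `throttle_delay(...)` seconds and retries while the result is 429.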
Use the Batch API for asynchronous workloads – it has significantly higher limits. Read more