POST /v1/completions
Create completion
curl --request POST \
  --url https://api.studio.nebius.com/v1/completions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
  "prompt": "Say this is a test",
  "stream": true,
  "stream_options": null,
  "max_tokens": 100,
  "temperature": 123,
  "top_p": 123,
  "n": 123,
  "logprobs": 123,
  "echo": true,
  "stop": "<string>",
  "presence_penalty": 123,
  "frequency_penalty": 123,
  "logit_bias": {},
  "user": "<string>",
  "extra_body": null,
  "service_tier": "auto"
}'
{
  "id": "cmpl-bd18c4194f544c189578cfcb273a2f74",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?",
        "role": "assistant"
      }
    }
  ],
  "created": 1717516032,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 26,
    "prompt_tokens": 13,
    "total_tokens": 39
  }
}

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Query Parameters

ai_project_id
string | null

Current project ID.

Body

application/json
model
string
required

ID of the model to use.

Examples:

"meta-llama/Meta-Llama-3.1-70B-Instruct"

prompt
required

The prompt(s) to generate completions for, encoded as a string, array of strings, array of tokens, or array of token arrays.

Examples:

"Say this is a test"

stream
boolean | null
default:false

Enable response streaming.

stream_options
object | null

If set to {"include_usage": true}, usage statistics are sent with the last chunk of data.

Examples:

null
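
As a sketch, a request that streams the response and also reports usage would set both fields in the body (illustrative fragment, not a complete request):

"stream": true,
"stream_options": {"include_usage": true}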

max_tokens
integer | null

The maximum number of tokens to generate in the completion.

Examples:

100

temperature
number | null
default:1

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

top_p
number | null
default:1

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

n
integer | null
default:1

How many completions to generate for each prompt.

logprobs
integer | null

Include the log probabilities of the logprobs most likely tokens, as well as the chosen tokens. For example, if logprobs is 5, the API returns a list of the 5 most likely tokens. Because the API always returns the logprob of the sampled token, there may be up to logprobs+1 elements in the response.

echo
boolean | null
default:false

Echo back the prompt in addition to the completion.

stop

Up to 4 sequences where the API will stop generating further tokens.
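
For example, a body fragment with two illustrative stop sequences (both values are made up for this sketch):

"stop": ["\n\n", "###"]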

presence_penalty
number | null
default:0

Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

frequency_penalty
number | null
default:0

Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

logit_bias
object | null

Modify the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect varies per model, but values between -1 and 1 should decrease or increase the likelihood of selection, while values like -100 or 100 should result in a ban or exclusive selection of the relevant token.
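
A minimal sketch of a body fragment that bans one token and mildly boosts another, assuming hypothetical token IDs 50256 and 11 (actual IDs depend on the model's tokenizer):

"logit_bias": {"50256": -100, "11": 5}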

user
string | null

A unique identifier representing your end user, which can help to monitor and detect abuse.

extra_body
object | null

Extra parameters to include in the request.

Examples:

null
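
As a purely hypothetical sketch, provider-specific options could be wrapped in this field; the option name below is invented for illustration only:

"extra_body": {"some_vendor_specific_option": true}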

service_tier
enum<string> | null
default:auto

The service tier to use for the request.

Attributes:
Auto: Automatically chooses the best available tier for the request (Default or OverLimit). Inspect the response to determine which tier was used.
Default: Returns 429 errors when the rate limit is hit; the request does not fall back to the OverLimit tier.
OverLimit: Indicates that the request exceeded the user limit. This tier cannot be set by the user in a request; it is only returned in responses when tier=Auto.
Flex: Does not consume rate-limit credits but runs with lower priority. May still return 429 errors if there are no resources to process the request.

Available options:
auto,
default,
over-limit,
flex
Examples:

"auto"

"flex"

Response

OK

id
string
required

A unique identifier for the completion.

object
string
required

The object type, which is always text_completion.

created
integer
required

The Unix timestamp of when the completion was created.

model
string
required

The model used for the completion.

choices
CompletionChoice · object[]
required

A list of completion choices.

usage
object
required

Usage statistics for the completion request.

service_tier
enum<string>
required

The service tier used for the request.

Attributes:
Auto: Automatically chooses the best available tier for the request (Default or OverLimit). Inspect the response to determine which tier was used.
Default: Returns 429 errors when the rate limit is hit; the request does not fall back to the OverLimit tier.
OverLimit: Indicates that the request exceeded the user limit. This tier cannot be set by the user in a request; it is only returned in responses when tier=Auto.
Flex: Does not consume rate-limit credits but runs with lower priority. May still return 429 errors if there are no resources to process the request.

Available options:
auto,
default,
over-limit,
flex