DeepInfra offers an OpenAI-compatible chat completions API for all LLM models at the best prices for open-source model inference. For other model types (embeddings, image generation, speech, reranking, and more), see More APIs. The endpoint is:
https://api.deepinfra.com/v1/openai
Only three changes are needed to your existing OpenAI code:
  1. Set base_url to https://api.deepinfra.com/v1/openai
  2. Set api_key to your DeepInfra token
  3. Set model to a model from our catalog

Install the SDK

pip install openai

Basic chat completion

from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)
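The usage fields returned with every response can drive simple cost tracking. A minimal sketch, using made-up per-million-token prices (`in_price_per_m`, `out_price_per_m` are placeholders — check the model page for real rates):

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  in_price_per_m=0.5, out_price_per_m=1.5):
    """Rough dollar cost estimate; the default prices are placeholders,
    not DeepInfra's actual rates."""
    return (prompt_tokens * in_price_per_m
            + completion_tokens * out_price_per_m) / 1_000_000


# e.g. feed in usage.prompt_tokens and usage.completion_tokens
cost = estimate_cost(1200, 300)
print(f"${cost:.5f}")  # $0.00105 with the placeholder prices
```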

Multi-turn conversations

To create a longer conversation, include the full message history in every request. The model uses this context to provide better answers.

from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "Respond like a Michelin-starred chef."},
        {"role": "user", "content": "Can you name at least two different techniques to cook lamb?"},
        {"role": "assistant", "content": "Bonjour! Let me tell you, my friend, cooking lamb is an art form..."},
        {"role": "user", "content": "Tell me more about the second method."},
    ],
)

print(chat_completion.choices[0].message.content)
The longer the conversation, the more tokens it uses. The maximum conversation length is determined by the model’s context size.
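One common way to stay inside the context window is to drop the oldest turns while keeping the system prompt. A minimal sketch — the character budget here is a crude stand-in for a real token count, not DeepInfra's tokenizer:

```python
def trim_history(messages, max_chars=8000):
    """Drop the oldest non-system turns until the rough size fits.

    Keeps any system message plus the most recent turns; max_chars
    is an illustrative proxy for a token budget.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and sum(len(m["content"]) for m in system + turns) > max_chars:
        turns.pop(0)  # discard the oldest turn first
    return system + turns


history = [{"role": "system", "content": "Be brief."}]
history += [
    {"role": "user", "content": "x" * 5000},
    {"role": "assistant", "content": "y" * 5000},
    {"role": "user", "content": "Tell me more."},
]
trimmed = trim_history(history)  # oldest user turn is dropped
```

Pass `trimmed` as the `messages` argument in the next request; a production version would count real tokens per model rather than characters.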

Supported parameters

| Parameter | Notes |
| --- | --- |
| model | Model name, or MODEL_NAME:VERSION, or deploy_id:DEPLOY_ID |
| messages | Roles: system, user, assistant |
| max_tokens | |
| stream | See Streaming |
| temperature | |
| top_p | |
| stop | |
| n | |
| presence_penalty | |
| frequency_penalty | |
| response_format | See Structured Outputs |
| tools, tool_choice | See Tool Calling |
| service_tier | Priority inference for tagged models. See Service Tier below. |
| reasoning_effort | Controls reasoning depth for reasoning models. See Reasoning Models. |
We may not be 100% compatible with all OpenAI parameters. Let us know on Discord or by email if something you need is missing.
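These parameters are passed as keyword arguments to `chat.completions.create`, exactly as with OpenAI. A sketch collecting illustrative values (not tuned recommendations) into one request payload:

```python
# Illustrative settings; every key maps 1:1 to a parameter in the
# table above. The values themselves are arbitrary examples.
request = {
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [{"role": "user", "content": "Name three rivers."}],
    "max_tokens": 128,          # hard cap on generated tokens
    "temperature": 0.7,         # higher = more varied sampling
    "top_p": 0.9,               # nucleus sampling cutoff
    "stop": ["\n\n"],           # stop at the first blank line
    "n": 1,                     # number of completions to return
    "presence_penalty": 0.0,    # >0 discourages revisiting topics
    "frequency_penalty": 0.0,   # >0 discourages repeating tokens
}

# Then: openai.chat.completions.create(**request)
```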

Service tier

Set service_tier to "priority" to request priority inference on supported models. Priority requests get faster time-to-first-token and higher throughput during peak demand.
Priority inference incurs a 20% surcharge on top of the model’s standard per-token price.
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

response = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"service_tier": "priority"},
)

print(response.service_tier)  # confirms which tier was actually used
The response includes a service_tier field confirming which tier was used. Not all models support priority tiers — check the model page for availability.

What’s next

Streaming

Stream tokens as they’re generated.

Structured Outputs

Get responses in JSON format.

Tool Calling

Give models access to external functions.

Vision

Send images alongside text.

Reasoning Models

Control chain-of-thought reasoning behavior.