DeepInfra offers an OpenAI-compatible chat completions API for all LLM models at the best prices for open-source model inference. For other model types (embeddings, image generation, speech, reranking, and more), see More APIs. The endpoint is:
https://api.deepinfra.com/v1/openai
The only changes you need to make to your existing OpenAI code are:
  1. Set base_url to https://api.deepinfra.com/v1/openai
  2. Set api_key to your DeepInfra token
  3. Set model to a model from our catalog
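Because the endpoint is OpenAI-compatible, you can also call it over plain HTTP without the SDK. A minimal sketch using only the standard library — the payload mirrors OpenAI's chat completions schema, and actually sending it requires substituting a real DeepInfra token:

```python
import json
import urllib.request

# Chat completions endpoint under the OpenAI-compatible base URL
API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"

payload = {
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [{"role": "user", "content": "Hello"}],
}

# Uncomment to send (replace $DEEPINFRA_TOKEN with a real token):
# req = urllib.request.Request(
#     API_URL,
#     data=json.dumps(payload).encode(),
#     headers={
#         "Content-Type": "application/json",
#         "Authorization": "Bearer $DEEPINFRA_TOKEN",
#     },
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```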

Install the SDK

pip install openai

Basic chat completion

from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

Multi-turn conversations

To create a longer conversation, include the full message history in every request. The model uses this context to provide better answers.
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "Respond like a Michelin-starred chef."},
        {"role": "user", "content": "Can you name at least two different techniques to cook lamb?"},
        {"role": "assistant", "content": "Bonjour! Let me tell you, my friend, cooking lamb is an art form..."},
        {"role": "user", "content": "Tell me more about the second method."},
    ],
)

print(chat_completion.choices[0].message.content)
The longer the conversation, the more tokens it uses. The maximum conversation length is determined by the model’s context size.
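A common pattern is to keep a running history list and append each user and assistant turn before the next request. A minimal sketch — the add_turn helper below is our own illustration, not part of any SDK:

```python
def add_turn(history, role, content):
    """Append one message to the running conversation history."""
    history.append({"role": role, "content": content})
    return history

history = [{"role": "system", "content": "Respond like a Michelin-starred chef."}]
add_turn(history, "user", "Can you name at least two different techniques to cook lamb?")

# Each request sends the full history; append the reply before the next turn:
# reply = openai.chat.completions.create(
#     model="deepseek-ai/DeepSeek-V3", messages=history
# ).choices[0].message.content
# add_turn(history, "assistant", reply)
```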

Supported parameters

Parameter           Notes
model               Model name, or MODEL_NAME:VERSION, or deploy_id:DEPLOY_ID
messages            Roles: system, user, assistant
max_tokens
stream              See Streaming
temperature
top_p
stop
n
presence_penalty
frequency_penalty
response_format     See Structured Outputs
tools, tool_choice  See Tool Calling
service_tier        Priority inference for tagged models. See Service tier below.
reasoning_effort    Controls reasoning depth for reasoning models. See Reasoning Models.
We may not be 100% compatible with all OpenAI parameters. Let us know on Discord or by email if something you need is missing.
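The stream parameter follows the OpenAI SDK's usual streaming behavior: with stream=True the call returns an iterator of chunks whose text arrives as deltas. A sketch of collecting those deltas — the join_stream helper is our own, and the commented SDK call assumes a client configured with the DeepInfra base_url as above:

```python
def join_stream(deltas):
    """Concatenate streamed text deltas, skipping empty or None chunks."""
    return "".join(d for d in deltas if d)

# With the OpenAI SDK pointed at DeepInfra:
# stream = openai.chat.completions.create(
#     model="deepseek-ai/DeepSeek-V3",
#     messages=[{"role": "user", "content": "Hello"}],
#     stream=True,
# )
# text = join_stream(chunk.choices[0].delta.content for chunk in stream)
```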

Service tier

Set service_tier to "priority" to request priority inference on supported models. Priority requests get faster time-to-first-token and higher throughput during peak demand.
Priority inference incurs a 20% surcharge on top of the model’s standard per-token price.
# client is an OpenAI client configured with the DeepInfra base_url,
# as in the earlier examples
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"service_tier": "priority"},
)
The response includes a service_tier field confirming which tier was used. Not all models support priority tiers — check the model page for availability.
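The 20% surcharge is simple to budget for. A small worked sketch — the price here is a placeholder, not a real DeepInfra rate:

```python
def priority_price(standard_price):
    """Apply the 20% priority-tier surcharge to a standard per-token price."""
    return standard_price * 1.20

# e.g. a hypothetical $0.50 per million tokens becomes $0.60 with priority
cost = priority_price(0.50)
```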

What’s next