DeepInfra offers an OpenAI-compatible chat completions API for all LLM models at the best prices for open-source model inference. For other model types (embeddings, image generation, speech, reranking, and more), see More APIs. The endpoint is:
https://api.deepinfra.com/v1/openai
The only changes you need to make to your existing OpenAI code:

- Set `base_url` to `https://api.deepinfra.com/v1/openai`
- Set `api_key` to your DeepInfra token
- Set `model` to a model from our catalog
Install the SDK
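The examples below use the official OpenAI Python SDK, which can be installed from PyPI:

```shell
pip install openai
```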
Basic chat completion
```python
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)
```
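Since the response reports token usage, a per-request cost estimate is a one-liner. A sketch, using hypothetical per-million-token prices (check the model page for the actual rates):

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the request cost in dollars, given per-million-token prices."""
    return (prompt_tokens * input_price_per_m +
            completion_tokens * output_price_per_m) / 1_000_000

# Hypothetical prices: $0.50/M input tokens, $1.50/M output tokens.
cost = estimate_cost(prompt_tokens=1200, completion_tokens=400,
                     input_price_per_m=0.50, output_price_per_m=1.50)
print(f"${cost:.6f}")  # $0.001200
```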
Multi-turn conversations
To create a longer conversation, include the full message history in every request. The model uses this context to provide better answers.
```python
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "Respond like a michelin starred chef."},
        {"role": "user", "content": "Can you name at least two different techniques to cook lamb?"},
        {"role": "assistant", "content": "Bonjour! Let me tell you, my friend, cooking lamb is an art form..."},
        {"role": "user", "content": "Tell me more about the second method."},
    ],
)

print(chat_completion.choices[0].message.content)
```
The longer the conversation, the more tokens it uses. The maximum conversation length is determined by the model’s context size.
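Because every request resends the full history, long conversations eventually approach the context limit. A common client-side workaround, sketched here (not a DeepInfra feature), is to keep the system message and only the most recent turns:

```python
def trim_history(messages, max_turns=4):
    """Keep the system message (if any) plus the last `max_turns` messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]

history = [
    {"role": "system", "content": "Respond like a michelin starred chef."},
    {"role": "user", "content": "Name two ways to cook lamb."},
    {"role": "assistant", "content": "Roasting and braising."},
    {"role": "user", "content": "Tell me more about braising."},
]
trimmed = trim_history(history, max_turns=2)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```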
Supported parameters
| Parameter | Notes |
|---|---|
| `model` | Model name, or `MODEL_NAME:VERSION`, or `deploy_id:DEPLOY_ID` |
| `messages` | Roles: `system`, `user`, `assistant` |
| `max_tokens` | |
| `stream` | See Streaming |
| `temperature` | |
| `top_p` | |
| `stop` | |
| `n` | |
| `presence_penalty` | |
| `frequency_penalty` | |
| `response_format` | See Structured Outputs |
| `tools`, `tool_choice` | See Tool Calling |
| `service_tier` | Priority inference for tagged models. See Service tier below. |
| `reasoning_effort` | Controls reasoning depth for reasoning models. See Reasoning Models. |
We may not be 100% compatible with all OpenAI parameters. Let us know on Discord or by email if something you need is missing.
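With `stream=True`, the SDK yields chunks whose `choices[0].delta.content` carries incremental text, and the consumer concatenates them. A minimal sketch of that accumulation loop, using stand-in chunk objects in place of a live response:

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Concatenate the delta content from a chat-completion stream."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content may be None
            parts.append(delta)
    return "".join(parts)

# Stand-in chunks mimicking the shape the SDK yields.
fake_stream = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["Hel", "lo", "!", None]
]
print(collect_stream(fake_stream))  # Hello!
```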
Service tier
Set service_tier to "priority" to request priority inference on supported models. Priority requests get faster time-to-first-token and higher throughput during peak demand.
Priority inference incurs a 20% surcharge on top of the model’s standard per-token price.
```python
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

response = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"service_tier": "priority"},
)
```
The response includes a service_tier field confirming which tier was used. Not all models support priority tiers — check the model page for availability.
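The 20% surcharge is a flat multiplier on the standard per-token price; for example, with a hypothetical $0.50 per million input tokens:

```python
def priority_price(standard_price: float, surcharge: float = 0.20) -> float:
    """Apply the priority-tier surcharge to a standard per-token price."""
    return standard_price * (1 + surcharge)

# Hypothetical standard price of $0.50/M tokens becomes $0.60/M with priority.
print(f"${priority_price(0.50):.2f}")  # $0.60
```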
What’s next