DeepInfra offers an OpenAI-compatible chat completions API for all LLM models at the best prices for open-source model inference. For other model types (embeddings, image generation, speech, reranking, and more), see More APIs. The endpoint is:
https://api.deepinfra.com/v1/openai
The only changes you need to make to your existing OpenAI code:

- Set `base_url` to `https://api.deepinfra.com/v1/openai`
- Set `api_key` to your DeepInfra token
- Set `model` to a model from our catalog
Install the SDK
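The examples below use the official OpenAI Python SDK, which can be installed from PyPI:

```shell
pip install openai
```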
Basic chat completion
```python
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)
```
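Since the response reports token usage, a per-request cost estimate is a one-liner. A sketch, using hypothetical per-million-token prices (check the model page for the actual rates):

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the request cost in dollars, given per-million-token prices."""
    return (prompt_tokens * input_price_per_m +
            completion_tokens * output_price_per_m) / 1_000_000

# Hypothetical prices: $0.50/M input tokens, $1.50/M output tokens.
cost = estimate_cost(prompt_tokens=1200, completion_tokens=400,
                     input_price_per_m=0.50, output_price_per_m=1.50)
print(f"${cost:.6f}")  # $0.001200
```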
Multi-turn conversations
To create a longer conversation, include the full message history in every request. The model uses this context to provide better answers.
```python
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "Respond like a michelin starred chef."},
        {"role": "user", "content": "Can you name at least two different techniques to cook lamb?"},
        {"role": "assistant", "content": "Bonjour! Let me tell you, my friend, cooking lamb is an art form..."},
        {"role": "user", "content": "Tell me more about the second method."},
    ],
)

print(chat_completion.choices[0].message.content)
```
The longer the conversation, the more tokens it uses. The maximum conversation length is determined by the model’s context size.
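Because every request resends the full history, long conversations eventually approach the context limit. A common client-side workaround, sketched here (not a DeepInfra feature), is to keep the system message and only the most recent turns:

```python
def trim_history(messages, max_turns=4):
    """Keep the system message (if any) plus the last `max_turns` messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]

history = [
    {"role": "system", "content": "Respond like a michelin starred chef."},
    {"role": "user", "content": "Name two ways to cook lamb."},
    {"role": "assistant", "content": "Roasting and braising."},
    {"role": "user", "content": "Tell me more about braising."},
]
trimmed = trim_history(history, max_turns=2)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```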
Supported parameters
| Parameter | Notes |
|---|---|
| `model` | Model name, or `MODEL_NAME:VERSION`, or `deploy_id:DEPLOY_ID` |
| `messages` | Roles: `system`, `user`, `assistant` |
| `max_tokens` | |
| `stream` | See Streaming |
| `temperature` | |
| `top_p` | |
| `stop` | |
| `n` | |
| `presence_penalty` | |
| `frequency_penalty` | |
| `response_format` | See Structured Outputs |
| `tools`, `tool_choice` | See Tool Calling |
| `service_tier` | Priority inference for tagged models. See Service tier below. |
| `reasoning_effort` | Controls reasoning depth for reasoning models. See Reasoning Models. |
We may not be 100% compatible with all OpenAI parameters. Let us know on Discord or by email if something you need is missing.
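With `stream=True`, the SDK yields chunks whose `choices[0].delta.content` carries incremental text, and the consumer concatenates them. A minimal sketch of that accumulation loop, using stand-in chunk objects in place of a live response:

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Concatenate the delta content from a chat-completion stream."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content may be None
            parts.append(delta)
    return "".join(parts)

# Stand-in chunks mimicking the shape the SDK yields.
fake_stream = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["Hel", "lo", "!", None]
]
print(collect_stream(fake_stream))  # Hello!
```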
Service tier
Set service_tier to "priority" to request priority inference on supported models. Priority requests get faster time-to-first-token and higher throughput during peak demand.
Priority inference incurs a 20% surcharge on top of the model’s standard per-token price.
```python
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

response = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"service_tier": "priority"},
)
```
The response includes a service_tier field confirming which tier was used. Not all models support priority tiers — check the model page for availability.
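The 20% surcharge is a flat multiplier on the standard per-token price; for example, with a hypothetical $0.50 per million input tokens:

```python
def priority_price(standard_price: float, surcharge: float = 0.20) -> float:
    """Apply the priority-tier surcharge to a standard per-token price."""
    return standard_price * (1 + surcharge)

# Hypothetical standard price of $0.50/M tokens becomes $0.60/M with priority.
print(f"${priority_price(0.50):.2f}")  # $0.60
```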
What’s next