OpenAI-compatible chat completions API — just change the base URL and model name.
DeepInfra offers an OpenAI-compatible chat completions API for all LLM models at the best prices for open-source model inference. For other model types (embeddings, image generation, speech, reranking, and more), see More APIs. The endpoint is:
https://api.deepinfra.com/v1/openai
The only changes you need to make to your existing OpenAI code:
Set base_url to https://api.deepinfra.com/v1/openai
Set api_key to your DeepInfra API token
Set model to the DeepInfra model you want to use (e.g. deepseek-ai/DeepSeek-V3)
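Because the endpoint follows the OpenAI request/response schema, you can also hit it with plain HTTP. A minimal stdlib sketch of the request the SDK would send (the token and payload here are illustrative placeholders; the request is built but not sent):

```python
import json
import urllib.request

# DeepInfra's chat endpoint follows the OpenAI schema:
# POST <base_url>/chat/completions with a JSON body.
url = "https://api.deepinfra.com/v1/openai/chat/completions"

payload = {
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        # $DEEPINFRA_TOKEN is a placeholder for your real API token.
        "Authorization": "Bearer $DEEPINFRA_TOKEN",
    },
    method="POST",
)

# urllib.request.urlopen(req) would send it; the response body is
# OpenAI-style JSON with a `choices` list.
print(req.full_url)
```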
To create a longer conversation, include the full message history in every request. The model uses this context to provide better answers.
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "Respond like a michelin starred chef."},
        {"role": "user", "content": "Can you name at least two different techniques to cook lamb?"},
        {"role": "assistant", "content": "Bonjour! Let me tell you, my friend, cooking lamb is an art form..."},
        {"role": "user", "content": "Tell me more about the second method."},
    ],
)

print(chat_completion.choices[0].message.content)
The longer the conversation, the more tokens it uses. The maximum conversation length is determined by the model’s context size.
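The API does not trim history for you, so long-running conversations need client-side management. One common approach is a sliding window that keeps the system prompt and drops the oldest turns when the history grows too large. A sketch (the ~4 characters-per-token ratio is a crude heuristic, not the model's real tokenizer; for exact counts, use the model's tokenizer):

```python
# Rough client-side sketch: keep the system prompt, drop oldest turns
# until the estimated token count fits a budget.

def estimate_tokens(messages):
    # Crude heuristic: ~4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def trim_history(messages, max_tokens):
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and estimate_tokens(system + turns) > max_tokens:
        turns.pop(0)  # drop the oldest user/assistant turn
    return system + turns

history = [
    {"role": "system", "content": "Respond like a michelin starred chef."},
    {"role": "user", "content": "x" * 400},       # an old, long turn
    {"role": "assistant", "content": "y" * 400},
    {"role": "user", "content": "Tell me more about the second method."},
]

trimmed = trim_history(history, max_tokens=150)
print([m["role"] for m in trimmed])  # → ['system', 'assistant', 'user']
```

The oldest long user turn is dropped while the system prompt and the most recent exchange survive.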
Set service_tier to "priority" to request priority inference on supported models. Priority requests get faster time-to-first-token and higher throughput during peak demand.
Priority inference incurs a 20% surcharge on top of the model’s standard per-token price.
The response includes a service_tier field confirming which tier was used. Not all models support priority tiers — check the model page for availability.
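With the OpenAI SDK this is just an extra service_tier="priority" keyword argument to chat.completions.create. As a raw request body it is one additional field alongside the usual OpenAI ones (model name is illustrative; confirm priority support on its model page):

```python
import json

# Request body for a priority-tier completion; `service_tier` rides
# alongside the standard OpenAI fields.
payload = {
    "model": "deepseek-ai/DeepSeek-V3",  # check the model page for priority support
    "messages": [{"role": "user", "content": "Hello!"}],
    "service_tier": "priority",  # 20% surcharge on the per-token price
}

body = json.dumps(payload)
print(body)

# The JSON response echoes the tier actually used, e.g.
#   {"id": "...", "service_tier": "priority", ...}
```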