## How it works
When you send a request, DeepInfra checks whether the beginning of your prompt matches a cached prefix from a recent request on the same model. If it does, the cached KV state is reused instead of being recomputed, which:

- Reduces time-to-first-token — the model skips processing the cached portion
- Lowers cost — cached input tokens are billed at a reduced rate
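The lookup described above can be pictured as finding the longest cached prefix of the incoming prompt. This is a simplified sketch of the idea, not DeepInfra's actual implementation — the cache, token IDs, and KV-state values are illustrative:

```python
# Simplified sketch of prefix-based cache lookup (illustrative only --
# not DeepInfra's actual implementation). The cache maps token-prefix
# tuples to precomputed KV state; a new request reuses the longest
# cached prefix and only prefills the remaining tokens.

def longest_cached_prefix(prompt_tokens, cache):
    """Return (length, kv_state) of the longest cached prefix, or (0, None)."""
    best_len, best_state = 0, None
    for prefix, kv_state in cache.items():
        n = len(prefix)
        if n > best_len and tuple(prompt_tokens[:n]) == prefix:
            best_len, best_state = n, kv_state
    return best_len, best_state

# Example: a previously cached 4-token system prompt
cache = {(1, 2, 3, 4): "kv-state-A"}
hit_len, state = longest_cached_prefix([1, 2, 3, 4, 9, 9], cache)
# Only the tokens after position hit_len need to be processed.
```

The longer the shared prefix, the more computation is skipped — which is why the best practices below all aim at keeping a long, stable prefix.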
## Usage
Prompt caching is automatic — no extra parameters required. Just structure your prompts so that the reused content appears at the beginning.

### Best practices
- **Put stable content first.** The cache matches from the beginning of the prompt. Place your system prompt, documents, and few-shot examples before the user's message.
- **Keep the prefix identical.** Even a single character difference will invalidate the cache. Avoid dynamic content (timestamps, user IDs, etc.) in the cacheable prefix.
- **Longer prefixes save more.** Prompt caching is most effective with long, repeated prefixes — think multi-page documents, long system prompts, or RAG context.

### Common use cases
| Use case | Cached prefix |
|---|---|
| Chatbot with a long system prompt | System prompt |
| RAG / document Q&A | Retrieved documents |
| Few-shot classification | Examples |
| Code assistant with a large codebase | Codebase context |
| Multi-turn conversation | Previous turns |
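The best practices above amount to a simple rule for assembling the messages list: stable content first, per-request content last. A minimal sketch (the names and content are illustrative, not a real API call):

```python
# Sketch: ordering an OpenAI-style messages list so the stable,
# cacheable content (system prompt, documents, few-shot examples)
# forms an identical prefix across requests, and only the final
# user turn changes. All names and strings here are illustrative.

SYSTEM_PROMPT = "You are a support assistant for Acme Corp."  # stable
DOCUMENTS = "## Context\n...retrieved documents..."           # stable per session
FEW_SHOT = [
    {"role": "user", "content": "Example question"},
    {"role": "assistant", "content": "Example answer"},
]

def build_messages(user_message: str) -> list:
    """Stable prefix first, dynamic user turn last."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPT + "\n\n" + DOCUMENTS}]
        + FEW_SHOT
        + [{"role": "user", "content": user_message}]
    )

messages = build_messages("How do I reset my password?")
```

Because everything before the final user message is identical from request to request, each request shares the same cacheable prefix.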
## Checking cache usage
The response usage object reports how many tokens were served from the cache.

## Explicit cache keys

By default, caching is automatic based on prefix matching. The `prompt_cache_key` parameter lets you explicitly tag a request with a cache key, improving cache hit rates when your prompts share the same logical content but differ slightly in formatting or ordering.
We recommend using a session-scoped key like `userid-chatsessionid` (e.g. `"user123-chat456"`). Within a single chat session, the conversation history grows incrementally — each new request reuses all previous turns plus one new message. A per-session cache key ensures these near-identical prompts always hit the cache.
Requests with the same `prompt_cache_key` and model will share a KV cache, even if their prompt prefixes aren't byte-for-byte identical.
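A session-scoped key is cheap to derive. The sketch below builds one in the recommended `userid-chatsessionid` format; the `extra_body` line shows how such a non-standard parameter is typically passed through an OpenAI-compatible Python SDK, which is an assumption here — check DeepInfra's API docs for the exact mechanism, and note the model name is a placeholder:

```python
# Sketch: deriving a session-scoped prompt_cache_key in the
# recommended userid-chatsessionid format.

def make_cache_key(user_id: str, session_id: str) -> str:
    """Combine user and chat-session IDs into one cache key."""
    return f"{user_id}-{session_id}"

key = make_cache_key("user123", "chat456")  # "user123-chat456"

# Passing the key per request. The extra_body mechanism is how the
# OpenAI Python SDK forwards non-standard parameters (assumption that
# this applies here); "some-model-name" is a placeholder.
request_kwargs = {
    "model": "some-model-name",
    "extra_body": {"prompt_cache_key": key},
}
```

Every request in the same chat session uses the same key, so the growing conversation history keeps hitting the same cache entry.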
| Parameter | Type | Description |
|---|---|---|
| `prompt_cache_key` | string | An explicit key for cache lookup. Requests with the same key and model share a KV cache. |
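To verify that caching is working, inspect the response usage object. The field names below (`prompt_tokens_details.cached_tokens`) follow the common OpenAI-compatible response shape and are an assumption — confirm the exact fields in DeepInfra's API reference. The usage values are illustrative:

```python
# Sketch: reading cache usage from a chat completion response.
# The prompt_tokens_details.cached_tokens field name follows the
# OpenAI-compatible usage shape (assumption -- verify against the
# actual API response).

def cached_tokens(usage: dict) -> int:
    """Return how many prompt tokens were served from cache (0 if none)."""
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens", 0)

# Illustrative usage payload from a response:
usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 120,
    "prompt_tokens_details": {"cached_tokens": 1792},
}
hit_ratio = cached_tokens(usage) / usage["prompt_tokens"]  # 0.875
```

A low or zero `cached_tokens` count on repeated requests usually means the prefix is being invalidated — check for dynamic content near the start of the prompt.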
## Notes
- Prompt caching is available on supported models — check the model page for details
- Cache entries expire after a period of inactivity
- Caches are per-model and per-account
- When using `prompt_cache_key`, the key is scoped per-model and per-account