Vision & OCR - DeepInfra

DeepInfra hosts multimodal models that accept both images and text as input and produce text output. These models use the standard OpenAI vision API format and cover two major use cases:

Visual understanding — describe images, answer questions about visual content, compare images, analyze charts
OCR (Optical Character Recognition) — extract text from scanned documents, receipts, invoices, screenshots, handwritten notes, and PDFs

Available vision models

Available OCR models

We host a growing set of OCR-specialized models for high-accuracy text extraction. Browse the full OCR model catalog. OCR models currently use the same vision API format described below. A dedicated OCR endpoint optimized for document processing is coming soon.

Quick start

You can pass images in two ways:

URL — pass a link to a publicly accessible image
Base64 — encode the image and include it directly in the request
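Both options end up in the same image_url field, so a small helper can hide the difference. The following is a minimal sketch, not part of any DeepInfra SDK: the function name image_part and the suffix-to-MIME table are assumptions made for illustration.

```python
import base64
from pathlib import Path

# Assumed mapping from file suffix to MIME type (illustrative only).
MIME_BY_SUFFIX = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
}


def image_part(source: str) -> dict:
    """Build an image_url content item from an http(s) URL or a local file path."""
    if source.startswith(("http://", "https://")):
        # Public URL: pass it through unchanged.
        url = source
    else:
        # Local file: embed the bytes as a base64 data URL.
        path = Path(source)
        mime = MIME_BY_SUFFIX.get(path.suffix.lower())
        if mime is None:
            raise ValueError(f"unsupported image type: {path.suffix!r}")
        data = base64.b64encode(path.read_bytes()).decode("utf-8")
        url = f"data:{mime};base64,{data}"
    return {"type": "image_url", "image_url": {"url": url}}
```

The returned dictionary can be placed directly in the "content" list of a user message, next to a "text" item.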

Image URL

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-32B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
            }
          },
          {
            "type": "text",
            "text": "What'\''s in this image?"
          }
        ]
      }
    ]
  }'

Base64 encoded image

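from openai import OpenAI
import base64
import os
import requests

# Read the API token from the environment rather than hard-coding it.
openai = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

# Download the image and encode its bytes as base64.
image_url = "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
image_response = requests.get(image_url)
image_response.raise_for_status()
base64_image = base64.b64encode(image_response.content).decode("utf-8")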

chat_completion = openai.chat.completions.create(
    model="Qwen/Qwen2.5-VL-32B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/webp;base64,{base64_image}"
                    }
                },
                {
                    "type": "text",
                    "text": "What's in this image?"
                }
            ]
        }
    ]
)

print(chat_completion.choices[0].message.content)

OCR example

Extract all text from a document image:

from openai import OpenAI
import base64
import os

# Read the API token from the environment rather than hard-coding it.
openai = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

with open("invoice.png", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

response = openai.chat.completions.create(
    model="Qwen/Qwen2.5-VL-32B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"}
                },
                {
                    "type": "text",
                    "text": "Extract all text from this document. Preserve the structure and layout as much as possible."
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)

Common OCR prompts:

"Extract all text from this image." — basic extraction
"Extract all text and return it as structured JSON with field names and values." — structured extraction (e.g. invoices, forms)
"Transcribe the handwritten text in this image." — handwriting recognition
"List all line items, quantities, and prices from this receipt." — targeted extraction
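When you use the structured extraction prompt, the reply is plain text that still has to be parsed. The sketch below assumes the model returns a single JSON object, possibly wrapped in a Markdown code fence; neither is guaranteed, so treat a parse failure as a normal outcome. The helper name parse_json_reply is hypothetical.

```python
import json
import re

# Three backticks, built programmatically so this example stays readable.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_reply(reply: str) -> dict:
    """Parse a JSON object from a model reply, tolerating a Markdown code fence."""
    match = FENCED_BLOCK.search(reply)
    payload = match.group(1) if match else reply
    # json.loads raises json.JSONDecodeError if the reply is not valid JSON.
    return json.loads(payload.strip())
```

A caller would typically pass response.choices[0].message.content to this function and catch json.JSONDecodeError to retry or fall back to the raw text.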

Multiple images

You can pass multiple images in a single request by including multiple image_url content items:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-32B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {"url": "https://example.com/page1.jpg"}
          },
          {
            "type": "image_url",
            "image_url": {"url": "https://example.com/page2.jpg"}
          },
          {
            "type": "text",
            "text": "Extract all text from both pages."
          }
        ]
      }
    ]
  }'
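The same request body can be assembled in Python when the number of pages varies. This is a minimal sketch; the function name multi_image_message is an assumption, and the example.com URLs are the same placeholders used in the curl example.

```python
def multi_image_message(image_urls: list[str], prompt: str) -> dict:
    """Build one user message with an image_url item per URL, followed by the text prompt."""
    content = [
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    ]
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


message = multi_image_message(
    ["https://example.com/page1.jpg", "https://example.com/page2.jpg"],
    "Extract all text from both pages.",
)
```

The resulting message can be passed as messages=[message] to chat.completions.create.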

Pricing and token counting

Images are tokenized and billed as input tokens. Image tokens are included in the prompt token count reported in the response under "usage": {"prompt_tokens": ...}. Models differ in the image resolutions they work with, but you can pass images of any resolution and the model will rescale them automatically. Check the model's documentation page for supported resolutions.
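To see what an image request cost, you can read prompt_tokens from the usage object and multiply by the model's input price. The sketch below is illustrative only: the function name is hypothetical, the price is a made-up figure rather than a real DeepInfra rate, and prompt_tokens covers the text in the prompt as well as the image.

```python
def input_cost_usd(usage: dict, usd_per_million_input_tokens: float) -> float:
    """Estimate the input-side cost of one request from its usage block."""
    return usage["prompt_tokens"] * usd_per_million_input_tokens / 1_000_000


# Hypothetical usage block and price, for illustration only.
example_usage = {"prompt_tokens": 2000, "completion_tokens": 50}
estimate = input_cost_usd(example_usage, usd_per_million_input_tokens=0.5)
print(f"Estimated input cost: ${estimate:.4f}")
```

With the OpenAI Python client, the same number is available as response.usage.prompt_tokens.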

Limitations

Supported image formats: JPEG (jpg), PNG, WebP
Maximum image size: 20 MB
The detail parameter (image fidelity) is not currently supported
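Checking these limits locally avoids sending a request that will be rejected. The following is a minimal sketch under stated assumptions: the helper name check_image is hypothetical, the format is inferred from the file suffix only, .jpeg is treated as equivalent to .jpg, and "20MB" is read as 20 * 1024 * 1024 bytes.

```python
from pathlib import Path

# Assumption: formats are identified by file suffix; .jpeg is treated like .jpg.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}
# Assumption: the 20MB limit is interpreted as 20 MiB.
MAX_IMAGE_BYTES = 20 * 1024 * 1024


def check_image(path: str) -> None:
    """Raise ValueError if the file's suffix or size falls outside the documented limits."""
    file = Path(path)
    if file.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported image format: {file.suffix!r}")
    size = file.stat().st_size
    if size > MAX_IMAGE_BYTES:
        raise ValueError(f"image is {size} bytes, above the {MAX_IMAGE_BYTES}-byte limit")
```

Calling check_image("invoice.png") before encoding the file returns silently when the file passes both checks.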
