- Visual understanding — describe images, answer questions about visual content, compare images, analyze charts
- OCR (Optical Character Recognition) — extract text from scanned documents, receipts, invoices, screenshots, handwritten notes, and PDFs
Available vision models
Available OCR models
We host a growing set of OCR-specialized models for high-accuracy text extraction. Browse the full OCR model catalog. OCR models currently use the same vision API format below. A dedicated OCR endpoint optimized for document processing is coming soon.Quick start
Images are passed in two ways:- URL — pass a link to a publicly accessible image
- Base64 — encode the image and include it directly in the request
Image URL
Base64 encoded image
OCR example
Extract all text from a document image:"Extract all text from this image."— basic extraction"Extract all text and return it as structured JSON with field names and values."— structured extraction (e.g. invoices, forms)"Transcribe the handwritten text in this image."— handwriting recognition"List all line items, quantities, and prices from this receipt."— targeted extraction
Multiple images
You can pass multiple images in a single request by including multipleimage_url content items:
Pricing and token counting
Images are tokenized and billed as input tokens. The number of tokens consumed by an image is reported in the response under"usage": {"prompt_tokens": ...}.
Different models work with different image resolutions. You can still pass images of any resolution — the model will rescale them automatically. Check the model’s documentation page for supported resolutions.
Limitations
- Supported image formats: jpg, png, webp
- Maximum image size: 20MB
- The
detailparameter (image fidelity) is not currently supported