DeepInfra hosts Whisper and other speech recognition models. Given an audio file, they produce transcribed text with per-sentence timestamps. Browse all speech recognition models.

Models

  • openai/whisper-large — best accuracy
  • openai/whisper-medium, openai/whisper-small, openai/whisper-base — faster, lighter
  • openai/whisper-timestamped-medium — per-word timestamp segmentation

Example

curl -X POST \
    -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
    -F audio=@audio.mp3 \
    'https://api.deepinfra.com/v1/inference/openai/whisper-large'
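
For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal sketch, not an official client: the helper names `build_request` and `send` are hypothetical, and the `application/octet-stream` part type is an assumption (curl picks its own content type for the file). Only the endpoint URL, the `audio` field name, and the bearer token header come from the curl example above.

```python
import urllib.request
import uuid

# Base URL taken from the curl example; the model name is appended to it.
API_BASE = "https://api.deepinfra.com/v1/inference"


def build_request(model: str, audio: bytes, filename: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST with one 'audio' file field."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'.encode(),
        # Assumption: a generic binary content type is acceptable for the upload.
        b"Content-Type: application/octet-stream\r\n\r\n",
        audio,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{API_BASE}/{model}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def send(request: urllib.request.Request) -> str:
    """Send the prepared request and return the raw response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

To mirror the curl call, read `audio.mp3` as bytes, pass it with the model name `openai/whisper-large` and the value of `DEEPINFRA_TOKEN` to `build_request`, then hand the result to `send`. Building and sending are kept separate so the request can be inspected without touching the network.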

Supported audio formats

  • mp3
  • wav

Response

{
  "text": "Hello, this is a transcription of the audio file.",
  "segments": [
    {
      "start": 0.0,
      "end": 3.5,
      "text": "Hello, this is a transcription of the audio file."
    }
  ]
}
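
As an illustration of consuming this response, the sketch below parses the sample JSON and prints one timestamped line per segment. It assumes only the fields shown above (`text`, `segments`, `start`, `end`); the helper names `format_timestamp` and `segment_lines` are hypothetical, and real responses may carry additional fields that this code ignores.

```python
import json

# The sample response from this page, used as stand-in input.
SAMPLE = """
{
  "text": "Hello, this is a transcription of the audio file.",
  "segments": [
    {
      "start": 0.0,
      "end": 3.5,
      "text": "Hello, this is a transcription of the audio file."
    }
  ]
}
"""


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segment_lines(response: dict) -> list[str]:
    """Return one '[start -> end] text' line per segment in the response."""
    return [
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}] {seg['text'].strip()}"
        for seg in response.get("segments", [])
    ]


response = json.loads(SAMPLE)
for line in segment_lines(response):
    print(line)
# prints: [00:00:00.000 -> 00:00:03.500] Hello, this is a transcription of the audio file.
```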

Additional parameters

Each model exposes different parameters (language, task, etc.). Check the model’s API documentation page for details.
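
One plausible way to pass such parameters is as extra form fields next to the `audio` upload. The sketch below assembles the curl argument list under that assumption; the helper name `curl_args` is hypothetical, and the field names and values in the usage note are placeholders, so confirm the actual names and accepted values on the model's API documentation page.

```python
def curl_args(model: str, audio_path: str, token: str, **params) -> list[str]:
    """Assemble a curl argument list with optional extra form fields.

    Assumption: additional parameters are sent as plain multipart form fields.
    """
    args = [
        "curl", "-X", "POST",
        "-H", f"Authorization: Bearer {token}",
        "-F", f"audio=@{audio_path}",
    ]
    for name, value in params.items():
        if value is None:
            continue  # skip parameters the caller left unset
        # --form-string sends the value literally, so a leading '@' or '<'
        # is not interpreted by curl as a file reference.
        args += ["--form-string", f"{name}={value}"]
    args.append(f"https://api.deepinfra.com/v1/inference/{model}")
    return args
```

For example, `curl_args("openai/whisper-large", "audio.mp3", token, language="en", task="transcribe")` returns a list that can be passed to `subprocess.run(..., check=True)`. Passing a list rather than a shell string avoids quoting problems in the token and parameter values.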

Tutorial

See the Whisper tutorial for a complete walkthrough.