## Overview
Benefits:

- Predictable response times (no sharing with other users)
- Autoscaling support
- Run your own fine-tuned or trained-from-scratch model
- Full OpenAI API compatibility
- Billed per GPU-hour, not per token, so you need sufficient load to justify the cost
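Because billing is per GPU-hour rather than per token, a quick break-even estimate helps decide whether a dedicated deployment beats serverless per-token pricing. The prices below are hypothetical placeholders, not DeepInfra's actual rates:

```python
# Rough break-even estimate: dedicated (per-GPU-hour) vs. per-token pricing.
# All prices are hypothetical placeholders.

def breakeven_tokens_per_hour(gpu_hourly_usd: float,
                              num_gpus: int,
                              per_million_tokens_usd: float) -> float:
    """Tokens/hour at which a dedicated deployment costs the same as
    paying per token."""
    hourly_cost = gpu_hourly_usd * num_gpus
    return hourly_cost / per_million_tokens_usd * 1_000_000

# Example: 2 GPUs at $2.00/GPU-hour vs. $0.50 per million tokens.
tokens = breakeven_tokens_per_hour(2.00, 2, 0.50)
print(f"{tokens:,.0f} tokens/hour")  # 8,000,000 tokens/hour
```

If your sustained traffic is well above the break-even rate, the dedicated deployment is cheaper; well below it, per-token pricing wins.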
## Deployment configuration
A deployment has fixed parameters:

| Parameter | Description |
|---|---|
| model_name | Name used for inference calls |
| gpu | A100-80GB, H100-80GB, H200-141GB, B200-180GB, B300-288GB (and more) |
| num_gpus | Number of GPUs (model weights must fit with room for KV cache) |
| max_batch_size | Max parallel requests; additional requests are queued |
| weights | Hugging Face repo (public or private) |

Autoscaling settings can be updated after creation:

| Setting | Description |
|---|---|
| min_instances | Minimum running copies (0 = scale to zero) |
| max_instances | Maximum copies during high load |
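The parameters above can be sketched as a config plus a rough check that the weights fit with room for KV cache. The field names mirror the tables; the memory arithmetic is a back-of-the-envelope assumption, not DeepInfra's actual scheduler logic:

```python
# Hypothetical deployment config mirroring the parameter tables above.
config = {
    "model_name": "my-llama-ft",
    "gpu": "A100-80GB",
    "num_gpus": 2,
    "max_batch_size": 32,
    "weights": "YOUR_GITHUB_USERNAME/model-name",
    "min_instances": 0,   # scale to zero when idle
    "max_instances": 4,
}

def weights_fit(num_params_b: float, bytes_per_param: int,
                gpu_mem_gb: int, num_gpus: int,
                kv_cache_headroom: float = 0.3) -> bool:
    """Rough check: weights must fit with ~30% of GPU memory left
    over for KV cache (headroom fraction is an assumption)."""
    weight_gb = num_params_b * bytes_per_param
    return weight_gb <= gpu_mem_gb * num_gpus * (1 - kv_cache_headroom)

# A 70B model in bf16 (2 bytes/param) needs ~140 GB of weights:
print(weights_fit(70, 2, 80, 2))  # False: too tight on 2x A100-80GB
print(weights_fit(70, 2, 80, 4))  # True: fits on 4x A100-80GB
```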
## Create a deployment

### Web UI

Go to Dashboard → New Deployment → Custom LLM.

### HTTP API

Specify the weights in the form YOUR_GITHUB_USERNAME/model-name.
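As a sketch, a deployment could be created over HTTP like this. The POST route and payload fields are assumptions modeled on the parameter table and the documented DELETE endpoint, not confirmed API documentation; check the dashboard for the exact route:

```python
import json
import urllib.request

# Assumed create route; only DELETE .../deploy/DEPLOY_ID is documented here.
API_BASE = "https://api.deepinfra.com/deploy"

def build_create_request(token: str, config: dict) -> urllib.request.Request:
    """Prepare (but do not send) a deployment-creation request."""
    return urllib.request.Request(
        API_BASE,
        data=json.dumps(config).encode(),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = build_create_request("YOUR_API_TOKEN", {
    "model_name": "my-llama-ft",
    "gpu": "A100-80GB",
    "num_gpus": 2,
    "max_batch_size": 32,
    "weights": "YOUR_GITHUB_USERNAME/model-name",
})
# urllib.request.urlopen(req) would submit it and return a deploy_id.
print(req.get_method(), req.full_url)  # POST https://api.deepinfra.com/deploy
```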
## Monitor your deployment

Track status via Dashboard → Deployments, or via HTTP.

## Use your deployment

Once it's running, inference is available via:

- Web demo: https://deepinfra.com/FULLNAME
- OpenAI ChatCompletions API
- OpenAI Completions API
- DeepInfra inference API
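The OpenAI-compatible ChatCompletions route listed above can be called with plain HTTP, as sketched below. The `/v1/openai` base path is an assumption here; verify the exact endpoint in your dashboard:

```python
import json
import urllib.request

# Assumed OpenAI-compatible base path; verify against your dashboard.
CHAT_URL = "https://api.deepinfra.com/v1/openai/chat/completions"

def build_chat_request(token: str, model_name: str,
                       prompt: str) -> urllib.request.Request:
    """Prepare an OpenAI-style ChatCompletions request. model_name must
    match the deployment's model_name parameter."""
    payload = {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        CHAT_URL,
        data=json.dumps(payload).encode(),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

chat_req = build_chat_request("YOUR_API_TOKEN", "my-llama-ft", "Hello!")
# urllib.request.urlopen(chat_req) would return an OpenAI-style JSON response.
print(chat_req.get_method())  # POST
```

Because the route is OpenAI-compatible, the official OpenAI client libraries should also work if pointed at this base URL.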
You can query the deployment status by deploy_id even before the model is running.
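A status poll by deploy_id might look like the sketch below, assuming a GET on the same per-deployment route that the documented DELETE call uses (the GET route is an assumption):

```python
import urllib.request

def build_status_request(token: str, deploy_id: str) -> urllib.request.Request:
    """Prepare a status query for one deployment. GET on /deploy/DEPLOY_ID
    is assumed to mirror the documented DELETE route."""
    return urllib.request.Request(
        f"https://api.deepinfra.com/deploy/{deploy_id}",
        method="GET",
        headers={"Authorization": f"Bearer {token}"},
    )

status_req = build_status_request("YOUR_API_TOKEN", "dep-123")
# urllib.request.urlopen(status_req) would return the deployment's state.
print(status_req.full_url)  # https://api.deepinfra.com/deploy/dep-123
```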
## Update scaling settings

## Delete a deployment
- Use the trash icon in Dashboard → Deployments
- Or:
DELETE https://api.deepinfra.com/deploy/DEPLOY_ID
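The documented DELETE call can be issued from code as well; here is a minimal sketch (the Bearer-token Authorization header is an assumption):

```python
import urllib.request

def build_delete_request(token: str, deploy_id: str) -> urllib.request.Request:
    """Prepare the documented DELETE /deploy/DEPLOY_ID call."""
    return urllib.request.Request(
        f"https://api.deepinfra.com/deploy/{deploy_id}",
        method="DELETE",
        headers={"Authorization": f"Bearer {token}"},
    )

delete_req = build_delete_request("YOUR_API_TOKEN", "dep-123")
# urllib.request.urlopen(delete_req) would permanently delete the deployment.
print(delete_req.get_method(), delete_req.full_url)
```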
## Limitations
- 4 GPU limit per user (e.g., 4×1 GPU or 1×4 GPU). Contact us for more.
- GPU availability is not guaranteed during scale-up; you're only billed for what runs
- Billing happens weekly in a separate invoice
- Quantization is not currently supported (in progress)
- deploy_id may not be immediately available while the model is deploying