DeepInfra allows you to deploy your own models on dedicated infrastructure — your weights, your endpoint, your isolation.

Why run private models?

  • Compliance — data stays on dedicated infrastructure, not shared with other users
  • Custom weights — deploy fine-tuned or trained-from-scratch models
  • Predictable latency — no sharing with other users means consistent response times
  • Autoscaling — scale from 0 to many instances automatically based on load
  • Competitive GPU pricing — some of the lowest per-GPU-hour rates available, with no lock-in
  • Simple deployment — up and running in just a couple of clicks from the dashboard

What you can deploy

Private deployments come in three types: Custom LLM (your own model weights served on dedicated GPUs), LoRA (a fine-tuned adapter applied to a base model), and LoRA Image (an adapter for image-generation models).

GPU options

Private model deployments run on:
  • A100-80GB — proven workhorse for LLM inference, great value
  • H100-80GB — fast and widely supported
  • H200-141GB — large HBM3e memory, ideal for big models
  • B200-180GB — NVIDIA Blackwell, significantly faster for inference workloads
  • B300-288GB — latest NVIDIA Blackwell Ultra, highest performance available
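When choosing a GPU, the main constraint is whether the model's weights fit in memory. A rough back-of-envelope check, assuming fp16/bf16 weights (2 bytes per parameter) and reserving ~20% of VRAM for KV cache and activations (both figures are illustrative assumptions, not DeepInfra guidance):

```python
# Rough sizing check: will a model's weights fit on a single GPU?
# Assumes 2 bytes/parameter (fp16/bf16) and a 20% overhead reserve
# for KV cache and activations. Illustrative only.

def fits_on_gpu(params_billions: float, gpu_memory_gb: float,
                bytes_per_param: int = 2, overhead_fraction: float = 0.2) -> bool:
    weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes = GB
    usable_gb = gpu_memory_gb * (1 - overhead_fraction)
    return weights_gb <= usable_gb

print(fits_on_gpu(8, 80))     # 8B model, A100-80GB: 16 GB <= 64 GB -> True
print(fits_on_gpu(70, 80))    # 70B model, A100-80GB: 140 GB > 64 GB -> False
print(fits_on_gpu(70, 180))   # 70B model, B200-180GB: 140 GB <= 144 GB -> True
```

Models that fail this check on one GPU can still be deployed by sharding across multiple GPUs.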

Pricing model

Unlike shared inference (pay per token), private deployments are billed per GPU-hour. You pay for the time your GPUs are running, regardless of traffic.
Leaving a deployment running by mistake can rack up costs quickly: at an example rate of $2 per GPU-hour, forgetting to shut down a 2-GPU deployment over a weekend (64 hours) costs 2 × 64 × $2 ≈ $256. Always set spending limits in payment settings.
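The billing arithmetic above can be sketched in a few lines. The $2/GPU-hour rate is inferred from the ~$256 weekend figure; actual rates vary by GPU type:

```python
# GPU-hour billing: cost = GPUs running * hours running * hourly rate.
# Rate of $2/GPU-hour is an illustrative assumption.

def deployment_cost(num_gpus: int, hours: float, rate_per_gpu_hour: float) -> float:
    return num_gpus * hours * rate_per_gpu_hour

# The weekend-mistake example: 2 GPUs left up for 64 hours.
print(deployment_cost(2, 64, 2.00))  # -> 256.0
```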

Getting started

  1. Go to Dashboard → Deployments
  2. Click New Deployment
  3. Choose your deployment type (Custom LLM, LoRA, or LoRA Image)
  4. Fill in the configuration and deploy
See the specific guides for each deployment type.