DeepInfra allows you to deploy your own models on dedicated infrastructure — your weights, your endpoint, your isolation.

Why run private models?

  • Compliance — data stays on dedicated infrastructure, not shared with other users
  • Custom weights — deploy fine-tuned or trained-from-scratch models
  • Predictable latency — no sharing with other users means consistent response times
  • Autoscaling — scale from 0 to many instances automatically based on load
  • Competitive GPU pricing — some of the lowest per-GPU-hour rates available, with no lock-in
  • Simple deployment — up and running in just a couple of clicks from the dashboard

What you can deploy

Custom LLMs

Deploy any Hugging Face LLM on A100/H100/H200/B200/B300 GPUs with the OpenAI-compatible API.
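Once deployed, a private LLM is called the same way as the shared models. A minimal sketch using the official openai Python client; the deployment name my-org/my-custom-llm is a placeholder for whatever name you give your deployment:

```python
from openai import OpenAI

# DeepInfra exposes an OpenAI-compatible endpoint; point the client at it.
client = OpenAI(
    api_key="<DEEPINFRA_TOKEN>",  # your DeepInfra API token
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="my-org/my-custom-llm",  # placeholder: the name of your private deployment
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```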

LoRA Adapters

Deploy LoRA fine-tuned language models on top of supported base models.

LoRA Image Models

Deploy image-generation LoRA adapters sourced from Civitai.

GPU options

Private model deployments run on:
  • A100-80GB — proven workhorse for LLM inference, great value
  • H100-80GB — fast and widely supported
  • H200-141GB — large HBM3e memory, ideal for big models
  • B200-180GB — NVIDIA Blackwell, significantly faster for inference workloads
  • B300-288GB — latest NVIDIA Blackwell Ultra, highest performance available

Pricing model

Unlike shared inference, which is billed per token, private deployments are billed per GPU-hour: you pay for the time your GPUs are running, regardless of traffic.
Leaving a deployment running by mistake can rack up costs quickly. For example, forgetting to shut down a 2-GPU deployment over a weekend (64 hours) at roughly $2 per GPU-hour costs 2 × 64 × $2 = ~$256 USD. Always set spending limits in your payment settings.
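As a sanity check, the cost is simply GPUs × hours × hourly rate. A quick sketch; the $2.00/GPU-hour figure is illustrative, matching the example above, and actual rates depend on the GPU type you choose:

```python
# Private deployment cost: number of GPUs x hours running x per-GPU-hour rate.
def deployment_cost(num_gpus: int, hours: float, rate_per_gpu_hour: float) -> float:
    return num_gpus * hours * rate_per_gpu_hour

# The weekend example: 2 GPUs for 64 hours at an assumed $2.00/GPU-hour.
print(deployment_cost(2, 64, 2.00))  # 256.0
```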

Getting started

  1. Go to Dashboard → Deployments
  2. Click New Deployment
  3. Choose your deployment type (Custom LLM, LoRA, or LoRA Image)
  4. Fill in the configuration and deploy
See the specific guides for each deployment type: