
Hosting Open-Source LLMs: vLLM on a $20/mo Box, Real Benchmarks

By OrionAI Build Editorial · Published 2026-05-10

Most "self-host an LLM" guides start with a $2,000/month assumption. They don't have to. Here's a working setup running a small open-weight model on rental GPU starting around $20/month, with real numbers for throughput, latency, and the things that go wrong.

What "cheap" means here

I'm targeting models in the 1B-3B parameter range: Gemma 3 1B, Phi-4 mini, Qwen 2.5 1.5B. These aren't frontier models. They're cost-effective for narrow tasks: classification, structured extraction, prompt routing, FAQ answering. For anything reasoning-heavy you'll route to a frontier API anyway.
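Here's a sketch of what that split looks like in practice. It assumes the vLLM server described below is already up on localhost:8000; the task labels and the frontier model name are placeholders, not part of the setup:

# Route narrow tasks to the cheap local model, everything else to a frontier API.
# Endpoint, task labels, and the frontier model name are illustrative.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
frontier = OpenAI()                        # reads OPENAI_API_KEY from the environment

NARROW_TASKS = {"classify", "extract", "route", "faq"}

def complete(task: str, prompt: str) -> str:
    if task in NARROW_TASKS:
        client, model = local, "unsloth/gemma-3-1b-it"
    else:
        client, model = frontier, "gpt-4o"  # placeholder frontier model
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content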

Hardware shape

For a 1B-3B model with vLLM, you need 6-8 GiB of VRAM. That's an entry-level consumer card or a cheap rental tier. The cheapest reliably-available rental I've used sits around $0.10–$0.20/hour for spot capacity. Run it 4-6 hours a day at the lower end of that price range and you're under $20/month.
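The arithmetic, spelled out with the same ranges (not quotes from any provider):

# Monthly cost for the spot-price and usage ranges quoted above.
for rate in (0.10, 0.15, 0.20):            # $/hour spot price
    for hours in (4, 6):                   # hours per day
        print(f"${rate:.2f}/h x {hours} h/day = ${rate * hours * 30:.2f}/month")
# The low end ($0.10-$0.15/h at 4 h/day) lands at $12-18/month; the top end is $36.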

vLLM config that ships

vllm serve unsloth/gemma-3-1b-it \
  --enforce-eager \
  --max-num-seqs 4 \
  --max-model-len 1024 \
  --gpu-memory-utilization 0.92 \
  --port 8000

Notes on each flag:

  --enforce-eager: skips CUDA graph capture and runs in eager mode. Saves a chunk of VRAM on a small card at the cost of a little per-token speed.
  --max-num-seqs 4: caps how many requests vLLM batches together, so KV-cache growth stays predictable in 6-8 GiB.
  --max-model-len 1024: maximum prompt-plus-output length. A shorter context means a smaller KV cache; raise it only if your prompts actually need it.
  --gpu-memory-utilization 0.92: the fraction of VRAM vLLM may claim for weights and KV cache. 0.92 leaves a small buffer for the rest of the process.
  --port 8000: where the OpenAI-compatible HTTP API listens.
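Once it's up, any OpenAI-compatible client can talk to port 8000. A quick streaming smoke test; the model name is whatever you passed to vllm serve, and the API key just has to be a non-empty string:

# Smoke test against the server above on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="unsloth/gemma-3-1b-it",
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
    stream=True,                           # tokens arrive as they are generated
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()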

Real benchmarks

Workload | Throughput | P95 latency
Single concurrent caller, 100 tokens out | ~85 tok/s | ~1.4s
4 concurrent callers, 100 tokens out | ~210 tok/s aggregate | ~2.1s
1 caller, 256 tokens out | ~80 tok/s | ~3.4s

These are warm-cache numbers. Cold start is dominated by model load — about 8 seconds.
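If you want to reproduce numbers like these, a rough harness is enough. This sketch assumes the server above on localhost:8000 and leans on the completion token counts the API reports; it's an approximation of the methodology, not the exact script behind the table:

# Fire N concurrent requests and report aggregate tok/s plus p95 latency.
import asyncio, statistics, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def one_request(max_tokens: int) -> tuple[float, int]:
    start = time.perf_counter()
    resp = await client.chat.completions.create(
        model="unsloth/gemma-3-1b-it",
        messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
        max_tokens=max_tokens,
    )
    return time.perf_counter() - start, resp.usage.completion_tokens

async def bench(concurrency: int, max_tokens: int, rounds: int = 20) -> None:
    latencies, tokens = [], 0
    wall_start = time.perf_counter()
    for _ in range(rounds):
        results = await asyncio.gather(*[one_request(max_tokens) for _ in range(concurrency)])
        latencies += [r[0] for r in results]
        tokens += sum(r[1] for r in results)
    wall = time.perf_counter() - wall_start
    p95 = statistics.quantiles(latencies, n=20)[-1]    # ~95th percentile
    print(f"{concurrency} callers: ~{tokens / wall:.0f} tok/s aggregate, p95 ~{p95:.2f}s")

asyncio.run(bench(concurrency=4, max_tokens=100))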

What goes wrong

  1. Spot instance pre-emption. The cheapest rentals can be reclaimed with little warning. Solve with retry-on-disconnect in the client plus an auto-reattach script (client-side sketch after this list).
  2. OOMs on burst. Cap concurrency; a queue in front is cheaper than a bigger box (same sketch).
  3. HuggingFace rate limits on first download. Prefer a build pipeline that bakes the model weights into a Docker layer, or pull once to a persistent volume.
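For items 1 and 2, most of the fix lives in the client. A minimal sketch, assuming the same local endpoint and model name as above; the back-off numbers are illustrative:

# Client-side mitigation for items 1 and 2: retry on dropped connections
# (spot pre-emption, server restarts) and cap in-flight requests so a burst
# queues in the client instead of OOMing the box.
import asyncio
from openai import AsyncOpenAI, APIConnectionError

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MAX_IN_FLIGHT = asyncio.Semaphore(4)       # match --max-num-seqs on the server

async def complete_with_retry(prompt: str, retries: int = 5) -> str:
    async with MAX_IN_FLIGHT:              # burst control: excess callers wait here
        for attempt in range(retries):
            try:
                resp = await client.chat.completions.create(
                    model="unsloth/gemma-3-1b-it",
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=128,
                )
                return resp.choices[0].message.content
            except APIConnectionError:
                # server gone (pre-empted or restarting): back off, then retry
                await asyncio.sleep(min(2 ** attempt, 30))
    raise RuntimeError("local model unreachable after retries")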

When this is the wrong answer

If your traffic is <500 requests/day, just use an API. Self-hosting at low utilisation is more expensive once you count your time. The crossover for me sits around 5,000+ requests/day for narrow-task workloads.
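A crude way to find your own crossover, ignoring your time; both prices below are placeholders to substitute with your actual API and GPU rates:

# Compute-only break-even between an API and the self-hosted box.
api_cost_per_request = 0.0005     # placeholder: ~500 tokens at ~$1 per million tokens
gpu_cost_per_day = 0.15 * 6       # $0.15/h spot for 6 h/day, from the hardware section

print(f"Compute-only break-even: ~{gpu_cost_per_day / api_cost_per_request:.0f} requests/day")

Compute alone usually breaks even well below 5,000 requests/day; the rest of the gap is the time you spend keeping the box alive.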
