Hosting Open-Source LLMs: vLLM on a $20/mo Box, Real Benchmarks
Most "self-host an LLM" guides start with a $2,000/month assumption. They don't have to. Here's a working setup running a small open-weight model on rental GPU starting around $20/month, with real numbers for throughput, latency, and the things that go wrong.
What "cheap" means here
I'm targeting models in roughly the 1B-3B parameter range: Gemma 3 1B, Phi-4 mini, Qwen 2.5 1.5B. These aren't frontier models. They're cost-effective for narrow tasks: classification, structured extraction, prompt routing, FAQ answering. For anything reasoning-heavy you'll route to a frontier API anyway.
Hardware shape
For a 1B-3B model with vLLM, you need 6-8 GiB of VRAM. That's an entry-level consumer card or a cheap rental tier. The cheapest reliably-available rental I've used sits around $0.10–$0.20/hour for spot capacity. Run it 4-6 hours a day for personal use and you're looking at roughly $12–$36/month, under $20 if you stay near the bottom of that rate range.
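To make the cost claim concrete, here's the arithmetic over the quoted rate and usage ranges (a 30-day month is assumed):

```python
# Monthly cost over the quoted spot-rate and usage ranges (30-day month assumed).
for rate in (0.10, 0.20):      # $/hour, spot tier
    for hours in (4, 6):       # hours per day of actual use
        print(f"${rate:.2f}/hr * {hours} h/day * 30 days = ${rate * hours * 30:.0f}/month")
```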
vLLM config that ships
```bash
vllm serve unsloth/gemma-3-1b-it \
  --enforce-eager \
  --max-num-seqs 4 \
  --max-model-len 1024 \
  --gpu-memory-utilization 0.92 \
  --port 8000
```
Notes on each flag:
- `--enforce-eager` avoids CUDA-graph capture and warmup. Slightly slower per request, but near-instant cold start, which matters more on a cheap intermittent box.
- `--max-num-seqs 4` is realistic for 8 GiB of VRAM. Going higher gets you OOMs under burst.
- `--max-model-len 1024` matches what tiny-model workloads actually need. Don't budget KV cache for a 32k context you'll never use.
- `--gpu-memory-utilization 0.92` caps vLLM's share of VRAM (weights plus KV cache) at 92%, leaving headroom for the CUDA context and anything else running on the card.
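Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using the OpenAI Python SDK; the base URL and model name assume the command above running on localhost:8000:

```python
# Minimal smoke test against the vLLM OpenAI-compatible endpoint.
# Assumes the `vllm serve` command above is running on localhost:8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # point the SDK at the box, not api.openai.com
    api_key="not-needed",                 # vLLM ignores the key unless you pass --api-key
)

resp = client.chat.completions.create(
    model="unsloth/gemma-3-1b-it",        # must match the model vLLM was started with
    messages=[{"role": "user", "content": "Classify the sentiment: 'great service, slow shipping'"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```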
Real benchmarks
| Workload | Throughput | P95 latency |
|---|---|---|
| Single concurrent caller, 100 tokens out | ~85 tok/s | ~1.4s |
| 4 concurrent callers, 100 tokens out | ~210 tok/s aggregate | ~2.1s |
| 1 caller, 256 tokens out | ~80 tok/s | ~3.4s |
These are warm-cache numbers. Cold start is dominated by model load — about 8 seconds.
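If you want to sanity-check numbers like these on your own box, a rough load-generator sketch: fire a fixed number of concurrent chat completions, count completion tokens, and take the 95th-percentile wall-clock latency. The concurrency, prompt, and token budget below are assumptions, not the exact harness behind the table.

```python
# Rough load generator: N concurrent callers, fixed output budget,
# reports aggregate tokens/sec and P95 request latency.
# Assumes the vllm serve command above on localhost:8000.
import asyncio
import statistics
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def one_request() -> tuple[float, int]:
    t0 = time.perf_counter()
    resp = await client.chat.completions.create(
        model="unsloth/gemma-3-1b-it",
        messages=[{"role": "user", "content": "Summarise: the cat sat on the mat."}],
        max_tokens=100,
    )
    return time.perf_counter() - t0, resp.usage.completion_tokens

async def run(concurrency: int = 4, rounds: int = 25) -> None:
    latencies, tokens = [], 0
    wall_start = time.perf_counter()
    for _ in range(rounds):  # `rounds` batches of `concurrency` parallel calls
        results = await asyncio.gather(*(one_request() for _ in range(concurrency)))
        latencies += [r[0] for r in results]
        tokens += sum(r[1] for r in results)
    wall = time.perf_counter() - wall_start
    p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th-percentile latency
    print(f"{tokens / wall:.0f} tok/s aggregate, P95 {p95:.2f}s over {len(latencies)} requests")

asyncio.run(run())
```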
What goes wrong
- Spot instance pre-emption. The cheapest rentals can be reclaimed. Solve with retry-on-disconnect in the client plus an auto-reattach script; a minimal client-side version is sketched after this list.
- OOMs on burst. Cap concurrency. A queue in front is cheaper than a bigger box; the same sketch below caps in-flight requests with a semaphore.
- HuggingFace rate limits on the first download. Prefer a build pipeline that bakes the model weights into a Docker layer, or pull once to a persistent volume (second sketch below).
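For the first two failure modes, a thin wrapper around the client goes a long way: a semaphore keeps the box from ever seeing more than `--max-num-seqs` concurrent calls, and connection errors from a reclaimed or restarting spot box are retried with backoff. A minimal sketch; the retry count and backoff values are arbitrary assumptions.

```python
# Client-side guard rails for a cheap spot box:
# - a semaphore caps concurrency to match --max-num-seqs
# - connection errors (e.g. spot pre-emption mid-request) are retried with backoff
import asyncio

from openai import APIConnectionError, APITimeoutError, AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MAX_IN_FLIGHT = asyncio.Semaphore(4)  # match --max-num-seqs 4

async def complete(prompt: str, retries: int = 5, backoff: float = 2.0) -> str:
    async with MAX_IN_FLIGHT:  # queue locally instead of overloading the box
        for attempt in range(retries):
            try:
                resp = await client.chat.completions.create(
                    model="unsloth/gemma-3-1b-it",
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=100,
                )
                return resp.choices[0].message.content
            except (APIConnectionError, APITimeoutError):
                # Box is gone or restarting (spot reclaim, cold start): wait, then retry.
                await asyncio.sleep(backoff * (attempt + 1))
        raise RuntimeError("endpoint unavailable after retries")
```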
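For the download problem, pulling the weights once onto a persistent volume means a fresh box starts from local disk instead of hitting the Hub. A sketch with `huggingface_hub`; the `/data/models` mount point is a hypothetical path.

```python
# One-off pull of the model weights to a persistent volume, so new spot
# instances start from local disk instead of re-downloading from Hugging Face.
# /data/models is a hypothetical mount point for the persistent volume.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="unsloth/gemma-3-1b-it",
    local_dir="/data/models/gemma-3-1b-it",
)
print(f"weights cached at {path}")
```

You can then start vLLM against the local path (`vllm serve /data/models/gemma-3-1b-it`) and the box never touches the Hub on boot.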
When this is the wrong answer
If your traffic is <500 requests/day, just use an API. Self-hosting at low utilisation is more expensive once you count your time. The crossover for me sits around 5,000+ requests/day for narrow-task workloads.
Stack summary
- Model: Gemma 3 1B (the Unsloth release, which keeps later fine-tuning with Unsloth straightforward).
- Server: vLLM, OpenAI-compatible endpoint.
- Box: 8 GiB VRAM rental, $0.10–$0.20/hr spot tier.
- Client: any OpenAI SDK, base URL pointed at the box.