Hosting Open-Source LLMs: vLLM on a $20/mo Box, Real Benchmarks
Most "self-host an LLM" guides start with a $2,000/month assumption. They don't have to. Here's a working setup running a small open-weight model on rental GPU starting around $20/month, with real numbers for throughput, latency, and the things that go wrong.
What "cheap" means here
I'm targeting models in roughly the 1B-3B parameter range: Gemma 3 1B, Phi-4 mini, Qwen 2.5 1.5B. These aren't frontier models. They're cost-effective for narrow tasks: classification, structured extraction, prompt routing, FAQ answering. For anything reasoning-heavy you'll route to a frontier API anyway.
Hardware shape
For a 1B-3B model with vLLM, you need 6-8 GiB of VRAM. That's an entry-level consumer card or a cheap rental tier. The cheapest reliably-available rental I've used sits around $0.10–$0.20/hour for spot capacity. Run it 4-6 hours a day for personal use and you're looking at roughly $12–$36/month, under $20 if you stay near the bottom of that rate range.
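To make the cost claim concrete, here's the arithmetic over the quoted rate and usage ranges (a 30-day month is assumed):

```python
# Monthly cost over the quoted spot-rate and usage ranges (30-day month assumed).
for rate in (0.10, 0.20):      # $/hour, spot tier
    for hours in (4, 6):       # hours per day of actual use
        print(f"${rate:.2f}/hr * {hours} h/day * 30 days = ${rate * hours * 30:.0f}/month")
```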
vLLM config that ships
```bash
vllm serve unsloth/gemma-3-1b-it \
  --enforce-eager \
  --max-num-seqs 4 \
  --max-model-len 1024 \
  --gpu-memory-utilization 0.92 \
  --port 8000
```
Notes on each flag:
- `--enforce-eager` avoids CUDA-graph capture and warmup. Slightly slower per request, but near-instant cold start, which matters more on a cheap intermittent box.
- `--max-num-seqs 4` is realistic for 8 GiB of VRAM. Going higher gets you OOMs under burst.
- `--max-model-len 1024` matches what tiny-model workloads actually need. Don't budget KV cache for a 32k context you'll never use.
- `--gpu-memory-utilization 0.92` caps vLLM's share of VRAM (weights plus KV cache) at 92%, leaving headroom for the CUDA context and anything else running on the card.
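Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using the OpenAI Python SDK; the base URL and model name assume the command above running on localhost:8000:

```python
# Minimal smoke test against the vLLM OpenAI-compatible endpoint.
# Assumes the `vllm serve` command above is running on localhost:8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # point the SDK at the box, not api.openai.com
    api_key="not-needed",                 # vLLM ignores the key unless you pass --api-key
)

resp = client.chat.completions.create(
    model="unsloth/gemma-3-1b-it",        # must match the model vLLM was started with
    messages=[{"role": "user", "content": "Classify the sentiment: 'great service, slow shipping'"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```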
Real benchmarks
| Workload | Throughput | P95 latency |
|---|---|---|
| Single concurrent caller, 100 tokens out | ~85 tok/s | ~1.4s |
| 4 concurrent callers, 100 tokens out | ~210 tok/s aggregate | ~2.1s |
| 1 caller, 256 tokens out | ~80 tok/s | ~3.4s |
These are warm-cache numbers. Cold start is dominated by model load — about 8 seconds.
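If you want to sanity-check numbers like these on your own box, a rough load-generator sketch: fire a fixed number of concurrent chat completions, count completion tokens, and take the 95th-percentile wall-clock latency. The concurrency, prompt, and token budget below are assumptions, not the exact harness behind the table.

```python
# Rough load generator: N concurrent callers, fixed output budget,
# reports aggregate tokens/sec and P95 request latency.
# Assumes the vllm serve command above on localhost:8000.
import asyncio
import statistics
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def one_request() -> tuple[float, int]:
    t0 = time.perf_counter()
    resp = await client.chat.completions.create(
        model="unsloth/gemma-3-1b-it",
        messages=[{"role": "user", "content": "Summarise: the cat sat on the mat."}],
        max_tokens=100,
    )
    return time.perf_counter() - t0, resp.usage.completion_tokens

async def run(concurrency: int = 4, rounds: int = 25) -> None:
    latencies, tokens = [], 0
    wall_start = time.perf_counter()
    for _ in range(rounds):  # `rounds` batches of `concurrency` parallel calls
        results = await asyncio.gather(*(one_request() for _ in range(concurrency)))
        latencies += [r[0] for r in results]
        tokens += sum(r[1] for r in results)
    wall = time.perf_counter() - wall_start
    p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th-percentile latency
    print(f"{tokens / wall:.0f} tok/s aggregate, P95 {p95:.2f}s over {len(latencies)} requests")

asyncio.run(run())
```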
What goes wrong
- Spot instance pre-emption. The cheapest rentals can be reclaimed. Solve with retry-on-disconnect in the client plus an auto-reattach script; a minimal client-side version is sketched after this list.
- OOMs on burst. Cap concurrency. A queue in front is cheaper than a bigger box; the same sketch below caps in-flight requests with a semaphore.
- HuggingFace rate limits on the first download. Prefer a build pipeline that bakes the model weights into a Docker layer, or pull once to a persistent volume (second sketch below).
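For the first two failure modes, a thin wrapper around the client goes a long way: a semaphore keeps the box from ever seeing more than `--max-num-seqs` concurrent calls, and connection errors from a reclaimed or restarting spot box are retried with backoff. A minimal sketch; the retry count and backoff values are arbitrary assumptions.

```python
# Client-side guard rails for a cheap spot box:
# - a semaphore caps concurrency to match --max-num-seqs
# - connection errors (e.g. spot pre-emption mid-request) are retried with backoff
import asyncio

from openai import APIConnectionError, APITimeoutError, AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MAX_IN_FLIGHT = asyncio.Semaphore(4)  # match --max-num-seqs 4

async def complete(prompt: str, retries: int = 5, backoff: float = 2.0) -> str:
    async with MAX_IN_FLIGHT:  # queue locally instead of overloading the box
        for attempt in range(retries):
            try:
                resp = await client.chat.completions.create(
                    model="unsloth/gemma-3-1b-it",
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=100,
                )
                return resp.choices[0].message.content
            except (APIConnectionError, APITimeoutError):
                # Box is gone or restarting (spot reclaim, cold start): wait, then retry.
                await asyncio.sleep(backoff * (attempt + 1))
        raise RuntimeError("endpoint unavailable after retries")
```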
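For the download problem, pulling the weights once onto a persistent volume means a fresh box starts from local disk instead of hitting the Hub. A sketch with `huggingface_hub`; the `/data/models` mount point is a hypothetical path.

```python
# One-off pull of the model weights to a persistent volume, so new spot
# instances start from local disk instead of re-downloading from Hugging Face.
# /data/models is a hypothetical mount point for the persistent volume.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="unsloth/gemma-3-1b-it",
    local_dir="/data/models/gemma-3-1b-it",
)
print(f"weights cached at {path}")
```

You can then start vLLM against the local path (`vllm serve /data/models/gemma-3-1b-it`) and the box never touches the Hub on boot.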
When this is the wrong answer
If your traffic is <500 requests/day, just use an API. Self-hosting at low utilisation is more expensive once you count your time. The crossover for me sits around 5,000+ requests/day for narrow-task workloads.
Stack summary
- Model: Gemma 3 1B (the Unsloth release, which keeps later fine-tuning with Unsloth straightforward).
- Server: vLLM, OpenAI-compatible endpoint.
- Box: 8 GiB VRAM rental, $0.10–$0.20/hr spot tier.
- Client: any OpenAI SDK, base URL pointed at the box.