Cost-Engineering an LLM App: From $400/day to $60/day
The starting point
Our client runs a multilingual customer‑support assistant that fields roughly 12 000 user messages per day. The initial implementation was a straight‑through call to Anthropic’s claude‑3‑sonnet‑20240229 model, with a static 1 800‑token system prompt and the last 20 turns of conversation attached to every request. No caching, no routing, no token‑budgeting. At the time the cost calculator in the provider console showed about $0.032 per 1 000 input tokens and $0.044 per 1 000 output tokens. With an average of 250 input tokens and 150 output tokens per turn, the daily bill hovered around $400. The product met the SLA for latency and accuracy, but the margin was razor‑thin.
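For reference, the baseline request looked roughly like the sketch below (Anthropic Python SDK; the system-prompt string and the max_tokens cap are illustrative placeholders, not the client's production values):

```python
import anthropic

client = anthropic.Anthropic()

# Placeholder for the static ~1,800-token support-assistant instructions.
SYSTEM_PROMPT = "You are a multilingual customer-support assistant. ..."

def answer(conversation: list[dict], user_message: str) -> str:
    # Baseline: resend the full system prompt plus the last 20 turns on every call.
    history = conversation[-20:]
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=512,  # illustrative output cap
        system=SYSTEM_PROMPT,
        messages=history + [{"role": "user", "content": user_message}],
    )
    return response.content[0].text
```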
Change 1 — small‑model‑first cascade
We profiled the traffic and found that 60 % of inbound messages were simple intent classifications: “I want a refund”, “where’s my order?”, “change my address”. Those are binary or ternary decisions that any 1‑B‑parameter model can handle. We fine‑tuned a mistral‑7b‑instruct checkpoint on a 10 k‑example intent dataset and deployed it on a managed inference endpoint (together.ai, $0.015 per 1 000 tokens). The routing layer inspected the user utterance, ran the small model, and only escalated to the frontier model when the confidence fell below 0.92 or when the request matched a “requires reasoning” tag.
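In outline, the routing layer looks something like the sketch below; classify_intent, small_model_answer, and frontier_answer are placeholder names for the managed 7‑B endpoint and the frontier-model call, and the "requires reasoning" tag set is illustrative:

```python
CONFIDENCE_THRESHOLD = 0.92
REQUIRES_REASONING = {"policy_question", "complaint_escalation"}  # illustrative tags

def route(user_message: str, conversation: list[dict]) -> str:
    # Step 1: cheap intent classification on the fine-tuned small model
    # (placeholder: a call to the managed 7-B endpoint).
    intent, confidence = classify_intent(user_message)

    # Step 2: escalate when the small model is unsure or the intent is
    # tagged as needing multi-step reasoning.
    if confidence < CONFIDENCE_THRESHOLD or intent in REQUIRES_REASONING:
        return frontier_answer(user_message, conversation)  # placeholder: frontier-model call

    # Step 3: answer directly from the small model for high-confidence, simple intents.
    return small_model_answer(intent, user_message)  # placeholder: templated/short answer
```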
Impact: frontier‑model calls dropped from 12 000 to ≈ 5 200 per day, a reduction of roughly 57 %. Because the small model's per‑token price is about half of the frontier model's, overall spend fell to roughly $220/day.
Change 2 — prompt caching
Anthropic's API supports prompt caching: a static prefix is stored server‑side for the lifetime of the cache, and subsequent calls that hit the cache pay only a small fraction of the normal input price for those tokens. Our system prompt was 1 800 tokens and unchanged across the entire product, so each cache hit saved close to 1 800 tokens × $0.032 per 1 000 tokens ≈ $0.058. With 5 200 frontier‑model calls per day and roughly 85 % of them hitting the cache, the daily saving was about 5 200 × $0.058 × 0.85 ≈ $255, bringing the frontier‑model spend down to $140.
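Marking the static prompt as cacheable is a small change to the request, continuing the baseline sketch above. Prompt caching is requested per call by attaching a cache_control block to the static prefix; model support and cache lifetimes vary, so check the provider docs for the model you run:

```python
response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,                   # the unchanged 1,800-token prompt
            "cache_control": {"type": "ephemeral"},  # ask the API to cache this prefix
        }
    ],
    messages=history + [{"role": "user", "content": user_message}],
)
```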
We also memoized the embedding of the system prompt locally to avoid the round‑trip for the cache‑key generation, shaving an additional 5 ms off latency.
Change 3 — context trimming by relevance
The original implementation shipped the last 20 turns (≈ 3 000 tokens) with every request. We built a lightweight relevance filter using sentence‑transformers/all‑mpnet‑base‑v2. For each new user message we computed the cosine similarity of its embedding against the embeddings of prior turns and kept only the turns scoring above a 0.75 threshold, capped at the top N. In practice N settled at 5–6 turns (≈ 800 tokens) for 92 % of conversations.
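A minimal version of the filter, using the sentence-transformers library; the 0.75 threshold and the 5–6 turn cap match the figures above, and the function name is ours:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def select_relevant_turns(user_message: str, prior_turns: list[str],
                          threshold: float = 0.75, max_turns: int = 6) -> list[str]:
    if not prior_turns:
        return []
    query_emb = encoder.encode(user_message, convert_to_tensor=True)
    turn_embs = encoder.encode(prior_turns, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, turn_embs)[0]  # cosine similarity per prior turn

    # Rank turns by similarity, keep those above the threshold, cap at max_turns.
    ranked = sorted(zip(scores.tolist(), prior_turns), key=lambda p: p[0], reverse=True)
    kept = {turn for score, turn in ranked[:max_turns] if score >= threshold}

    # Preserve chronological order so the model still sees a coherent dialogue.
    return [turn for turn in prior_turns if turn in kept]
```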
Result: the average input token count fell from 2 500 to 1 125, a 55 % reduction. Input tokens are cheaper than output tokens, so the dollar saving was smaller than the token cut alone suggests, but spend still fell to roughly $80/day without any measurable degradation on the held‑out evaluation set.
Change 4 — batching non‑real‑time workloads
Two background jobs were consuming a non‑trivial slice of the budget:
- Overnight quality scoring of 24 h conversation logs (embedding generation + classification).
- Weekly refresh of vector‑store embeddings for knowledge‑base articles.
Both jobs were originally executed via the same per‑token endpoint used for live traffic. We switched them to the batch endpoint offered by the provider, which charges 30 % less per token and allows us to send up to 10 000 requests in a single HTTP call. The batch jobs now cost roughly $15 per day, a modest but steady saving.
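Provider batch APIs differ in discounts and request limits, so the sketch below is only one illustration, using Anthropic's Message Batches endpoint for the overnight scoring job; scoring_prompts and handle are placeholders for the log-derived prompts and the downstream consumer:

```python
from anthropic import Anthropic

client = Anthropic()

# Submit the overnight quality-scoring prompts as a single batch job.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"quality-score-{i}",  # used to match results back to the logs
            "params": {
                "model": "claude-3-sonnet-20240229",
                "max_tokens": 64,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(scoring_prompts)  # placeholder: prompts built from 24h of logs
    ],
)

# Batches complete asynchronously; poll for completion, then stream the results.
status = client.messages.batches.retrieve(batch.id)
if status.processing_status == "ended":
    for result in client.messages.batches.results(batch.id):
        handle(result)  # placeholder: write scores back to the analytics store
```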
Change 5 — continuous evaluation gates
Every optimisation was gated by an automated regression suite. The suite consists of:
- A 2 000‑example held‑out set covering refunds, status checks, policy queries, and edge‑case escalations.
- Metrics: exact‑match intent accuracy, BLEU for free‑form responses, and a latency ceiling of 800 ms.
- A “quality delta” threshold of –2 % relative to the baseline.
Two candidate changes failed this gate: (a) aggressive summarisation of the last 20 turns using a 200‑token abstractive model, which cut latency but introduced a 4 % intent‑accuracy dip; (b) swapping the frontier model for a cheaper 70‑B competitor, which saved $10/day but caused a 3 % rise in hallucinations on policy questions. Both were rolled back and re‑engineered. The gate ensured that cost cuts never compromised the SLA.
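The gate itself reduces to comparing a candidate's metrics against the baseline run; a sketch of the decision rule is below, with illustrative metric values rather than our real ones:

```python
QUALITY_DELTA_FLOOR = -0.02   # a candidate may lose at most 2% relative quality
LATENCY_CEILING_MS = 800      # hard ceiling, independent of the baseline

def passes_gate(baseline: dict, candidate: dict) -> bool:
    # Relative change on each quality metric must stay above the floor.
    for metric in ("intent_accuracy", "bleu"):
        delta = (candidate[metric] - baseline[metric]) / baseline[metric]
        if delta < QUALITY_DELTA_FLOOR:
            return False
    return candidate["latency_ms"] <= LATENCY_CEILING_MS

# Illustrative numbers: a ~4% intent-accuracy dip breaches the -2% floor,
# so this candidate is rejected even though it is faster.
baseline = {"intent_accuracy": 0.94, "bleu": 0.41, "latency_ms": 650}
candidate = {"intent_accuracy": 0.90, "bleu": 0.42, "latency_ms": 520}
assert not passes_gate(baseline, candidate)
```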
Where the money landed
Aggregating the line‑item savings yields the following daily spend trajectory:
- Baseline (no optimisation): ≈ $400
- After small‑model cascade: ≈ $220
- After prompt caching: ≈ $140
- After context trimming: ≈ $80
- After batching: ≈ $60
The exact numbers will shift with traffic spikes, model price changes, or prompt rewrites, but the ordering of impact—cascade, caching, trimming, batching—has proved robust across three separate customer‑support deployments we have audited.
What didn’t help
We tried three ideas that either broke the product or delivered negligible savings:
- Switching to a “cheaper” frontier model. A 70‑B model priced at $0.028 per 1 000 input tokens reduced raw API cost by ~15 % but introduced edge‑case failures that required a fallback to the original model 18 % of the time, erasing the intended savings.
- Aggressive context compression. Summarising the full 20‑turn history to a 200‑token abstract looked promising on paper. In practice, the abstract omitted critical order IDs, leading to a 3 % drop in intent accuracy and a spike in escalations.
- Self‑hosting the small‑model layer too early. Running mistral‑7b‑instruct on a single 8‑GPU node (RunPod, $0.30 per GPU‑hour) cost $45/day in compute alone, plus ops overhead. Below ~500 k requests/month, a managed endpoint remains cheaper.
Scaling the pattern to other LLM products
The five‑step framework is portable:
- Traffic profiling. Quantify intent complexity, token distribution, and latency sensitivity. Tools like prometheus + grafana or provider‑specific usage dashboards give you the raw numbers.
- Model cascade design. Pair a cheap, fine‑tuned classifier with a high‑capability reasoning model. Keep the routing logic stateless so it can be hot‑reloaded without downtime.
- Prompt engineering for cacheability. Separate static system prompts from dynamic user context, and use the provider's prompt‑caching mechanism so the static prefix is billed at the discounted rate rather than resent in full.
- Context management. Replace "last N turns" heuristics with relevance‑based retrieval. Embedding stores such as pinecone or weaviate add negligible latency when indexed properly.
- Evaluation‑driven iteration. Automate regression testing and enforce a hard quality floor before any cost‑saving change reaches production.
When we applied the same pipeline to a real‑time code‑assistant (≈ 3 000 requests/day, average 500‑token code snippets) we saw a 70 % cost reduction while maintaining a 95 % pass rate on the HumanEval benchmark.
Practical checklist for immediate savings
If you need quick wins, run through this list in a single sprint:
- Enable prompt caching on any static system prompt.
- Fine‑tune a ≤ 2 B model on high‑frequency intents and route low‑confidence cases to the frontier model.
- Implement a relevance filter that caps context to the top‑3 most similar turns.
- Move all non‑real‑time token‑heavy jobs to a batch endpoint or a nightly cron.
- Set up an automated eval pipeline with a <2 % quality delta threshold.
Even if you adopt only two of these steps, you can expect a 30–40 % reduction in daily spend without sacrificing user experience.
This is part of the Build cornerstone series.