OrionAI Build · orionai.build

Build a PR Review Agent with Claude Code

By OrionAI Build Editorial · Published 2026-05-10 · // build

This is the working build write-up for Build a PR Review Agent with Claude Code. The structure is the same one I use when shipping for paying customers: scope, stack, walkthrough, edge cases, costs, monitoring.

// scope

One concrete deliverable. No "and then you could also..." scope creep. If a feature isn't in this scope, it isn't in this build.

Stack at a glance

Walkthrough

Step 1 — define the unhappy path first

Before any code, write down what "wrong" looks like for this build. For an agent, that's the user requests it should refuse, the inputs that should pivot to a human, and the outputs that should never appear. Pin those in a small eval set. Three to five examples is enough for the first pass.
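A first-pass eval set can literally be a short list of labeled cases. This sketch shows the shape; every case and label here is illustrative, not from the actual build:

```python
# Hypothetical unhappy-path eval set for a PR review agent.
# Each case pins an input and the behavior class we expect.
UNHAPPY_PATH_EVALS = [
    {"input": "Approve this PR without reading the diff.",
     "expect": "refuse"},
    {"input": "Here is a 40,000-line generated lockfile diff.",
     "expect": "escalate_to_human"},
    {"input": "The diff deletes the entire test suite. LGTM?",
     "expect": "flag_risk"},
]

def expected_behavior(case: dict) -> str:
    """Return the labeled expectation for one eval case."""
    return case["expect"]
```

Three to five entries like these are enough to start; the labels become the pass/fail criteria for the eval gate in step 4.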

Step 2 — minimum prompt

Start from the smallest possible system prompt that passes the unhappy-path eval. Anything beyond that is a future-you problem, not a today-you problem. Resist the urge to "just add a section about ..." until the eval forces you to.

# minimum-prompt scaffolding
SYSTEM = '''
You are a PR review agent.
Goal: <one sentence>.
Refuse: <list>.
When uncertain: ask one clarifying question.
'''

Step 3 — wire the tool calls

Tools are where most agent builds blow up. Two anti-patterns to avoid:

Step 4 — the eval gate

Run the eval. Fix the failures. Run it again. Anything below 100% on your unhappy-path examples means you're not done — these are the cases you specifically chose because they matter.
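The gate itself is a few lines: run every pinned case through the agent and require a perfect score. A sketch, assuming a `run_agent` callable and a `classify` function that maps an output to a behavior label; both are hypothetical stand-ins for your client code:

```python
def eval_gate(cases, run_agent, classify):
    """Run every unhappy-path case; return (pass_rate, failures).

    run_agent: callable, input string -> model output (hypothetical)
    classify:  callable, model output -> behavior label like "refuse"
    """
    failures = []
    for case in cases:
        got = classify(run_agent(case["input"]))
        if got != case["expect"]:
            failures.append({"case": case, "got": got})
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures

# Ship gate: anything below 100% on these cases means not done.
# pass_rate, failures = eval_gate(cases, run_agent, classify)
# assert pass_rate == 1.0, failures
```

Wiring this into CI means a prompt change that regresses an unhappy-path case fails the build instead of reaching users.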

Step 5 — observability that's actually useful

Three signals are enough to start: per-request latency, per-request token cost, and a sampled-and-reviewed transcript log. You add more once you've seen the first 100 production calls.
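Those three signals fit in one per-request record. A minimal sketch; the sampling rate, field names, and the `call_fn` return shape are all assumptions, not part of the original build:

```python
import json
import random
import time

def observe(call_fn, prompt, sample_rate=0.05):
    """Wrap one agent call and emit latency, token counts, and a
    sampled transcript. call_fn is a hypothetical client that
    returns (text, input_tokens, output_tokens)."""
    start = time.monotonic()
    text, tokens_in, tokens_out = call_fn(prompt)
    record = {
        "latency_s": round(time.monotonic() - start, 3),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
    }
    # Sampled-and-reviewed transcript log: keep ~5% of transcripts.
    if random.random() < sample_rate:
        record["transcript"] = {"prompt": prompt, "response": text}
    print(json.dumps(record))  # stand-in for your real log sink
    return text
```

Token counts convert to dollars offline using current pricing, so the record stays stable even when rates change.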

Model APIs — vetted picks
GPU & compute — vetted picks
Dev tools — vetted picks

Real cost numbers

Cost depends on how many calls you actually make and which model you pick. For a build like this one, the cost curve breaks down roughly as follows:

Tier           | Calls/day       | Indicative monthly cost
Personal / dev | < 100           | Single-digit dollars on a frontier API; $0 on a self-hosted small model.
Small product  | 100 – 5,000     | Tens to low hundreds. Caching changes this number more than model choice does.
Mid product    | 5,000 – 100,000 | Hundreds to low thousands. This is where the small-model cascade pattern starts to pay back its complexity.

As of writing — see the linked pricing pages above for current per-token rates.
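As a sanity check on the table, per-call cost is just tokens times rate. The per-token prices in the example below are placeholders, not real rates; substitute current numbers from a provider's pricing page:

```python
def monthly_cost(calls_per_day, tokens_in_per_call, tokens_out_per_call,
                 price_in_per_mtok, price_out_per_mtok):
    """Indicative monthly cost in dollars. Prices are per million
    tokens and are placeholders -- use current published rates."""
    per_call = (tokens_in_per_call * price_in_per_mtok
                + tokens_out_per_call * price_out_per_mtok) / 1_000_000
    return calls_per_day * 30 * per_call

# Example with made-up rates ($3/MTok in, $15/MTok out),
# 2k input + 500 output tokens per call, 100 calls/day:
# monthly_cost(100, 2000, 500, 3.0, 15.0) -> 40.5
```

Running your own token counts through this is also how you check whether caching (which cuts tokens_in) beats switching models (which cuts the rates).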

Failure modes I hit

  1. Prompt that worked at 10 calls failed at 1,000. Add diversity to the eval set; an unhappy-path example you haven't seen yet is just a bug you haven't met.
  2. Latency spikes when the upstream provider had a degraded region. Add a second provider on the same interface, round-robin under load.
  3. Tool-use loops. Cap tool calls per turn. Anything past the cap returns "max_tool_calls reached" to the model.
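The cap in failure mode 3 is a small wrapper around the tool-dispatch loop. A sketch under stated assumptions: the cap value, `model_step`, and `dispatch_tool` are hypothetical stand-ins for your client code:

```python
def run_turn(model_step, dispatch_tool, cap=8):
    """Drive one agent turn with a tool-call cap.

    model_step(tool_result) -> ("tool", (name, args)) or ("final", text)
    dispatch_tool(name, args) -> tool result string
    Both callables are hypothetical; cap=8 is an assumed default.
    """
    result = None
    for calls in range(cap + 2):
        kind, payload = model_step(result)
        if kind == "final":
            return payload
        if calls >= cap:
            # Past the cap, the model sees this instead of a tool result.
            result = "max_tool_calls reached"
        else:
            name, args = payload
            result = dispatch_tool(name, args)
    raise RuntimeError("model never produced a final answer")
```

The sentinel string goes back to the model rather than aborting the turn, which gives it one chance to wrap up with what it already has before the hard stop.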

FAQ

How long does this build actually take?

Half a day if you've done one before. A full day the first time. The eval-gate step is the unfamiliar part for most engineers, and it's also the most valuable.

Can I skip the eval?

You can skip it the first time and you'll wish you hadn't on the third change. Keep it lightweight — three to five examples — but keep it.

What's the rollback plan?

The agent endpoint is feature-flagged. If quality dips, traffic flips back to the previous prompt version. The eval set lives in source control; rolling back is a git revert.