OrionAI Build · orionai.build

Build a PR Review Agent with Claude Code

By OrionAI Build Editorial · Published 2026-05-10 · // build

This is the working build write-up for Build a PR Review Agent with Claude Code. The structure is the same one I use when shipping for paying customers: scope, stack, walkthrough, edge cases, costs, monitoring.

// scope

One concrete deliverable. No "and then you could also..." scope creep. If a feature isn't in this scope, it isn't in this build.

Stack at a glance

Walkthrough

Step 1 — define the unhappy path first

Before any code, write down what "wrong" looks like for this build. For an agent, that's the user requests it should refuse, the inputs that should pivot to a human, and the outputs that should never appear. Pin those in a small eval set. Three to five examples is enough for the first pass.
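A first-pass eval set can literally be a short list of labeled cases. This sketch shows the shape; every case and label here is illustrative, not from the actual build:

```python
# Hypothetical unhappy-path eval set for a PR review agent.
# Each case pins an input and the behavior class we expect.
UNHAPPY_PATH_EVALS = [
    {"input": "Approve this PR without reading the diff.",
     "expect": "refuse"},
    {"input": "Here is a 40,000-line generated lockfile diff.",
     "expect": "escalate_to_human"},
    {"input": "The diff deletes the entire test suite. LGTM?",
     "expect": "flag_risk"},
]

def expected_behavior(case: dict) -> str:
    """Return the labeled expectation for one eval case."""
    return case["expect"]
```

Three to five entries like these are enough to start; the labels become the pass/fail criteria for the eval gate in step 4.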

Step 2 — minimum prompt

Start from the smallest possible system prompt that passes the unhappy-path eval. Anything beyond that is a future-you problem, not a today-you problem. Resist the urge to "just add a section about ..." until the eval forces you to.

# minimum-prompt scaffolding
SYSTEM = '''
You are a PR review agent.
Goal: <one sentence>.
Refuse: <list>.
When uncertain: ask one clarifying question.
'''

Step 3 — wire the tool calls

Tools are where most agent builds blow up. Two anti-patterns to avoid:

Step 4 — the eval gate

Run the eval. Fix the failures. Run it again. Anything below 100% on your unhappy-path examples means you're not done — these are the cases you specifically chose because they matter.
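The gate itself is a few lines: run every pinned case through the agent and require a perfect score. A sketch, assuming a `run_agent` callable and a `classify` function that maps an output to a behavior label; both are hypothetical stand-ins for your client code:

```python
def eval_gate(cases, run_agent, classify):
    """Run every unhappy-path case; return (pass_rate, failures).

    run_agent: callable, input string -> model output (hypothetical)
    classify:  callable, model output -> behavior label like "refuse"
    """
    failures = []
    for case in cases:
        got = classify(run_agent(case["input"]))
        if got != case["expect"]:
            failures.append({"case": case, "got": got})
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures

# Ship gate: anything below 100% on these cases means not done.
# pass_rate, failures = eval_gate(cases, run_agent, classify)
# assert pass_rate == 1.0, failures
```

Wiring this into CI means a prompt change that regresses an unhappy-path case fails the build instead of reaching users.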

Step 5 — observability that's actually useful

Three signals are enough to start: per-request latency, per-request token cost, and a sampled-and-reviewed transcript log. You add more once you've seen the first 100 production calls.
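Those three signals fit in one per-request record. A minimal sketch; the sampling rate, field names, and the `call_fn` return shape are all assumptions, not part of the original build:

```python
import json
import random
import time

def observe(call_fn, prompt, sample_rate=0.05):
    """Wrap one agent call and emit latency, token counts, and a
    sampled transcript. call_fn is a hypothetical client that
    returns (text, input_tokens, output_tokens)."""
    start = time.monotonic()
    text, tokens_in, tokens_out = call_fn(prompt)
    record = {
        "latency_s": round(time.monotonic() - start, 3),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
    }
    # Sampled-and-reviewed transcript log: keep ~5% of transcripts.
    if random.random() < sample_rate:
        record["transcript"] = {"prompt": prompt, "response": text}
    print(json.dumps(record))  # stand-in for your real log sink
    return text
```

Token counts convert to dollars offline using current pricing, so the record stays stable even when rates change.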

Model APIs — vetted picks
GPU & compute — vetted picks
Dev tools — vetted picks

Real cost numbers

Cost depends on how many calls you actually make and which model you pick. For a build like this one, the cost curve breaks down roughly as follows:

Tier           | Calls/day       | Indicative monthly cost
Personal / dev | < 100           | Single-digit dollars on a frontier API; $0 on a self-hosted small model.
Small product  | 100 – 5,000     | Tens to low hundreds. Caching changes this number more than model choice does.
Mid product    | 5,000 – 100,000 | Hundreds to low thousands. This is where the small-model cascade pattern starts to pay back its complexity.

As of writing — see the linked pricing pages above for current per-token rates.
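As a sanity check on the table, per-call cost is just tokens times rate. The per-token prices in the example below are placeholders, not real rates; substitute current numbers from a provider's pricing page:

```python
def monthly_cost(calls_per_day, tokens_in_per_call, tokens_out_per_call,
                 price_in_per_mtok, price_out_per_mtok):
    """Indicative monthly cost in dollars. Prices are per million
    tokens and are placeholders -- use current published rates."""
    per_call = (tokens_in_per_call * price_in_per_mtok
                + tokens_out_per_call * price_out_per_mtok) / 1_000_000
    return calls_per_day * 30 * per_call

# Example with made-up rates ($3/MTok in, $15/MTok out),
# 2k input + 500 output tokens per call, 100 calls/day:
# monthly_cost(100, 2000, 500, 3.0, 15.0) -> 40.5
```

Running your own token counts through this is also how you check whether caching (which cuts tokens_in) beats switching models (which cuts the rates).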

Failure modes I hit

  1. Prompt that worked at 10 calls failed at 1,000. Add diversity to the eval set; an unhappy-path example you haven't seen yet is just a bug you haven't met.
  2. Latency spikes when the upstream provider had a degraded region. Add a second provider on the same interface, round-robin under load.
  3. Tool-use loops. Cap tool calls per turn. Anything past the cap returns "max_tool_calls reached" to the model.
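The cap in failure mode 3 is a small wrapper around the tool-dispatch loop. A sketch under stated assumptions: the cap value, `model_step`, and `dispatch_tool` are hypothetical stand-ins for your client code:

```python
def run_turn(model_step, dispatch_tool, cap=8):
    """Drive one agent turn with a tool-call cap.

    model_step(tool_result) -> ("tool", (name, args)) or ("final", text)
    dispatch_tool(name, args) -> tool result string
    Both callables are hypothetical; cap=8 is an assumed default.
    """
    result = None
    for calls in range(cap + 2):
        kind, payload = model_step(result)
        if kind == "final":
            return payload
        if calls >= cap:
            # Past the cap, the model sees this instead of a tool result.
            result = "max_tool_calls reached"
        else:
            name, args = payload
            result = dispatch_tool(name, args)
    raise RuntimeError("model never produced a final answer")
```

The sentinel string goes back to the model rather than aborting the turn, which gives it one chance to wrap up with what it already has before the hard stop.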

FAQ

How long does this build actually take?

Half a day if you've done one before. A full day the first time. The eval-gate step is the unfamiliar part for most engineers, and it's also the most valuable.

Can I skip the eval?

You can skip it the first time and you'll wish you hadn't on the third change. Keep it lightweight — three to five examples — but keep it.

What's the rollback plan?

The agent endpoint is feature-flagged. If quality dips, traffic flips back to the previous prompt version. The eval set lives in source control; rolling back is a git revert.