# I Built a Claude Code Agent That Reviews PRs (Here's the Prompt)
I run a small consulting practice. Most of my client work is async PR review — five to twenty-five PRs a day, across four codebases, none of which I have full context on at any given moment. Last month I built a Claude Code agent that does the first review pass. Two weeks in, it has reviewed 312 PRs, caught 41 real issues, and shipped exactly one false positive that mattered (a security finding I'd have caught manually too). Here's the build.
## The shape of the problem
The job isn't "find every bug." It's "spot the things that consistently bite you in this kind of project." Different codebases have different consistent biters. Mine are: missing error paths in async handlers, secrets sneaking into commits, schema migrations without rollback, and load-bearing comments that say "TODO: remove" two years later.
## The system prompt
The whole agent is the prompt. Everything else is plumbing.
```
You are a PR review agent for {{REPO_NAME}}.
Your job: produce findings to be posted as ONE markdown comment summarising the PR.
For each finding, output: { severity: low|medium|high|critical, file:line, summary, evidence }.

Review priorities (in order):
1) security: secrets in commits, dependency CVEs, missing authn checks
2) data integrity: irreversible migrations without rollback, schema drift
3) error handling: async handlers without try/catch, raised exceptions without logging
4) load-bearing comments and TODOs older than 6 months

Refuse to review:
- PRs > 2000 lines (request a split)
- PRs touching files outside the standard tree (request human eyes)

Output schema (strict JSON):
{ "summary": str, "findings": [Finding], "blocker": bool }
Set blocker=true ONLY if a critical finding exists.
```
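For concreteness, here's what one review might emit under that schema — every value below is invented, but the shape is what the prompt demands:

```jsonc
// Hypothetical output for a PR with a single finding — all values invented.
{
  "summary": "1 finding: async webhook handler drops errors silently.",
  "findings": [
    {
      "severity": "high",
      "file:line": "api/webhooks.ts:42",
      "summary": "Async handler has no try/catch; a rejected promise disappears.",
      "evidence": "await processEvent(payload) is neither wrapped nor logged."
    }
  ],
  "blocker": false
}
```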
## The wiring
Three pieces:
- A GitHub Action triggered on `pull_request` events.
- A small script that fetches the diff, packages the PR description and changed-files context, and shells out to Claude Code with the system prompt above (sketched after the workflow below).
- A second step that posts the result as a sticky review comment and sets a check status if `blocker=true`.
```yaml
name: AI PR Review
on: pull_request
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - run: ./scripts/ai_review.sh "$GITHUB_PR_NUMBER"
        env:
          # The PR number isn't a default env var; pull it from the event.
          GITHUB_PR_NUMBER: ${{ github.event.pull_request.number }}
          # gh needs a token to post the review comment back.
          GH_TOKEN: ${{ github.token }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```
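The script is where the actual work happens. Here's a minimal sketch of what `ai_review.sh` could look like — assuming the `gh` CLI is on the runner and Claude Code's non-interactive print mode (`claude -p`); file names like `system_prompt.md` are placeholders, and the sticky-comment and guard logic is omitted:

```bash
#!/usr/bin/env bash
# Minimal sketch, not the production script. Assumes: gh authenticated via
# GH_TOKEN, Claude Code installed, system_prompt.md holding the prompt above.
set -euo pipefail

PR_NUMBER="$1"

# Context = PR title + description + full diff.
gh pr view "$PR_NUMBER" --json title,body \
  --jq '.title + "\n\n" + (.body // "")' > context.txt
gh pr diff "$PR_NUMBER" >> context.txt

# One-shot, non-interactive call to Claude Code in print mode.
claude -p "$(cat system_prompt.md context.txt)" > review.json

# Post the markdown summary back to the PR; blocker=true would also
# fail a check status here (left out of the sketch).
jq -r '.summary' review.json | gh pr comment "$PR_NUMBER" --body-file -
```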
## What broke first
The first version had no diff size cap. It happily started reviewing a 4,800-line generated migration and burned $3.40 on the response. Cap added. The second version reviewed a renamed file as if it were new code. Rename detection added. The third version flagged the same legitimate TODO twice in two different PRs because it had no memory between runs. Memory added: a small JSON file in the repo, `.ai-review-known-todos.json`, with TODOs the team has explicitly accepted (dedupe sketch below).
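The memory check is just a jq filter before the comment gets posted. A sketch, assuming each finding carries a `location` string ("file:line" — a field name invented for this example) and that the known-todos file is a flat JSON array of accepted locations:

```bash
# Hypothetical dedupe step: drop findings at locations the team has already
# accepted. Assumes .ai-review-known-todos.json = ["path/file.ts:88", ...]
jq --slurpfile known .ai-review-known-todos.json \
   '.findings |= map(select(.location as $l | ($known[0] | index($l)) | not))' \
   review.json > review.filtered.json
```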
## Cost reality
Per PR, the agent spends about $0.06 to $0.12 in input + output tokens (a small frontier model with prompt caching enabled). At 312 PRs the bill came in around $24. The first PR it caught — a leaked dev secret destined for a public artifact — would have cost more than that to clean up by itself.
## Where I wouldn't use it
Front-end visual reviews. Architectural reviews. Migrations involving cross-service contracts. Anything where the right move is "go ask three people for context" — that's not a job for any agent yet.
## What I'd change next
Right now it reviews from the diff and the changed files. The next move is feeding it the test files associated with the changed code, so it can compare claims-in-the-PR-description against tests-the-PR-actually-changes (a sketch of the gathering step follows). That's the next batch of false negatives I'm chasing.
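A plausible first cut, assuming a conventional `foo.test.ext` / `foo_test.ext` naming scheme — the convention is an assumption, not my repos' actual layout:

```bash
# Sketch: for every changed file, append a sibling test file to the review
# context if one exists under an assumed naming convention.
gh pr diff "$PR_NUMBER" --name-only | while read -r f; do
  base="${f%.*}"; ext="${f##*.}"
  for candidate in "${base}.test.${ext}" "${base}_test.${ext}"; do
    [ -f "$candidate" ] && cat "$candidate" >> context.txt
  done
done
```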
## Stack summary
| Component | Choice | Why |
|---|---|---|
| Model | Claude (frontier tier) | Best long-context reasoning on code I've used |
| Runner | Claude Code CLI in GitHub Actions | Simple shell-out, no SDK indirection |
| Caching | Anthropic prompt caching | Cuts repeat-prompt cost by ~85% across PRs in same repo |
| State | JSON file in repo | Versioned, reviewable, no extra DB |