# I Built a Claude Code Agent That Reviews PRs (Here's the Prompt)
I run a small consulting practice. Most of my client work is async PR review — five to twenty-five PRs a day, across four codebases, none of which I have full context on at any given moment. Last month I built a Claude Code agent that does the first review pass. Two weeks in, it has reviewed 312 PRs, caught 41 real issues, and shipped exactly one false positive that mattered (a security finding I'd have caught manually too). Here's the build.
## The shape of the problem
The job isn't "find every bug." It's "spot the things that consistently bite you in this kind of project." Different codebases have different consistent biters. Mine are: missing error paths in async handlers, secrets sneaking into commits, schema migrations without rollback, and load-bearing comments that say "TODO: remove" two years later.
## The system prompt
The whole agent is the prompt. Everything else is plumbing.
```
You are a PR review agent for {{REPO_NAME}}.
Your job: produce findings to be posted as ONE markdown comment summarising the PR.
For each finding, output: { severity: low|medium|high|critical, file:line, summary, evidence }.

Review priorities (in order):
1) security: secrets in commits, dependency CVEs, missing authn checks
2) data integrity: irreversible migrations without rollback, schema drift
3) error handling: async handlers without try/catch, raised exceptions without logging
4) load-bearing comments and TODOs older than 6 months

Refuse to review:
- PRs > 2000 lines (request a split)
- PRs touching files outside the standard tree (request human eyes)

Output schema (strict JSON):
{ "summary": str, "findings": [Finding], "blocker": bool }
Set blocker=true ONLY if a critical finding exists.
```
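For concreteness, here's what one review might emit under that schema — every value below is invented, but the shape is what the prompt demands:

```jsonc
// Hypothetical output for a PR with a single finding — all values invented.
{
  "summary": "1 finding: async webhook handler drops errors silently.",
  "findings": [
    {
      "severity": "high",
      "file:line": "api/webhooks.ts:42",
      "summary": "Async handler has no try/catch; a rejected promise disappears.",
      "evidence": "await processEvent(payload) is neither wrapped nor logged."
    }
  ],
  "blocker": false
}
```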
## The wiring
Three pieces:
- A GitHub Action triggered on `pull_request` events.
- A small script that fetches the diff, packages the PR description and changed-files context, and shells out to Claude Code with the system prompt above (sketched after the workflow below).
- A second step that posts the result as a sticky review comment and sets a check status if `blocker=true`.
```yaml
name: AI PR Review
on: pull_request
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - run: ./scripts/ai_review.sh "$GITHUB_PR_NUMBER"
        env:
          # The PR number isn't a default env var; pull it from the event.
          GITHUB_PR_NUMBER: ${{ github.event.pull_request.number }}
          # gh needs a token to post the review comment back.
          GH_TOKEN: ${{ github.token }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```
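The script is where the actual work happens. Here's a minimal sketch of what `ai_review.sh` could look like — assuming the `gh` CLI is on the runner and Claude Code's non-interactive print mode (`claude -p`); file names like `system_prompt.md` are placeholders, and the sticky-comment and guard logic is omitted:

```bash
#!/usr/bin/env bash
# Minimal sketch, not the production script. Assumes: gh authenticated via
# GH_TOKEN, Claude Code installed, system_prompt.md holding the prompt above.
set -euo pipefail

PR_NUMBER="$1"

# Context = PR title + description + full diff.
gh pr view "$PR_NUMBER" --json title,body \
  --jq '.title + "\n\n" + (.body // "")' > context.txt
gh pr diff "$PR_NUMBER" >> context.txt

# One-shot, non-interactive call to Claude Code in print mode.
claude -p "$(cat system_prompt.md context.txt)" > review.json

# Post the markdown summary back to the PR; blocker=true would also
# fail a check status here (left out of the sketch).
jq -r '.summary' review.json | gh pr comment "$PR_NUMBER" --body-file -
```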
## What broke first
The first version had no diff size cap. It happily started reviewing a 4,800-line generated migration and burned $3.40 on the response. Cap added. The second version reviewed a renamed file as if it were new code. Rename detection added. The third version flagged the same legitimate TODO twice in two different PRs because it had no memory between runs. Memory added: a small JSON file in the repo, `.ai-review-known-todos.json`, with TODOs the team has explicitly accepted (dedupe sketch below).
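The memory check is just a jq filter before the comment gets posted. A sketch, assuming each finding carries a `location` string ("file:line" — a field name invented for this example) and that the known-todos file is a flat JSON array of accepted locations:

```bash
# Hypothetical dedupe step: drop findings at locations the team has already
# accepted. Assumes .ai-review-known-todos.json = ["path/file.ts:88", ...]
jq --slurpfile known .ai-review-known-todos.json \
   '.findings |= map(select(.location as $l | ($known[0] | index($l)) | not))' \
   review.json > review.filtered.json
```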
## Cost reality
Per PR, the agent spends about $0.06 to $0.12 in input + output tokens (a small frontier model with prompt caching enabled). At 312 PRs the bill came in around $24. The first PR it caught — a leaked dev secret destined for a public artifact — would have cost more than that to clean up by itself.
## Where I wouldn't use it
Front-end visual reviews. Architectural reviews. Migrations involving cross-service contracts. Anything where the right move is "go ask three people for context" — that's not a job for any agent yet.
## What I'd change next
Right now it reviews from the diff and the changed files. The next move is feeding it the test files associated with the changed code, so it can compare claims-in-the-PR-description against tests-the-PR-actually-changes (a sketch of the gathering step follows). That's the next batch of false negatives I'm chasing.
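A plausible first cut, assuming a conventional `foo.test.ext` / `foo_test.ext` naming scheme — the convention is an assumption, not my repos' actual layout:

```bash
# Sketch: for every changed file, append a sibling test file to the review
# context if one exists under an assumed naming convention.
gh pr diff "$PR_NUMBER" --name-only | while read -r f; do
  base="${f%.*}"; ext="${f##*.}"
  for candidate in "${base}.test.${ext}" "${base}_test.${ext}"; do
    [ -f "$candidate" ] && cat "$candidate" >> context.txt
  done
done
```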
## Stack summary
| Component | Choice | Why |
|---|---|---|
| Model | Claude (frontier tier) | Best long-context reasoning on code I've used |
| Runner | Claude Code CLI in GitHub Actions | Simple shell-out, no SDK indirection |
| Caching | Anthropic prompt caching | Cuts repeat-prompt cost by ~85% across PRs in same repo |
| State | JSON file in repo | Versioned, reviewable, no extra DB |