When Code Review Starts Opening Pull Requests

  ·  6 min read

The original system could review a pull request, identify a nil dereference on line 47, explain the bug in plain English, and suggest a clean fix. Then it politely stopped and waited for a human to do the actual work.

Useful, yes. Agentic, not yet.

In the prior post, I covered the first version of the architecture: a Temporal workflow kicked off by a GitHub webhook, four specialized LLM reviewers running in parallel, and a synthesis step that produced a structured ReviewSummary.

This post picks up where that one stopped: turning review output into code changes, a branch, and a follow-up pull request, without waiting for someone to copy-paste the obvious bits by hand. The review phase is still straightforward multi-agent orchestration. After triage, the workflow decides at runtime how many fix workflows to spawn, and which findings are allowed anywhere near automation.

Triage Needs Two Axes #

My first pass at triage used severity as the gate.

  • Low and medium: auto-fix
  • High and critical: human review

That worked badly in both directions.

A high-severity “missing godoc on all exported functions” finding is boring, mechanical, and perfectly safe to patch automatically. A low-severity “auth token stored in a cookie without SameSite” finding might be subtle, security-sensitive, and a terrible place to let a model freestyle. Severity tells you how important a finding is. It does not tell you whether the fix should be automated.

So triage split into two separate dimensions:

  • Severity: how important is this finding?
  • Disposition: what should happen next?

The disposition values are simple:

  • autofix
  • review-required
  • information

Anything touching authentication, payments, cryptography, concurrency primitives, or API contracts is always marked review-required, regardless of severity. That is the actual policy. “We auto-fix medium issues” is how you end up explaining to security why a language model edited auth code because the number happened to be small.
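That override works best when it lives in code rather than in a prompt. A minimal sketch of the idea, with illustrative type and domain names that are assumptions, not the actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// Finding is a hypothetical shape for one triaged finding.
type Finding struct {
	Title       string
	Severity    string // "low", "medium", "high", "critical"
	Domain      string // e.g. "auth", "style", "payments"
	Disposition string // "autofix", "review-required", "information"
}

// Domains where automation is never allowed, regardless of severity.
// The list here is illustrative.
var protectedDomains = []string{"auth", "payments", "crypto", "concurrency", "api-contract"}

// enforcePolicy overrides whatever disposition the model proposed
// for findings in a protected domain. Deterministic code, not prompt
// instructions, holds the boundary.
func enforcePolicy(f Finding) Finding {
	for _, d := range protectedDomains {
		if strings.EqualFold(f.Domain, d) {
			f.Disposition = "review-required"
		}
	}
	return f
}

func main() {
	f := Finding{Title: "token cookie missing SameSite", Severity: "low", Domain: "auth", Disposition: "autofix"}
	fmt.Println(enforcePolicy(f).Disposition) // review-required, despite low severity
}
```

The point of the post-processing step is that the model's output is a suggestion; the policy is not.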

The TriageAgent takes the full set of findings from all reviewers and emits two actionable buckets: autofix and review-required. Everything after that flows from the split.

The Fix Phase Becomes Dynamic #

For each autofix finding, the parent workflow spawns a fix_finding child workflow rather than an activity, mainly to isolate retries, history, and failures per fix. If one fix generator times out, the parent can still collect the successful ones and move on.

Each fix_finding child workflow does two things:

  1. Read the target file from GitHub at the exact commit SHA.
  2. Call the FixGenerator with the file content, finding details, and fix instructions, then return a unified diff.

Reading by SHA rather than branch name turned out to be non-negotiable. An earlier version used the branch ref and quietly fell apart on fork PRs, rebases, deleted branches, and other bits of normal Git behavior. Branches are vibes. SHAs are facts.
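In the GitHub contents API, the difference is just what you pass as the ref query parameter: a commit SHA instead of a branch name. A self-contained sketch of the URL construction (repo and path values are illustrative):

```go
package main

import (
	"fmt"
	"net/url"
)

// contentsURL builds a GitHub contents API request pinned to an exact
// commit. Because ref is a SHA, the read is immutable: rebases,
// force-pushes, and deleted branches cannot change what comes back.
func contentsURL(owner, repo, path, sha string) string {
	return fmt.Sprintf(
		"https://api.github.com/repos/%s/%s/contents/%s?ref=%s",
		url.PathEscape(owner), url.PathEscape(repo), path, url.QueryEscape(sha),
	)
}

func main() {
	fmt.Println(contentsURL("rikdc", "temporal-code-reviewer", "internal/review/agent.go", "a1b2c3d4e5f6"))
}
```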

The FixGenerator uses Claude Haiku. This is a narrow, mechanical task: produce a targeted diff from explicit instructions. It does not need the deeper reasoning budget used by triage and security review.

The fan-out is decided at runtime. If triage emits zero autofixable findings, the fan-out is zero. If it emits ten, the parent spawns ten fix_finding children in parallel. That is the part that feels swarm-like. It is still just dynamic orchestration, though, not mystical emergent intelligence.
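The collect-what-succeeded behavior is the important part. In the real system the parent gathers Temporal child workflow futures; the sketch below uses plain goroutines purely to show the shape, with all types and the simulated failure invented for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

// Stand-ins for the real types. In the actual system each fix runs as a
// Temporal child workflow and the parent collects futures; goroutines are
// used here only so the sketch runs standalone.
type Finding struct{ ID string }
type Patch struct{ FindingID, Diff string }

func generateFix(f Finding) (Patch, error) {
	if f.ID == "bad" { // simulate one fix generator timing out
		return Patch{}, fmt.Errorf("fix generator failed for %s", f.ID)
	}
	return Patch{FindingID: f.ID, Diff: "--- a\n+++ b\n"}, nil
}

// fanOut spawns one worker per autofixable finding and keeps whatever
// succeeds. One failure does not sink the batch.
func fanOut(findings []Finding) []Patch {
	var mu sync.Mutex
	var wg sync.WaitGroup
	var patches []Patch
	for _, f := range findings {
		wg.Add(1)
		go func(f Finding) {
			defer wg.Done()
			if p, err := generateFix(f); err == nil {
				mu.Lock()
				patches = append(patches, p)
				mu.Unlock()
			}
		}(f)
	}
	wg.Wait()
	return patches
}

func main() {
	got := fanOut([]Finding{{"f1"}, {"bad"}, {"f3"}})
	fmt.Println(len(got)) // 2: the failed fix is dropped, not fatal
}
```

Zero findings means the loop body never runs, which is exactly the zero fan-out case.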

Coalescing Patches Without Making a Mess #

Once the child workflows finish, the parent hands the results to CoalesceActivity, which keeps the successful fixes, resolves file-level conflicts, applies diffs, and creates a Git tree, commit, and branch through the GitHub API.

The atomic commit path matters for idempotency. Temporal can retry activities, so you do not want half the patch set already published somewhere when coalescing restarts.

The branch name is deterministic: ai-fixes/pr-{number}-{sha[:8]}.

This was learned the annoying way. A previous version used time.Now().Unix() in the branch name. Every retry produced a fresh branch, so the idempotency check never matched the earlier attempt. The result was a quiet colony of orphan branches multiplying in the dark. It took about forty of them before the lesson became memorable.
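The fix is one line once you see it: derive the name only from inputs that are stable across retries. A sketch, with the function name as an assumption:

```go
package main

import "fmt"

// fixBranch derives the follow-up branch name from the PR number and the
// head commit. Same inputs, same name, so a retried coalesce step finds
// the branch created by the earlier attempt instead of minting a new one.
// (The time.Now().Unix() version could never match itself.)
func fixBranch(prNumber int, headSHA string) string {
	return fmt.Sprintf("ai-fixes/pr-%d-%s", prNumber, headSHA[:8])
}

func main() {
	fmt.Println(fixBranch(128, "9fceb02d0ae598e95dc970b74767f19372d61af8"))
	// ai-fixes/pr-128-9fceb02d
}
```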

The follow-up PR includes applied fixes, escalated findings, and skipped conflicts. That is enough context for a human to see what the system changed, and what it very intentionally refused to touch.

The Real Design Question #

The interesting question was always what happens when the system gets a patch wrong.

A bad review comment costs a developer maybe thirty seconds. A bad patch costs investigation, reversion, and lost trust. A bad patch in auth or payments becomes a security or correctness problem with a much uglier blast radius.

That is what the triage rules are encoding. High-stakes findings never go to the fix generator at all. The human review gate exists as a hard boundary around what the system is permitted to change.

Temporal Carries More Than Its Share #

The same Temporal primitives that make parallel batch processing tractable are doing the hard, unglamorous work here too. The only difference is that the activities are calling LLM APIs, which are less predictable than a database and much more eager to waste your money when retries go wrong.

Each review agent heartbeats at 20%, 40%, 60%, and 80% completion so Temporal does not mistake a slow model call for a dead activity. HeartbeatTimeout needs to be longer than the p95 latency of the model you are calling. Otherwise, Temporal helpfully retries work that was merely slow, and you get to pay for duplicated inference while debugging a problem you created yourself.

The TriageAgent also uses NonRetryableErrorTypes for JSON parse failures caused by response truncation. Those failures were deterministic. If the payload was too large and the model got clipped once, retrying it three times just produced the same broken output with more ceremony. The actual fix was increasing max_tokens for triage from 2000 to 8000, because triage aggregates findings from all four review agents and therefore has the largest response shape in the system.
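The activity side of that looks like classifying the parse failure under a named error type, which the RetryPolicy can then list in NonRetryableErrorTypes (the SDK wiring is omitted and the type name is illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TruncatedJSONError marks a deterministic failure: the model response was
// clipped, so retrying the identical call reproduces the identical broken
// payload. The activity's RetryPolicy would name this type in
// NonRetryableErrorTypes so Temporal fails fast instead of retrying.
type TruncatedJSONError struct{ raw string }

func (e *TruncatedJSONError) Error() string {
	return fmt.Sprintf("truncated model response (%d bytes)", len(e.raw))
}

// parseTriage wraps parse failures in the non-retryable type. This sketch
// treats every unmarshal error as truncation; a real version might inspect
// the payload first.
func parseTriage(raw string) (map[string]any, error) {
	var out map[string]any
	if err := json.Unmarshal([]byte(raw), &out); err != nil {
		return nil, &TruncatedJSONError{raw: raw}
	}
	return out, nil
}

func main() {
	_, err := parseTriage(`{"findings": [{"severity": "high", "disposi`) // clipped mid-key
	fmt.Println(err)
}
```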

Temporal visibility is the other quiet advantage. When ten child workflows are running and three are timing out, the UI shows exactly which child failed, where, and how often. In distributed systems, having the answer be obvious is an underrated luxury.

The Useful Threshold #

The code review system became meaningfully more agentic when it started opening pull requests without being asked.

That changes the standard immediately. A system that comments on code can be wrong in mildly annoying ways. A system that edits code and pushes branches gets judged by a different standard, especially in the parts of the codebase where mistakes are expensive.

The trust here comes from structure, not charm. Structured output, explicit triage rules, deterministic orchestration, and hard boundaries around risky domains do most of the serious work. The model helps, obviously, but left unsupervised it would also cheerfully produce nonsense with perfect confidence and excellent formatting. That is not a trait to build a control plane around.

What matters after that is repetition. Once fixes are structured and tracked, you can see which categories hold up, which ones get reverted, and which ones should have stayed behind a human review gate. That is where the useful signal lives. The rest is mostly branding.

The economics help. Anthropic’s new Code Review has been reported at roughly $15 to $25 per review. This workflow currently lands around $0.75. That leaves plenty of room to capture actual operational data, compare outcomes over time, and learn something before anyone starts treating machine judgment like a premium artisanal product. Structured output makes that loop possible. Free-form prose still has uses, but mostly for commentary and, when things go badly, a suspiciously well-written incident review.


The code is available at github.com/rikdc/temporal-code-reviewer.