Prism: Multi-Agent PR Reviews with Temporal


LLM code review is genuinely useful right up until you notice it keeps finding the same three things: a potential nil dereference, a suggestion to add more error handling, and a note that your function is “somewhat long.” Technically accurate. Not quite the point.

The problem compounds when you try to build on top of it. A wall of suggestions with no severity, no category, and no structure is just text — you can’t block a merge on it, route it to the right person, or track whether the same class of issue keeps appearing. You read it, act on some of it, and close the tab. Nothing accumulates.

Prism is my attempt to fix that by being considerably more annoying about it. It’s primarily a proof-of-concept for multi-agent LLM orchestration with Temporal, though it ended up useful enough that I kept the code review framing. Four specialized agents review the same PR in parallel, each focused on one domain, and a synthesis agent aggregates the findings into a final recommendation. This post covers the architecture and why structured JSON output matters more than you might expect.

The Architecture #

Prism orchestrates four agents concurrently via Temporal workflows:

  • Security Agent: Vulnerabilities, security anti-patterns, potential exploits
  • Style Agent: Code quality, readability, convention adherence
  • Logic Agent: Bugs, edge cases, nil dereferences, correctness
  • Documentation Agent: Godoc coverage, comment quality, README completeness

Each agent has a focused prompt stored as a markdown file, calls an LLM via OpenRouter, and returns structured findings with severity levels. Once all four complete, a Synthesis Agent aggregates everything into an overall recommendation: approved, needs_changes, or blocked.

A real-time dashboard shows agent progress via Server-Sent Events, which felt like an indulgence until it turned out to be the most useful debugging tool in the whole system.

PR Submitted
     │
     └── Temporal Workflow
           ├── SecurityAgent  ──┐
           ├── StyleAgent     ──┤
           ├── LogicAgent     ──┤ (all parallel)
           └── DocsAgent      ──┘
                                │
                         SynthesisAgent
                                │
                         Final Recommendation

One Prompt, Four Jobs #

Ask one LLM to review code and it tries to cover everything. The result is surface-level coverage across all dimensions rather than depth in any. Security crowds out logic; style fills space that should go to documentation. It’s trying to be four people at once, which is roughly as effective as it sounds.

Specialized agents sidestep this with one mechanism: each prompt is scoped to a single domain, so the entire attention budget goes to that concern. The security agent isn’t trading attention with formatting feedback. The documentation agent doesn’t get crowded out.

You also get structured output from each agent (findings with severity levels), which makes it possible to filter and prioritize programmatically. The synthesis agent can say “two security issues of severity HIGH, four style issues of severity LOW” rather than producing a wall of undifferentiated text. This mirrors how thorough human code review actually works — when security-critical code ships, you don’t want the person checking authentication logic also juggling whether the variable names follow conventions. Context-switching between domains is not the same thing as depth in any of them.

Temporal Makes the Orchestration Trivial #

The obvious alternative is errgroup: four goroutines, collect results, pass to synthesis. Fifteen lines, no infrastructure dependency. That handles the coordination.

What it doesn’t handle: LLM APIs are unreliable by nature, and retry logic for that class of failure isn’t optional — it’s load-bearing infrastructure. Writing it per-agent with backoff, partial failure handling, and timeout configuration is the kind of code that takes a weekend and then quietly causes incidents for months. The RetryPolicy and HeartbeatTimeout in the workflow below are that logic. With errgroup, it’s code you write and maintain.

The SSE dashboard is in the same category — per-agent progress tracking is trivial when Temporal already maintains workflow state.

The other honest factor: if you’re already running a Temporal cluster, adding a workflow is operationally free. If you’re not, this project doesn’t justify spinning one up.

With Temporal, the parallel orchestration is roughly 30 lines of workflow code:

func PRReviewWorkflow(ctx workflow.Context, input PRReviewInput) (*ReviewResult, error) {
    ao := workflow.ActivityOptions{
        StartToCloseTimeout: 90 * time.Second,
        HeartbeatTimeout:    30 * time.Second,
        RetryPolicy: &temporal.RetryPolicy{
            MaximumAttempts: 3,
        },
    }
    ctx = workflow.WithActivityOptions(ctx, ao)

    securityFuture := workflow.ExecuteActivity(ctx, SecurityAgent, input)
    styleFuture    := workflow.ExecuteActivity(ctx, StyleAgent, input)
    logicFuture    := workflow.ExecuteActivity(ctx, LogicAgent, input)
    docsFuture     := workflow.ExecuteActivity(ctx, DocsAgent, input)

    var results [4]AgentResult
    futures := []workflow.Future{securityFuture, styleFuture, logicFuture, docsFuture}
    for i, future := range futures {
        if err := future.Get(ctx, &results[i]); err != nil {
            workflow.GetLogger(ctx).Error("agent failed", "index", i, "error", err)
        }
    }

    var synthesis ReviewSummary
    err := workflow.ExecuteActivity(ctx, SynthesisAgent, results).Get(ctx, &synthesis)
    return &ReviewResult{Summary: synthesis}, err
}

Parallel execution, timeouts, automatic retries on transient LLM API failures… all handled. Each agent runs as a Temporal activity with its own heartbeat configuration. If OpenRouter times out for one agent, Temporal retries it. If a worker crashes mid-review, Temporal resumes from the last checkpoint. You don’t write any of that logic yourself, which is the entire point.

I’ve written about this pattern before in the context of financial batch workflows. The coordination problem is identical; the only thing that changes is what the activities are actually doing.

Structured LLM Output #

Getting LLMs to return structured JSON is what turns a demo into something that composes with other systems. A high-severity security finding that blocks a merge or routes to Slack is only possible if it’s structured data. Free-form text lives in a dashboard. Structured data integrates with everything else — severity filtering, merge blocking, historical analysis, Jira ticket creation.

Setting response_format: json_object and calling it done doesn’t hold up. OpenRouter doesn’t consistently enforce that parameter across all models. Some models comply. Some wrap their JSON in markdown code blocks anyway, apparently as a courtesy you didn’t ask for. Getting reliable structured output required three layers working together.

Layer 1: API Response Format #

// llm/client.go
ResponseFormat: &openai.ChatCompletionResponseFormat{
    Type: openai.ChatCompletionResponseFormatTypeJSONObject,
}

On models that support it natively, this constrains the sampler to produce only valid JSON tokens — structurally correct output, even if schema compliance is still your problem.

Layer 2: Explicit Prompt Instructions #

The prompts repeat the JSON requirement multiple times, in bold, at the top and bottom:

# Logic Review Agent

**CRITICAL: You MUST respond ONLY with valid JSON. Do not include any text
before or after the JSON. Your entire response must be parseable as JSON.**

## Output Format

**IMPORTANT: Your response must be ONLY valid JSON. No markdown code blocks,
no explanatory text, no preamble. Just the raw JSON object.**

Your response must match this EXACT schema:

{
  "status": "passed" | "warning" | "failed",
  "findings": [
    {
      "severity": "critical" | "high" | "medium" | "low",
      "title": "Brief description",
      "description": "Detailed explanation with line references"
    }
  ],
  "summary": "Overall assessment"
}

The redundancy is intentional. LLMs respond well to emphatic, repeated instructions; the prompt is an instruction, not a contract.

Layer 3: Robust Parsing with Fallback #

Even with layers one and two, some responses slip through with markdown fences or stray text. The parser strips common wrappers before attempting to decode, and falls back gracefully if decoding still fails:

// activities/parse_utils.go
func extractJSON(content string) string {
    if idx := strings.Index(content, "```json"); idx != -1 {
        content = content[idx+7:]
        if end := strings.Index(content, "```"); end != -1 {
            content = content[:end]
        }
    } else if idx := strings.Index(content, "```"); idx != -1 {
        content = content[idx+3:]
        if end := strings.Index(content, "```"); end != -1 {
            content = content[:end]
        }
    }
    return strings.TrimSpace(content)
}

func parseStructuredReview(content string, agentName string) (*StructuredReview, string) {
    cleaned := extractJSON(content)

    var review StructuredReview
    if err := json.Unmarshal([]byte(cleaned), &review); err != nil {
        return &StructuredReview{
            Status:  "warning",
            Summary: "Review completed but response format was not valid JSON",
            Findings: []Finding{{
                Severity:    "low",
                Title:       "Raw LLM Response",
                Description: content,
            }},
        }, content
    }
    return &review, ""
}

The fallback matters. If parsing fails, the system doesn’t crash; it wraps the raw response in a valid StructuredReview so the synthesis agent still has something to work with. Model non-compliance never breaks the workflow contract — downstream code always receives a typed struct, and the model’s misbehaviour stays contained at the boundary where it happened.
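The whole parsing path, condensed into a self-contained demonstration (the types and functions are trimmed versions of the ones above; the real parser also returns the raw response alongside the fallback): a model that wraps its JSON in a markdown fence still produces a typed StructuredReview, and a model that returns prose produces the "warning" fallback instead of an error.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type Finding struct {
	Severity    string `json:"severity"`
	Title       string `json:"title"`
	Description string `json:"description"`
}

type StructuredReview struct {
	Status   string    `json:"status"`
	Findings []Finding `json:"findings"`
	Summary  string    `json:"summary"`
}

// extractJSON strips markdown fences a model may have wrapped around its JSON.
func extractJSON(content string) string {
	if idx := strings.Index(content, "```json"); idx != -1 {
		content = content[idx+7:]
	} else if idx := strings.Index(content, "```"); idx != -1 {
		content = content[idx+3:]
	}
	if end := strings.Index(content, "```"); end != -1 {
		content = content[:end]
	}
	return strings.TrimSpace(content)
}

// parseStructuredReview decodes the cleaned response, falling back to a valid
// "warning" review that carries the raw text when decoding fails.
func parseStructuredReview(content string) *StructuredReview {
	var review StructuredReview
	if err := json.Unmarshal([]byte(extractJSON(content)), &review); err != nil {
		return &StructuredReview{
			Status:   "warning",
			Summary:  "Review completed but response format was not valid JSON",
			Findings: []Finding{{Severity: "low", Title: "Raw LLM Response", Description: content}},
		}
	}
	return &review
}

func main() {
	fenced := "```json\n{\"status\":\"passed\",\"findings\":[],\"summary\":\"ok\"}\n```"
	fmt.Println(parseStructuredReview(fenced).Status)                 // fence stripped, parses cleanly
	fmt.Println(parseStructuredReview("I think this PR is fine!").Status) // fallback path
}
```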

The Schema #

type StructuredReview struct {
    Status   string    `json:"status"`   // "passed", "warning", "failed"
    Findings []Finding `json:"findings"`
    Summary  string    `json:"summary"`
}

type Finding struct {
    Severity    string `json:"severity"`    // "critical", "high", "medium", "low"
    Title       string `json:"title"`
    Description string `json:"description"` // Detailed with line references
}

This schema is the contract the rest of the system builds on. Once you have severity-tagged findings as structured data, the useful integrations become obvious: block merges on failed status, track which categories appear most often, route critical findings to the right person. Without the schema, you’re doing vibes-based code review.
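As one sketch of what that contract enables, a merge-gating policy might look like this. The thresholds are hypothetical, not from the Prism codebase: block on any failed agent or critical finding, flag anything high-severity, approve the rest.

```go
package main

import "fmt"

type Finding struct {
	Severity string `json:"severity"`
	Title    string `json:"title"`
}

type StructuredReview struct {
	Status   string    `json:"status"`
	Findings []Finding `json:"findings"`
}

// mergeDecision turns severity-tagged findings into a gate. The exact
// thresholds here are illustrative policy, not part of Prism.
func mergeDecision(reviews []StructuredReview) string {
	highSeen := false
	for _, r := range reviews {
		if r.Status == "failed" {
			return "blocked"
		}
		for _, f := range r.Findings {
			switch f.Severity {
			case "critical":
				return "blocked"
			case "high":
				highSeen = true
			}
		}
	}
	if highSeen {
		return "needs_changes"
	}
	return "approved"
}

func main() {
	reviews := []StructuredReview{
		{Status: "passed"},
		{Status: "warning", Findings: []Finding{{Severity: "high", Title: "unchecked error"}}},
	}
	fmt.Println(mergeDecision(reviews)) // needs_changes
}
```

None of this is possible with free-form text, which is the point of the schema.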

The Prompt Architecture #

The four agent prompts live in markdown files — security scoped to OWASP categories and injection patterns, logic to nil dereferences and error path coverage, style to naming and complexity, docs to exported symbol completeness. Each is deliberately narrow: a logic agent that’s also auditing credentials isn’t really doing either job.

Storing prompts as .md files rather than hardcoded strings makes iteration fast. You can update an agent’s focus without touching Go code, version changes in git, and compare prompt variations by diffing files. It also means you can tune agents independently — if the security agent is over-reporting false positives, you fix that prompt without touching anything else.

Once the raw LLM response is parsed into a StructuredReview, it gets packaged into the workflow-level contract that flows through Temporal:

type AgentFinding struct {
    Severity    string `json:"severity"`    // HIGH, MEDIUM, LOW, INFO
    Category    string `json:"category"`
    Description string `json:"description"`
    LineRef     string `json:"line_ref,omitempty"`
    Suggestion  string `json:"suggestion,omitempty"`
}

type AgentResult struct {
    AgentName string         `json:"agent_name"`
    Status    string         `json:"status"`    // approved, needs_changes, blocked
    Findings  []AgentFinding `json:"findings"`
    Summary   string         `json:"summary"`
}

AgentResult is the per-agent contract returned to the workflow. It’s distinct from StructuredReview above — StructuredReview is the internal parse target for decoding raw LLM output; AgentResult is what that parsed data gets packaged into before it reaches the synthesis stage. The synthesis agent receives four of these and produces a final recommendation, with each agent’s contribution preserved.
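That translation step might look something like the sketch below. The status and severity mappings are assumptions for illustration (e.g. folding "critical" into HIGH, since AgentFinding has no CRITICAL level), not necessarily Prism's exact rules:

```go
package main

import (
	"fmt"
	"strings"
)

type Finding struct {
	Severity, Title, Description string
}

type StructuredReview struct {
	Status   string // "passed", "warning", "failed"
	Findings []Finding
	Summary  string
}

type AgentFinding struct {
	Severity    string // HIGH, MEDIUM, LOW, INFO
	Category    string
	Description string
}

type AgentResult struct {
	AgentName string
	Status    string // approved, needs_changes, blocked
	Findings  []AgentFinding
	Summary   string
}

// toAgentResult maps the parse-level contract onto the workflow-level one.
// Mapping choices here are hypothetical.
func toAgentResult(agent string, r StructuredReview) AgentResult {
	statusMap := map[string]string{"passed": "approved", "warning": "needs_changes", "failed": "blocked"}
	severityMap := map[string]string{"critical": "HIGH", "high": "HIGH", "medium": "MEDIUM", "low": "LOW"}
	out := AgentResult{AgentName: agent, Status: statusMap[r.Status], Summary: r.Summary}
	for _, f := range r.Findings {
		sev, ok := severityMap[strings.ToLower(f.Severity)]
		if !ok {
			sev = "INFO" // unknown severities degrade rather than fail
		}
		out.Findings = append(out.Findings, AgentFinding{
			Severity:    sev,
			Category:    agent, // category defaults to the agent's domain
			Description: f.Title + ": " + f.Description,
		})
	}
	return out
}

func main() {
	r := StructuredReview{Status: "warning", Summary: "one risky spot",
		Findings: []Finding{{Severity: "high", Title: "nil deref", Description: "line 42"}}}
	fmt.Println(toAgentResult("logic", r).Status) // needs_changes
}
```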

Where This Is Useful #

Prism isn’t a replacement for human code review. It’s a filter before human review — catching the mechanical stuff so reviewers can focus on things that actually require judgment.

It’s genuinely good for junior developer feedback. Knowing that an issue is a logic problem versus a style nit matters for how you fix it — and for how you learn from it. A nil dereference and a long function are different kinds of problems; treating them the same doesn’t help anyone develop judgment. A single consolidated opinion doesn’t give you that signal. Four specialized agents do.

Security-critical code, payment flows, and authentication systems benefit most: the security agent running independently catches things a combined review tends to miss. Once those findings are surfaced, human reviewers can focus on what actually requires judgment — does the auth design make sense, does this payment flow handle the failure case correctly. The mechanical layer has already been handled.

For teams with inconsistent review standards, the benefit is the consistency itself: every PR gets the same four-agent treatment. There’s no “I’ll let the style stuff slide this time.”

What’s Next #

Once findings are structured data, integrations that would be impossible with free-form text become straightforward:

  • GitHub PR comment integration: Post findings as inline review comments on the PR, rather than in a separate dashboard
  • Jira integration: Auto-create tickets for high-severity findings — straightforward now that findings are structured data
  • Custom agent configuration: Let teams define specialized agents for domain-specific concerns — PCI compliance, internal API contract rules, whatever the codebase actually needs
  • Historical tracking: Track which agents catch the most issues and which categories recur most often — over time, that data tells you which parts of your codebase are systematically under-reviewed
  • Cost optimization: Route cheaper models to lower-stakes agents. Style and docs reviews don’t need the same model as security analysis
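On that last point, per-agent model routing could be as simple as a lookup table. None of this exists in Prism yet; the model slugs are real OpenRouter identifiers but the assignments are made up for illustration:

```go
package main

import "fmt"

// modelForAgent routes each agent to a model tier. Hypothetical config:
// stronger model where findings gate merges, cheaper one for nits.
var modelForAgent = map[string]string{
	"security": "anthropic/claude-3.5-sonnet", // highest stakes
	"logic":    "anthropic/claude-3.5-sonnet",
	"style":    "openai/gpt-4o-mini", // style nits tolerate a cheaper model
	"docs":     "openai/gpt-4o-mini",
}

func modelFor(agent string) string {
	if m, ok := modelForAgent[agent]; ok {
		return m
	}
	return "openai/gpt-4o-mini" // illustrative default
}

func main() {
	fmt.Println(modelFor("security"))
}
```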

The Takeaway #

Multi-agent LLM orchestration looks harder than it is. Structured LLM output looks easier than it is. Those two asymmetries turn out to be related.

Multi-agent systems are straightforward to build when the infrastructure handles coordination. The hard parts (parallel execution, fault tolerance, state management, and retry logic) are Temporal primitives. For Prism, the parts you write are prompt design and agent behaviour — not distributed systems plumbing.

Structured LLM output is non-negotiable once you’re building something that integrates with existing workflows. The three-layer approach (API parameter + explicit prompts + robust parsing) handles the reality that models don’t perfectly follow instructions. Without it, you have a demo. With it, you have something teams can depend on.

When orchestration is boring and outputs are typed, code review becomes something you can reason about. Not something you squint at.


The code is available at github.com/rikdc/temporal-code-reviewer.