Temporal Entity Workflows

The Version Marker You Need Before Your First Deployment


The previous posts in this series covered failures that happen at runtime: signals dropped at a run boundary, timers reset by ContinueAsNew, state that grows unbounded over years. This one is different. The failure mode here is triggered by something you do deliberately: deploying new code.

Temporal replays workflow history deterministically. When a worker is deployed with new code, any in-flight workflow that gets picked up will replay its existing event history using that new code. If the new code makes different decisions at the same point in the history, Temporal raises a non-determinism error and the workflow fails [1]. For a short-lived workflow, most instances complete before a deployment lands. For an entity workflow that runs for years, every deployment is a potential breaking change.

The full implementation is at rikdc/temporal-entity-workflow-demo.

What Counts as a Breaking Change

The Go SDK performs a runtime check that catches some incompatible changes: adding, removing, or reordering workflow API calls that produce Commands [1]. But the check is not exhaustive. It does not catch timer duration changes, activity input argument changes, or logic changes that alter which branch of existing code executes. Changes like these pass the check unnoticed, so nothing warns you before they ship; any resulting non-determinism error or silently changed behaviour first appears in production, on the next replay of an in-flight workflow.

The practical implication: in an entity workflow, the category of “things that require versioning” is broader than it first appears. It includes not just structural changes to the workflow graph but any behavioural change that affects what the workflow does at a given point in its history.
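The gap between what the check catches and what it misses can be illustrated with a toy model. This is a sketch in plain Go, not the SDK's actual check: the `command` type and `firstMismatch` function are hypothetical, and the model assumes, as a deliberate simplification, that the check compares only the kind and order of Commands and ignores their details.

```go
package main

import "fmt"

// command is a hypothetical stand-in for a Command recorded in workflow history.
type command struct {
	kind   string // e.g. "activity" or "timer"
	detail string // e.g. an activity input or a timer duration
}

// firstMismatch returns the index of the first recorded command that the
// replayed code fails to reproduce, or -1 if replay matches the history.
// Assumption: only kind and position are compared, never detail.
func firstMismatch(recorded, replayed []command) int {
	for i := range recorded {
		if i >= len(replayed) || recorded[i].kind != replayed[i].kind {
			return i
		}
	}
	return -1
}

func main() {
	recorded := []command{
		{kind: "activity", detail: "charge"},
		{kind: "timer", detail: "24h"},
	}
	// New code that swaps the two calls: caught at the first command.
	reordered := []command{
		{kind: "timer", detail: "24h"},
		{kind: "activity", detail: "charge"},
	}
	// New code that only changes the timer duration: not caught by this check.
	durationChanged := []command{
		{kind: "activity", detail: "charge"},
		{kind: "timer", detail: "48h"},
	}
	fmt.Println(firstMismatch(recorded, reordered), firstMismatch(recorded, durationChanged))
	// prints: 0 -1
}
```

The reordered run is flagged immediately, while the duration change replays cleanly in this model even though the workflow's behaviour has changed. That second case is the kind of change that needs a version gate precisely because nothing else will flag it.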

The Baseline Marker

The minimum thing to do before any workflow reaches production is add a version marker at the top of the workflow function:

_ = workflow.GetVersion(ctx, "baseline", workflow.DefaultVersion, 1)

workflow.GetVersion records a marker event in the workflow history. On replay, it reads the recorded version from history rather than re-evaluating. This establishes a stable anchor point that all future versioning decisions are relative to [1].

The timing matters. This marker must be added before the workflow is deployed to production. Adding workflow.GetVersion to a workflow that is already running in production is itself a behavioural change: existing runs have no marker in their history, so the new code’s call to GetVersion produces a Command that doesn’t match the existing event history. That’s a non-determinism error.

Put the baseline in first, before anything else.

Gating a Behavioural Change

Once the baseline is in place, each subsequent behavioural change gets its own version gate:

v := workflow.GetVersion(ctx, "add-tier-expiry", workflow.DefaultVersion, 2)
if v >= 2 {
    // new behaviour: check tier expiry on each point event
} else {
    // old behaviour: preserved for in-flight workflows replaying old history
}

workflow.DefaultVersion is -1. Workflows that were running before this change have no marker for add-tier-expiry in their history, so GetVersion returns DefaultVersion and they take the old path. New workflows record version 2 and take the new path. Both versions coexist in the same deployed binary [1].
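To make the two paths concrete, here is a toy model of the gate in plain Go. It is a sketch of the behaviour described above, not the SDK's implementation: the `run` type, its `replaying` flag, and the `getVersion` and `pathFor` helpers are all hypothetical, and the minimum-version check is left out to keep the model small.

```go
package main

import "fmt"

// defaultVersion mirrors the value the post gives for workflow.DefaultVersion.
const defaultVersion = -1

// run is a hypothetical stand-in for one workflow run: the version markers
// already in its history, and whether the code is replaying that history.
type run struct {
	markers   map[string]int
	replaying bool
}

// getVersion models the gate: a recorded marker always wins; a replaying run
// with no marker gets defaultVersion; a fresh run records maxSupported.
func (r *run) getVersion(changeID string, maxSupported int) int {
	if v, ok := r.markers[changeID]; ok {
		return v
	}
	if r.replaying {
		return defaultVersion
	}
	r.markers[changeID] = maxSupported
	return maxSupported
}

// pathFor reports which branch of the gated change a run would take.
func pathFor(r *run) string {
	if r.getVersion("add-tier-expiry", 2) >= 2 {
		return "new"
	}
	return "old"
}

func main() {
	// Started before the change: history holds only the baseline marker.
	inFlight := &run{markers: map[string]int{"baseline": 1}, replaying: true}
	// Started after the change: empty history, executing for the first time.
	fresh := &run{markers: map[string]int{}}

	fmt.Println(pathFor(inFlight), pathFor(fresh))
	// prints: old new
}
```

Calling `pathFor(fresh)` a second time still returns "new", because the marker recorded on the first pass is what later replays read back.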

The change ID ("add-tier-expiry" here) is permanent. It’s recorded in the event history of every workflow that passes through this gate. Reusing a change ID for a different change would corrupt that record. Once a change ID is in production, it stays.

The Deprecation Lifecycle

Version gates accumulate over time. The deprecation path has three steps:

sequenceDiagram
    participant Deploy1 as Deploy: both branches
    participant Deploy2 as Deploy: raise minimum
    participant Deploy3 as Deploy: remove gate

    Deploy1->>Deploy1: GetVersion("add-tier-expiry", DefaultVersion, 2)
    Note over Deploy1: Old and new workflows coexist
    Deploy2->>Deploy2: GetVersion("add-tier-expiry", 2, 2)
    Note over Deploy2: Old path removed, old workflows\nmust have completed or ContinueAsNew'd
    Deploy3->>Deploy3: Remove GetVersion call entirely
    Note over Deploy3: Safe once no workflows\npre-date this change

Raising the minimum version in step 2 (workflow.GetVersion(ctx, "add-tier-expiry", 2, 2)) removes the old branch from new deployments and causes an error if any remaining in-flight workflow is still replaying with the old version. This is intentional: it’s the safety check that confirms all old workflows have either completed or transitioned through ContinueAsNew into the new version.

ContinueAsNew is a natural upgrade point for this reason. The new run starts fresh, picks up the current worker code, and records the current version on its first pass through the gate.

What Requires a Version Gate

The scope of changes that need a version gate is wider than it might seem:

  • Adding, removing, or reordering activities
  • Changing timer durations
  • Adding new branches to workflow logic
  • Changing activity input types or return types used in workflow decisions
  • Adding new workflow.GetVersion calls themselves

What doesn’t need one: changes that don’t affect the sequence of Commands emitted during replay. Logging, metrics, comments, and purely internal refactors that don’t touch workflow API calls are safe without a gate.

When in doubt, add the gate. The cost of an unnecessary version gate is a few lines of code that get cleaned up later. The cost of a missing one is a non-determinism error on a workflow that may have been running for two years.

What to Take From This

The previous posts in this series were about failure modes that emerge gradually, over months or years of operation. Versioning is different: it becomes relevant from the first deployment, and the baseline marker is the one thing that must be in place before any workflow goes to production.

After that, the pattern is consistent. Every behavioural change gets a gate. Change IDs are permanent. Old branches stay until you can confirm no in-flight workflow needs them. ContinueAsNew is the natural mechanism for moving running workflows from old behaviour to new.


This is the final post in the series. Post 1 introduced the entity workflow pattern and the four failure modes covered here. Each one came from an assumption that holds within a single short workflow run and breaks across the lifetime of a long-running entity.


  1. Temporal Go SDK documentation: Versioning