Temporal Entity Workflows
When Your Workflow Has No Finish Line
Most Temporal workflows model a task: process this order, send this notification, run this pipeline. They start, they do a thing, they stop. Reasoning about their failure modes is bounded by how long they run.
Entity workflows model an object, not a process: not “enrolling a customer” but the customer themselves, for as long as they exist in the system. That distinction is small on paper and significant in practice.
This post is about what that looks like, and why it changes which assumptions you can make.
The Basic Shape #
A Temporal entity workflow is a long-running workflow instance tied to the lifetime of a domain object. In this case: a customer’s rewards membership. One workflow per customer. It starts on enrollment, runs for months or years, and handles every significant event in that customer’s membership lifecycle.
The full implementation is available at rikdc/temporal-entity-workflow-demo.
The architecture that supports it looks like this:
```mermaid
flowchart TD
    CLI["CLI (enroll / add-points / status / unenroll)"]
    TS["Temporal Server (durable buffer)"]
    W["Worker (RewardsWorkflow)"]
    A["Activities (Enroll, NotifyTierChange, RecordUnenrollment)"]
    CLI -->|signals / queries| TS
    TS -->|workflow tasks| W
    W -->|activity tasks| A
```
Four things worth naming explicitly before moving on:
Signals are how external events reach the workflow. They’re asynchronous and durable: if the worker is down when a signal arrives, the Temporal server buffers it. The event is not lost. It will be delivered when the worker is back.
Queries read live in-memory workflow state. They don’t touch a database, and they don’t add entries to event history. One catch: a worker must be running to serve a query. If no workers are available, the query fails.
Activities handle all side effects: database writes, tier change notifications, anything that touches the outside world. They get automatic retries with configurable policies.
ContinueAsNew is how the workflow stays manageable over time. Temporal records every event in a workflow’s history. Left unchecked, that history would grow without bound, making replay progressively slower and heavier. ContinueAsNew ends the current run, starts a new one with the same workflow ID, and passes state as the new run’s input. From the outside, nothing changes. The workflow ID is stable. Signals and queries keep working. It’s closer to turning a page than closing a book.
Worth saying plainly: ContinueAsNew exists to work around Temporal’s history limits (51,200 events or 50MB per execution)1, not because it’s an elegant feature in its own right. One external author called it “a workaround for a limitation rather than a feature.”2 That framing is fair. It does what it needs to do, but it introduces its own failure modes, which is most of what the rest of this series covers.
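The four pieces fit together in a single workflow function. The sketch below shows the shape under Temporal's Go SDK; the struct fields, signal names ("add-points", "unenroll"), query name ("status"), and the threshold check are illustrative assumptions, not the demo repo's actual identifiers.

```go
package rewards

import "go.temporal.io/sdk/workflow"

// RewardsState is a hypothetical state struct for illustration; fields
// must be exported so they survive serialisation across ContinueAsNew.
type RewardsState struct {
	CustomerID string
	Points     int
	EventCount int
}

// RewardsWorkflow sketches the entity-workflow loop: queries answered
// from in-memory state, signals drained durably, and ContinueAsNew
// before the event history grows too large.
func RewardsWorkflow(ctx workflow.Context, state RewardsState) error {
	// Queries read live in-memory state; no database involved.
	if err := workflow.SetQueryHandler(ctx, "status", func() (RewardsState, error) {
		return state, nil
	}); err != nil {
		return err
	}

	addPoints := workflow.GetSignalChannel(ctx, "add-points")
	unenroll := workflow.GetSignalChannel(ctx, "unenroll")
	done := false

	selector := workflow.NewSelector(ctx)
	selector.AddReceive(addPoints, func(c workflow.ReceiveChannel, _ bool) {
		var points int
		c.Receive(ctx, &points)
		state.Points += points
		state.EventCount++
	})
	selector.AddReceive(unenroll, func(c workflow.ReceiveChannel, _ bool) {
		var reason string
		c.Receive(ctx, &reason)
		done = true
	})

	// Conservative fixed threshold, well below Temporal's history limits.
	for state.EventCount < 200 {
		selector.Select(ctx)
		if done {
			return nil // membership ended; the entity's lifetime is over
		}
	}

	// Same workflow ID, fresh history; state travels as the new run's input.
	return workflow.NewContinueAsNewError(ctx, RewardsWorkflow, state)
}
```

The loop exits in exactly two ways: the entity ceases to exist (a terminal signal), or the run rolls over via ContinueAsNew with state passed as the next run's input.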
How Entity Workflows Break Differently #
A short-lived workflow fails within a bounded window. You can reason about what state it was in, fix the underlying issue, and let it retry or rerun. An entity workflow runs for years, and that changes the calculus considerably: the surface area for failure grows with time, and some failure modes only become visible after months of operation.
The diagram below shows the four failure modes that surfaced while building this system. None of them appear in a short workflow. All of them appear in a long-running entity workflow, given enough time.
```mermaid
graph LR
    A([Entity Workflow]) --> B[History accumulation]
    A --> C[State lost across ContinueAsNew]
    A --> D[In-flight deployment conflicts]
    A --> E[Unbounded dedup state]
    B --> B1["Replay slows as history grows"]
    C --> C1["Timers don't survive run boundaries"]
    D --> D1["New code must match existing event history"]
    E --> E1["Dedup map grows silently for years"]
```
History accumulation #
Every signal, activity result, and timer creates events. For a short workflow, this is irrelevant. For one that runs for years, unchecked history means progressively slower replay: the worker has to reconstruct state from the full event log each time it picks up a workflow it doesn’t have cached. Temporal provides a GetContinueAsNewSuggested() call on workflow info that returns true when the workflow is approaching history limits3. A fixed event-count threshold (the implementation here uses 200, chosen conservatively, well below Temporal’s limits) is a simpler and more predictable trigger.
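Combining the two checks is a one-liner in the Go SDK. A minimal sketch, assuming a `maxEventsPerRun` constant and an event counter the workflow maintains itself:

```go
package rewards

import "go.temporal.io/sdk/workflow"

// maxEventsPerRun is a fixed, conservative threshold, well below
// Temporal's per-execution history limits.
const maxEventsPerRun = 200

// shouldContinueAsNew combines the fixed event-count threshold with the
// server-side hint; either signal is enough to roll over to a new run.
func shouldContinueAsNew(ctx workflow.Context, eventCount int) bool {
	return eventCount >= maxEventsPerRun ||
		workflow.GetInfo(ctx).GetContinueAsNewSuggested()
}
```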
State that outlives its assumptions #
In a short workflow, you can assume a timer set at the start of a run will fire before the run ends. In an entity workflow spanning multiple ContinueAsNew cycles, that assumption breaks: timers don’t survive ContinueAsNew, and all pending timers are cancelled when a run ends. Any state that needs to persist across runs has to be explicitly carried in the input struct passed to the new run. This includes things like “when did this customer last earn points,” which turns out to matter a great deal when you’re trying to expire their points after 365 days of inactivity.
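Concretely, each run has to re-arm the expiry timer from carried state rather than assume one is still pending. A sketch, where `carriedState` and the function name are illustrative, and only the 365-day window comes from the post:

```go
package rewards

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// inactivityWindow: points expire after 365 days without earning.
const inactivityWindow = 365 * 24 * time.Hour

// carriedState holds only what must survive ContinueAsNew; the struct
// and field names here are illustrative, not the repo's.
type carriedState struct {
	LastEarnedAt time.Time // explicitly carried, because timers are not
}

// armExpiryTimer re-arms the expiry timer at the start of each run from
// carried state, since all pending timers are cancelled when a run ends.
func armExpiryTimer(ctx workflow.Context, s carriedState) workflow.Future {
	remaining := inactivityWindow - workflow.Now(ctx).Sub(s.LastEarnedAt)
	if remaining < 0 {
		remaining = 0 // window already elapsed; fire immediately
	}
	return workflow.NewTimer(ctx, remaining)
}
```

Note the use of workflow.Now rather than time.Now: the clock itself has to be deterministic, or the re-armed timer breaks replay.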
In-flight deployments #
Temporal replays workflow history deterministically. When a worker is deployed with new code, in-flight workflows will replay their existing history using the new code. If the new code makes different decisions at the same point in history, Temporal raises a non-determinism error. For a short workflow, most instances complete before a deployment lands. For a workflow that runs for years, every deployment is a potential compatibility event. Versioning with workflow.GetVersion becomes mandatory, not optional, and the baseline marker has to be in place before the first deployment, not added later.
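What gating a change behind GetVersion looks like in the Go SDK, as a sketch: the change ID "tier-bonus-points" and the bonus logic are hypothetical, but the branch structure is the standard pattern.

```go
package rewards

import "go.temporal.io/sdk/workflow"

// awardPoints gates a behaviour change behind workflow.GetVersion so
// that in-flight histories replay the old path while runs that first
// hit this code after the deployment take the new one.
func awardPoints(ctx workflow.Context, base int, tier string) int {
	v := workflow.GetVersion(ctx, "tier-bonus-points", workflow.DefaultVersion, 1)
	if v == workflow.DefaultVersion {
		// Old behaviour: histories recorded before the change replay here.
		return base
	}
	// New behaviour, version 1: doubled points for gold members.
	if tier == "gold" {
		return base * 2
	}
	return base
}
```

The first GetVersion call a workflow executes writes a marker into its history, which is why the baseline has to exist before the first deployment: without it, there is nothing in old histories to distinguish the branches.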
Deduplication over time #
Point events in the rewards workflow include an optional deduplication key to prevent double-counting if a signal is retried. A naive implementation stores every key ever seen in a map on the workflow state. At month one, that map is tiny. At year three, it contains thousands of entries, serialised and carried through every ContinueAsNew. What starts as a sensible guard becomes an unbounded data structure quietly inflating state on every run.
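One way to bound it is to cap how many keys are retained and evict the oldest, on the theory that a retried signal arrives close in time to the original. A minimal sketch in plain Go; the FIFO policy, the cap, and all names here are assumptions for illustration, not the demo repo's implementation:

```go
package main

import "fmt"

// dedupWindow keeps only the most recent maxKeys deduplication keys,
// evicting the oldest, so the state carried through ContinueAsNew
// stays bounded instead of growing for years.
type dedupWindow struct {
	maxKeys int
	seen    map[string]struct{}
	order   []string // insertion order, oldest first
}

func newDedupWindow(maxKeys int) *dedupWindow {
	return &dedupWindow{maxKeys: maxKeys, seen: make(map[string]struct{})}
}

// Check returns true if the key has not been seen, recording it and
// evicting the oldest key when the window is full.
func (d *dedupWindow) Check(key string) bool {
	if _, dup := d.seen[key]; dup {
		return false
	}
	if len(d.order) >= d.maxKeys {
		oldest := d.order[0]
		d.order = d.order[1:]
		delete(d.seen, oldest)
	}
	d.seen[key] = struct{}{}
	d.order = append(d.order, key)
	return true
}

func main() {
	d := newDedupWindow(2)
	fmt.Println(d.Check("a")) // true: first time seen
	fmt.Println(d.Check("a")) // false: duplicate suppressed
	fmt.Println(d.Check("b")) // true
	fmt.Println(d.Check("c")) // true; "a" is evicted here
	fmt.Println(d.Check("a")) // true again: outside the window
}
```

The trade-off is explicit: a signal retried after the key has been evicted will be double-counted, so the cap has to be sized against realistic retry windows rather than picked arbitrarily.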
The Pattern Behind All Four #
The bugs aren’t complicated individually, and once you can see each one the fix is usually straightforward. The harder problem is knowing which assumptions to question before they fail in production. Each failure mode below comes from an assumption that holds within a single short workflow run but breaks across the lifetime of a long-running entity:
| Assumption | Breaks when |
|---|---|
| Signals are processed before ContinueAsNew | Multiple signals arrive during activity execution |
| Timers persist across runs | ContinueAsNew cancels all pending timers |
| State structs serialise correctly | Fields are unexported |
| The dedup map stays small | The workflow runs for years |
| Code changes are safe to deploy | In-flight workflows replay with new code |
The fixes, taken one at a time, are in the posts that follow. The broader point is this: entity workflows require you to think about correctness at a different timescale, not “does this work in this run” but “does this hold across the full lifetime of the thing.” That’s a different kind of question, and it tends to surface answers you weren’t expecting.
Next: how signals can be silently dropped at the exact moment ContinueAsNew fires, and what it takes to drain a channel safely before the run ends.
1. Temporal documentation: Managing very long-running Workflows ↩︎
2. Long Quanzheng: Guide to ContinueAsNew in Cadence/Temporal workflow ↩︎
3. Temporal Go SDK documentation: Continue-As-New ↩︎