Temporal Entity Workflows

Idempotency Keys Don't Keep Forever

  ·  4 min read

The previous posts in this series covered signal loss at the ContinueAsNew boundary and timers that don’t survive run transitions. Both were runtime failures: things that broke when the workflow was under load or crossing a run boundary. This post is about a different kind of problem: state that grows without bound, and a deduplication design that works correctly in the short term but quietly degrades over years.

Point events in the rewards workflow include an optional DeduplicationKey. If a client retries a signal, the workflow checks the key against a set of already-processed keys and skips the event if it’s seen it before. The original implementation was a map[string]bool on the workflow state.

The full implementation is at rikdc/temporal-entity-workflow-demo.

The Original Approach #

type RewardsState struct {
    ProcessedKeys map[string]bool
}

On each point event: check if the key exists, skip if yes, process and store if no. Straightforward. The problem is that RewardsState is serialised into the input struct on every ContinueAsNew and lives in the workflow for the full duration of the customer’s membership.
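The check-and-store step can be sketched as below. `ApplyEvent` and its `process` callback are illustrative names, not the demo repo's actual handler:

```go
package main

import "fmt"

// RewardsState mirrors the original design: every key ever seen is kept.
type RewardsState struct {
	ProcessedKeys map[string]bool
}

// ApplyEvent skips an event whose DeduplicationKey was already processed.
// An empty key means the client opted out of deduplication.
func (s *RewardsState) ApplyEvent(key string, process func()) bool {
	if key != "" && s.ProcessedKeys[key] {
		return false // duplicate: skip
	}
	process()
	if key != "" {
		if s.ProcessedKeys == nil {
			s.ProcessedKeys = make(map[string]bool)
		}
		s.ProcessedKeys[key] = true // stored forever: the problem
	}
	return true
}

func main() {
	s := &RewardsState{}
	applied := 0
	s.ApplyEvent("evt-1", func() { applied++ })
	s.ApplyEvent("evt-1", func() { applied++ }) // client retry, skipped
	fmt.Println(applied, len(s.ProcessedKeys))  // prints "1 1"
}
```

The logic is correct; the issue is the last write, which never has a matching delete.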

Every unique key ever received is stored forever. A customer earning points daily accumulates roughly 1,000 keys over three years. That’s manageable in isolation, but a buggy or adversarial client sending a unique key on every request has no bound at all. The map grows indefinitely, serialised and carried through every ContinueAsNew, inflating the workflow state on every run with data that is long past being useful.

The deeper issue is a mismatch between what deduplication actually requires and what the map provides. Deduplication needs to remember keys for as long as the source system might plausibly retry a request. For most systems that window is hours or days, not years. The map was treating a short-term guarantee as permanent storage.

The Fix: A Sliding Window Store #

The solution replaces the map with a purpose-built idempotencyStore that bounds its own size through time-based eviction:

type idempotencyRecord struct {
    ProcessedAt time.Time
}

type idempotencyStore struct {
    Keys    map[string]idempotencyRecord `json:"keys"`
    Ordered []string                     `json:"ordered"`
}

The Keys map provides O(1) lookup. The Ordered slice maintains insertion order, which is what makes time-based eviction efficient: walk from the front, stop at the first unexpired entry, and everything before that point is safe to delete.

func (s *idempotencyStore) Evict(ageDays int, now time.Time) {
    cutoff := now.AddDate(0, 0, -ageDays)

    i := 0
    for i < len(s.Ordered) {
        if s.Keys[s.Ordered[i]].ProcessedAt.Before(cutoff) {
            delete(s.Keys, s.Ordered[i])
            i++
        } else {
            break
        }
    }

    trimmed := make([]string, len(s.Ordered)-i)
    copy(trimmed, s.Ordered[i:])
    s.Ordered = trimmed
}

The copy on the final line is deliberate. Re-slicing with s.Ordered = s.Ordered[i:] would keep the old backing array alive, and with it the evicted strings, so the garbage collector could never reclaim them. The make and copy pattern moves the survivors to a fresh array and lets the old one be collected.
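The aliasing is easy to observe directly with pointer identity. This standalone sketch (not from the repo) compares the two approaches:

```go
package main

import "fmt"

func main() {
	keys := make([]string, 0, 8)
	for i := 0; i < 8; i++ {
		keys = append(keys, fmt.Sprintf("key-%d", i))
	}

	// Re-slicing: the new slice header still points into the original
	// backing array, so everything in that array stays reachable.
	resliced := keys[6:]
	fmt.Println(&resliced[0] == &keys[6]) // prints "true": same array

	// Copying: make allocates a fresh backing array, so the original
	// array (and the evicted strings) can be collected once unreferenced.
	trimmed := make([]string, len(keys)-6)
	copy(trimmed, keys[6:])
	fmt.Println(&trimmed[0] == &keys[6]) // prints "false": fresh array
}
```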

Eviction runs on every Push. Keys older than IdempotencyGuaranteeTime (3 days, reflecting realistic retry windows for the source system) are removed. The store stays bounded regardless of how long the workflow runs.

Why the Data Structure Matters #

The insertion-ordered slice is worth being explicit about. A map alone gives fast lookup but no efficient way to find old keys without scanning the entire structure. A sorted structure would work but requires a sort on every insert. An insertion-ordered slice works because deduplication keys are processed in roughly chronological order: the oldest keys are at the front, the newest at the back. Eviction is a single linear scan from the front that stops at the first unexpired entry.

graph LR
    A["Push(key)"] --> B["Check Keys map\n(O(1) lookup)"]
    B -->|Duplicate| C["Skip event"]
    B -->|New| D["Add to Keys map\nAppend to Ordered slice"]
    D --> E["Evict(ageDays)\nWalk Ordered from front\nDelete expired from Keys\nCopy trimmed slice"]

The eviction cost is O(k) where k is the number of expired keys being removed, not O(n) over the full store. For a 3-day window with daily point events, that’s at most a handful of deletions per push.

What to Take From This #

The original map was correct as written. The failure wasn't in the logic; it was in the assumption that a short-term deduplication window could be implemented as permanent storage without consequences. In a workflow that runs for years, any unbounded accumulation in state has the same effect: it gets carried through every ContinueAsNew, growing quietly until it becomes a problem.

Choosing the right window for ageDays is a product decision as much as a technical one. It should reflect the actual retry behaviour of the source system, not a conservative guess. Too short and you lose deduplication coverage; too long and you’re back to unbounded growth by another name. Three days is a starting point, not a settled answer. In production it’s worth monitoring retry patterns from the source system and adjusting the window as that data accumulates.


Next: workflow versioning. Every deployment of a long-running workflow is a potential non-determinism error, and the baseline marker has to be in place before the first deployment, not after.