Temporal Entity Workflows

ContinueAsNew Is a Blunt Instrument

  ·  4 min read

ContinueAsNew ends a workflow run atomically. Any signals sitting in a channel buffer that haven’t been consumed yet don’t transfer to the new run. This post is about what happens when you don’t account for that, and how to fix it.

The full implementation is at rikdc/temporal-entity-workflow-demo.

What ContinueAsNew Does (and Doesn’t Do) #

When the workflow returns a ContinueAsNewError, the current run ends. A new run starts immediately with the same workflow ID, a fresh event history, and the state you passed as input. From the outside it’s seamless: the workflow ID is stable, signals still route correctly, queries still work. It looks like one continuous execution.

What it doesn’t do is carry over anything in-flight in the current run. Pending timers are cancelled. Any signals sitting in a channel buffer that haven’t been consumed yet are not transferred to the new run. The new run starts clean, and those signals are gone.

The Temporal Go SDK documentation is direct about this¹: before using ContinueAsNew, you must drain signal channels asynchronously. If you don’t, signals will be lost. There’s no error, no warning, and no recovery path.

The Bug #

The original implementation triggered ContinueAsNew immediately after processing a single event, whenever EventCount hit a multiple of 200:

selector.Select(ctx)  // processes ONE signal

if state.EventCount%200 == 0 {
    return workflow.NewContinueAsNewError(ctx, RewardsWorkflow, state)
}

selector.Select unblocks when one registered channel has a message ready, executes its handler for that one message, and returns. It does not drain the channel. If three signals arrived while an activity was executing, Select processes the first one and leaves the other two sitting in the buffer.

Consider a burst of point-earning activity: a customer racks up events quickly, signals queue in the buffer, and the workflow processes them one at a time. When EventCount hits 200, the code fires ContinueAsNew immediately. Any signals still in the buffer are abandoned. The new run starts with those point events gone, and nothing in the system indicates anything went wrong.

For a rewards program, the consequence is concrete: points silently disappear.

The Fix #

The solution is to drain all pending signals from every channel before handing control to ContinueAsNew. The non-blocking ReceiveAsync is the right tool here: it returns false immediately if the channel is empty, making it safe to call in a loop without blocking the workflow.

if state.EventCount%200 == 0 && unenrollCh.Len() == 0 {
    for {
        var buffered PointEvent
        if !pointsCh.ReceiveAsync(&buffered) {
            break
        }
        applyPointEvent(ctx, &state, buffered, info.WorkflowExecution.ID)
    }

    return workflow.NewContinueAsNewError(ctx, RewardsWorkflow, state)
}

The drain loop walks the points channel until it’s empty, applying each buffered event to state before the run ends. Those events are now encoded into the state struct passed to the new run, so nothing is lost across the boundary.

The Unenroll Guard Matters Too #

The unenrollCh.Len() == 0 condition on the outer check is doing something equally important, and it’s easy to miss. If an unenroll signal is pending when ContinueAsNew fires, the new run starts as if the customer is still enrolled: it re-initialises its active state, and the unenroll event is gone. The customer has been silently re-enrolled.

The guard prevents this by skipping ContinueAsNew entirely if an unenroll is pending. The main loop picks up the unenroll on the next Select, handles it properly, and the workflow terminates cleanly.

This is the broader point. The drain logic needs to account for every channel the workflow listens on, not just the primary one. A signal arriving on any channel at the wrong moment can cause the same class of problem.

The Race, Visualised #

sequenceDiagram
    participant Client
    participant Temporal
    participant Workflow

    Client->>Temporal: Signal: add-points (x3)
    Temporal->>Workflow: Deliver signal 1
    Workflow->>Workflow: selector.Select() processes signal 1
    Note over Workflow: EventCount hits 200
    Workflow->>Temporal: ContinueAsNew
    Note over Temporal: Signals 2 and 3 abandoned
    Temporal->>Workflow: New run starts (signals 2 and 3 gone)

With the fix in place, the sequence changes: after EventCount hits 200, the workflow drains the remaining signals into state before calling ContinueAsNew, so the new run receives all three point events encoded in its initial state.

What to Take From This #

  • selector.Select processes one ready event per call. It does not drain the channel.
  • ReceiveAsync is non-blocking and is the correct tool for draining before ContinueAsNew.
  • Signal loss at the ContinueAsNew boundary produces no error. The workflow continues, the data is gone, and nothing tells you.
  • Every channel the workflow receives on needs to be drained, not just the primary one. Signals on secondary channels at the wrong moment cause the same failure.

The fix is a loop and a guard check. The cost of missing it is customer-visible data loss that’s invisible to the system.


Next: timers have the same boundary problem. A 365-day inactivity timer set in one run doesn’t survive into the next, which means a customer inactive for 364 days can silently get a fresh year’s grace period.


  1. Temporal Go SDK documentation: Workflow message passing — “Before completing the Workflow or using Continue-As-New, make sure to do an asynchronous drain on the Signal channel. Otherwise, the Signals will be lost.”