A long agent turn is not the same animal as a long HTTP request. An HTTP request either succeeds or fails and you retry it. An agent turn can be ten tool calls deep, seven minutes in, and have already written files, executed builds, spawned subprocesses, and negotiated approvals with the user. If the daemon running that turn dies at minute nine, you don't get to "retry" — the side effects already happened. The only useful question is: what survives the crash?
This post is about the mid-turn checkpoint design we built into ShanClaw after enough agent stalls and lost turns to justify a dedicated architectural pass. ShanClaw is our Go-based agent runtime — daemon-served, MCP-compatible, native to the Kocoro platform — and it runs long-horizon turns as a matter of course.
Why "Retry the Turn" Is the Wrong Answer
Most agent frameworks treat turn atomicity the way a naive HTTP client treats request atomicity: if the process dies, throw away in-memory state and restart from the last persisted point. For a stateless RPC call, that works. For an agent turn, it's radioactive.
Three reasons:
- Tool calls have side effects. The tool that ran at minute 3 created a branch, deleted a file, or POSTed to a webhook. The crash at minute 9 doesn't undo any of it. Restarting the turn from scratch re-executes tools that already ran, usually with stale arguments.
- Context is expensive. Seven minutes of LLM output, including five summaries and two reactive compactions, represents real token spend. Throwing it away and restarting isn't just slow — it's paying the bill twice.
- User expectations matter. The user saw the tool output at minute 3. The daemon restart shouldn't silently "un-run" what the user already watched.
The only defensible strategy is: persist partial turn state at moments when the work is durably meaningful, then resume from there. That's what mid-turn checkpointing is — and what it isn't is a timer-driven "save every N seconds."
A turn is not uniformly snapshot-able. Saving mid-LLM-stream produces garbage. Saving mid-tool-execution captures a run in an indeterminate state. Checkpoints have to fire on the right events, which means the loop needs an explicit notion of what events are.
Phases: An Explicit Turn State Machine
The core design move is to codify the turn as a phase state machine. Every blocking boundary in the loop declares which phase it's in. The watchdog, the checkpoint hook, and the recovery policy all read from the same model.
Eight active phases plus one transient (InjectingMessage, fired mid-loop when the user types a new message), all connected by a single transition graph.
Only two phases count as idle — AwaitingLLM and ForceStop. Those are the ones where the loop is strictly waiting on a remote LLM response with no local work to do. The others all have their own blocking owners that already enforce bounds: ExecutingTools is bounded by per-tool timeouts, AwaitingApproval is bounded by the approval broker's own five-minute cap, RetryingLLM is a fixed backoff sleep.
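For concreteness, here is a sketch of the phase enum, limited to the phases this post names; the real type has eight active phases plus the transient one, and the comments summarize the bounding owners described above:

```go
package main

// TurnPhase sketch: only the phases named in this post.
// The remaining active phases are elided.
type TurnPhase int

const (
	PhaseAwaitingLLM TurnPhase = iota // idle: waiting on a remote LLM response
	PhaseExecutingTools               // bounded by per-tool timeouts
	PhaseAwaitingApproval             // bounded by the approval broker's own cap
	PhaseRetryingLLM                  // fixed backoff sleep
	PhaseForceStop                    // idle: waiting on the stop path
	PhaseInjectingMessage             // transient: user typed mid-loop
)
```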
This is the key asymmetry: in a bump-on-event model (the alternative), every blocking caller has to remember to bump the liveness timer. Miss one and you get a silent stall. In the phase model, the watchdog reads the phase and automatically stops counting when the loop enters a non-idle phase — no active remembering required. Adding a new tool type doesn't require adding a new bump point.
In Go, this is literally one predicate:
// internal/agent/phase.go
func (p TurnPhase) CountsAsIdle() bool {
	return p == PhaseAwaitingLLM || p == PhaseForceStop
}
Two phases, no exceptions. The rest of the system — watchdog, checkpoint hook, UI status — reads from this single source of truth.
Checkpoint Boundaries: Not Every Transition
With an explicit phase model, "when should we checkpoint?" becomes answerable. Not every phase exit is durable progress. Some exits fire on every LLM tick and would thrash the session store; others fire on edges where the work is actually persistable.
The design declares three checkpoint-worthy transitions and explicitly rejects the rest. Each declared site comes with an idempotency rule: a checkpoint rebuilds the persisted transcript from the loop's canonical message list rather than appending a diff.
The idempotency rule is the one that takes the most restraint to write, because diff-append looks cheaper. It isn't cheaper, and it's wrong: reactive compaction can shrink the run message list, so an append-only model drifts out of sync the first time compaction fires. Rebuilding from the loop's canonical list on every checkpoint is the only correct path. Cost is trivial — compaction is already the expensive part.
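A deliberately simplified sketch of the replace-not-append shape (the type and function names here are hypothetical, and the real metadata rebuild is subtler, as discussed later):

```go
package main

import (
	"fmt"
	"time"
)

type Message struct{ Role, Content string }
type MessageMeta struct{ Time time.Time }

type Session struct {
	Messages    []Message
	MessageMeta []MessageMeta
}

// checkpointMessages rebuilds the persisted transcript from the loop's
// canonical list. Replace, never append: reactive compaction can shrink
// the canonical list, so append-only persistence drifts out of sync.
func checkpointMessages(sess *Session, canonical []Message) {
	sess.Messages = make([]Message, len(canonical))
	copy(sess.Messages, canonical)

	// Keep metadata index-aligned with Messages: truncate on shrink,
	// backfill on growth. (The real rebuild also re-derives per-index
	// metadata so pre-turn timestamps survive; elided here.)
	if len(sess.MessageMeta) > len(sess.Messages) {
		sess.MessageMeta = sess.MessageMeta[:len(sess.Messages)]
	}
	for len(sess.MessageMeta) < len(sess.Messages) {
		sess.MessageMeta = append(sess.MessageMeta, MessageMeta{Time: time.Now()})
	}
}

func main() {
	sess := &Session{}
	checkpointMessages(sess, []Message{{"user", "do thing"}, {"assistant", "[tool_use]"}})
	fmt.Println(len(sess.Messages), len(sess.MessageMeta)) // 2 2

	// Compaction shrank the canonical list; the rebuild tracks it.
	checkpointMessages(sess, []Message{{"user", "summary"}})
	fmt.Println(len(sess.Messages), len(sess.MessageMeta)) // 1 1
}
```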
Three checkpoint sites. Not a timer. The daemon could run for twenty minutes between checkpoints on a tool-heavy turn; that's fine, because the next tool batch completion will fire one, and the state between tool calls is reconstructable from the message history anyway.
The Crash Test That Matters
The test that ultimately justifies this entire design is simple to state and brutal to fail: SIGKILL the daemon between a mid-turn checkpoint and the next one. Reload from disk. The partial state must be there. internal/daemon/checkpoint_crash_test.go:
// Mid-turn checkpoint fires: simulates tool batch completion.
agent.SetRunMessagesForTest(loop, []client.Message{
	{Role: "user", Content: client.NewTextContent("do thing")},
	{Role: "assistant", Content: client.NewTextContent("[tool_use]")},
	{Role: "user", Content: client.NewTextContent("[tool_result payload]")},
})
applyTurnState(sess, loop, nil, base)
sess.InProgress = true
mgr.Save() // checkpoint to disk

// --- DAEMON CRASHES HERE. No final save. ---
sessionID := sess.ID
mgr.Close() // drops in-memory state

// --- Recovery: reload manager + session from disk. ---
mgr2 := session.NewManager(dir)
reloaded, _ := mgr2.Load(sessionID)

// 1. InProgress flag survives on disk.
// 2. Partial transcript is preserved (baseline + tool batch).
// 3. MessageMeta tracks messages (no drift).
Three assertions:
- reloaded.InProgress is true. If this flag doesn't survive disk, the UI has no way to distinguish "crashed mid-turn" from "completed normally."
- len(reloaded.Messages) == 3. The baseline user message plus the two messages written at the mid-turn checkpoint (assistant [tool_use] and user [tool_result payload]).
- len(reloaded.MessageMeta) == len(reloaded.Messages). Metadata arrays stay in sync with the message array. Drift here is the bug category where "a tool was run" loses its timestamp and source attribution on reload.
The test that failed the most times during development was the metadata drift one. Rebuilding Messages from canonical loop state is the easy part; rebuilding MessageMeta to match, without losing pre-turn metadata like the original user-message timestamp, took three iterations to get right. This is one of the quiet reasons the idempotency rule matters: MessageMeta is tracked per-index against Messages; any drift between the two is a data-integrity bug that only surfaces on crash-recovery, i.e., never in normal testing.
The companion test (TestCheckpoint_ResumeAfterCrash_FinalSaveClears) asserts the reverse: once the resumed session completes a clean turn, InProgress must flip to false. The flag has to be a reliable signal, not a sticky one-way marker.
The Watchdog: Liveness, Not Restarts
The checkpoint tells you what survives a crash. The watchdog tells you when a turn is stuck and should become a crash.
One goroutine per turn. Ticks every five seconds. Reads (phase, timeInPhase) from the tracker:
for range time.Tick(5 * time.Second) {
	phase, timeInPhase := tracker.Snapshot()
	if !phase.CountsAsIdle() {
		continue // phase owns its own bounding
	}
	if timeInPhase >= softTimeout && !softFired {
		softFired = true
		emitRunStatus("idle_soft", friendlyLabel(phase))
	}
	if hardTimeout > 0 && timeInPhase >= hardTimeout {
		cancel(ErrHardIdleTimeout) // cancel cause; LLM call aborts, flow falls through
		return
	}
}
Soft timeout (default 90s): emit a UI status event ("still waiting on LLM for 90s…") but take no destructive action. Gives the user signal without killing the turn.
Hard timeout (default 0 — disabled at first rollout): cancel the LLM call, let the recovery dispatcher take over. The default is disabled because a hard-cancel in production is a destructive action; it should be flipped on per-agent or per-user after a dogfood window.
Rearm on transition: entering a new phase resets timeInPhase and clears softFired. The next LLM call gets its own fresh budget. This is a natural consequence of the phase model — nobody has to remember to reset the timer, because transitions are the only thing the watchdog observes.
Recovery: One Dispatch Point
Before this work, turn recovery decisions lived in roughly five places inside Run(): one for retryable LLM errors, one for context-length errors, one for loop detector verdicts, one for nudge exhaustion, one for the explicit stop path. Each place had slightly different state checks. Adding a new recovery condition meant auditing all five.
The phase design folds them into one predicate:
type recoveryAction int

const (
	actionRetryLLM recoveryAction = iota
	actionCompactThenRetry
	actionNudgeThenContinue
	actionForceStop
	actionAbort
)

func decideRecovery(phase TurnPhase, err error, state *loopState) recoveryAction {
	switch {
	case errors.Is(err, ErrHardIdleTimeout):
		return actionAbort
	case isContextLengthError(err) && !state.reactiveCompacted:
		return actionCompactThenRetry
	case isRetryableLLMError(err) && state.retryAttempt < 3:
		return actionRetryLLM
	case state.loopDetector.ShouldStop():
		return actionForceStop
	case state.nudgeCount < 3:
		return actionNudgeThenContinue
	default:
		return actionAbort
	}
}
The ordering is the semantics. First matching case wins, so the sequence itself encodes priority: hard-timeout > context-compaction > retryable > loop-stop > nudge > abort. Re-ordering the cases changes the recovery policy, which is exactly what you'd want when changing the recovery policy.
The important property isn't that the switch statement is short. It's that adding a new recovery condition — say, a new "rate-limit backoff" action — now happens in exactly one place. Whoever adds it has to think about its relative priority against the five existing cases; they can't accidentally slip it in under one branch of a conditional and forget the others.
Five scattered decisions become one predicate. One predicate is testable.
What This Doesn't Solve
Honest list. Checkpointing is not a universal save button. Four gaps the current design explicitly leaves open:
- In-flight streaming tokens are not preserved. A checkpoint fires between phases. The tokens streaming from the LLM at the moment of crash are lost. On recovery, the resume has to request a fresh LLM call — it cannot "pick up where the stream left off." This is probably fine: those tokens hadn't been acted on anyway.
- Tool side effects outside ShanClaw are untouched. If the tool that ran at minute three created a GitHub issue, the issue still exists on restart. Checkpointing persists the agent's record of that side effect, not the side effect itself. The responsibility for idempotent tool calls lives in the tool, not the loop.
- Subagent/cloud-delegate turns are checkpointed at the parent granularity. A parent turn that dispatches a 30-minute cloud agent will checkpoint after the delegate returns, not during. If the cloud agent itself dies, that's a separate recovery story owned by the cloud runtime.
- Checkpoint conflicts with concurrent user input. If the user types a new message in the middle of a reload, the resume has to reconcile "what was saved pre-crash" with "what the user is adding post-reload." Current design: the pre-crash transcript wins as the baseline; new input appends. Not elegant, but deterministic.
Documenting these in the design doc, not hiding them, was a deliberate choice. A checkpoint system that silently papers over the first three gaps is worse than one that fires loudly.
The Lesson
Every long-running agent framework eventually has to answer: what happens if the process dies mid-turn?
The wrong answer is "implicit liveness" — timers scattered across blocking calls, turn restarts on failure, append-only persistence. It works until it doesn't, and the failures are exactly the cases you can't reproduce in testing.
The right answer is an explicit phase model. Every blocking boundary declares its phase. The watchdog reads phases, not timer bumps. Checkpoints fire on phase transitions, not wall-clock ticks. Recovery dispatches from a single predicate. When all four subsystems read from the same model, they stay aligned; when any one of them has its own parallel notion of turn state, they drift.
If you're writing an agent runtime and haven't named your turn phases yet, that's the work to do before any of the rest makes sense. The code to add is not large — maybe 200 lines of Go across three files. The architectural move is naming the phases and committing to the contract.
The first invariant to audit in any agent loop: does its crash recovery depend on pristine in-memory state, or does it depend on something that survived mgr.Close()? If the former, you don't have recovery. You have a hope.