Turn on Anthropic's prompt cache. Your cost per cached request drops by up to 90%. It feels like free money. Two weeks later you ship a patch, your hit rate quietly falls to zero, and you only notice when the monthly bill arrives.
The problem isn't the cache. It's that almost nothing else in modern software engineering is byte-fragile. Logs tolerate whitespace; JSON tolerates key reordering; gRPC tolerates field additions. Prompt caches tolerate nothing. A single trailing newline in a system prompt is functionally equivalent to deleting the cache entry.
This post is about what it takes to keep a prompt cache alive as the codebase evolves — specifically, the byte-stability testing discipline we built into Shannon after enough near-misses to justify a dedicated test suite.
Why Prompt Caching Is Byte-Fragile
Anthropic's prompt cache keys on a prefix hash. When you send a request with cache_control blocks, the provider computes a hash over the bytes of the prefix up to each breakpoint. A cache hit requires every one of those bytes to match a prior write, exactly.
"Exactly" is doing heavy lifting in that sentence. Consider two requests that a human would call identical:
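As an illustration (these payloads are invented, not Shannon's actual request bodies), here are two requests that differ by a single trailing newline. Hashing their serialized bytes, as a stand-in for the provider's prefix hash, shows why the cache treats them as strangers:

```python
import hashlib
import json

# Hypothetical request prefixes -- identical except for one trailing newline
req_a = {"system": "You are a helpful assistant.", "tools": ["web_search"]}
req_b = {"system": "You are a helpful assistant.\n", "tools": ["web_search"]}

def prefix_hash(request: dict) -> str:
    """Hash the serialized prefix bytes; a toy model of the provider's cache key."""
    raw = json.dumps(request, sort_keys=True).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

print(prefix_hash(req_a) == prefix_hash(req_b))  # False: one byte apart, zero cache overlap
```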
Two requests, one byte apart. The cache treats them as unrelated. Each miss pays the 1.25× cache-write premium on input tokens — every call, until someone notices on the monthly invoice.
The drift sources are unglamorous and numerous:
- Tool description edits. The most common by a wide margin. Someone updates description: "Search the web" to description: "Search the web for factual information". Every active session misses cache on the next turn.
- Tool ordering. A new tool gets registered; its list position comes from the order of some dict's .keys(). The tool block in the prefix re-orders, and because the prefix hash depends on the byte order of the entire tool array, a reorder invalidates every byte from that point forward.
- Whitespace in templates. A strip() added (or removed) in a prompt template. A \n at the end of a Jinja block.
- Timestamp or session ID injection into the system prompt "for debugging."
- Model changes. Upgrading claude-sonnet-4-6 to claude-sonnet-4-7 swaps cache pools — old sessions miss on their next return.
Each of these reads as a trivial one-line code change. Each is a cache-invalidation event for every live session touching that path.
The Fan-In Problem
A single system doesn't have one place that composes prompts. It has dozens.
Shannon's Python LLM service currently has around twenty distinct call sites that invoke the Anthropic provider — each one labeled with a cache_source string that doubles as a cache-TTL routing hint (anthropic_provider.py).
The real threat model isn't malicious bugs. It's ordinary ones. A refactor in tool_select that changes how tools are serialized breaks cache for every agent that ever reaches the tool-selection loop — and code review is about logic, not byte equality.
Since every path funnels through the same serialization boundary, that's where the defense has to live: one place that enforces invariants on every payload, regardless of which PR touched which caller.
Four Patterns That Keep the Prefix Stable
Shannon enforces cache stability through four complementary patterns, each with tests that fail in CI if the invariant breaks.
1. Tool Schema Freeze
Once a tool set is built for the first time, Shannon freezes the resulting schemas by name-set. Subsequent calls with the same tool names return the frozen copy, even if descriptions have drifted mid-session. From anthropic_provider.py:600:
key = str(sorted(
    (f.get("name") or (f.get("function") or {}).get("name", "") or "",
     bool(f.get("defer_loading") or ...))
    for f in functions
))
if self._frozen_tools and self._frozen_tools_key == key:
    return [dict(t) for t in self._frozen_tools]
The test that enforces this is blunt:
def test_same_tools_return_frozen_copy(self):
    tools1 = provider._convert_functions_to_tools(functions)
    tools2 = provider._convert_functions_to_tools(functions_v2)  # same names, drifted descriptions
    assert tools1[0]["description"] == tools2[0]["description"]
If someone refactors the function and forgets to preserve the freeze, this test fails loudly in CI. Not at 3 AM on the cost dashboard.
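The excerpts above come from the real provider; for porting the idea elsewhere, here is a self-contained sketch with a deliberately simplified key (name set only — Shannon's real key also folds in deferred-loading flags). The class and method names are placeholders:

```python
class ToolFreezer:
    """Pin serialized tool schemas to their name set; a simplified sketch."""

    def __init__(self):
        self._frozen_key = None
        self._frozen_tools = None

    def convert(self, functions: list) -> list:
        key = str(sorted(f["name"] for f in functions))
        if self._frozen_tools is not None and self._frozen_key == key:
            # Same name set as the first build: return the frozen copy,
            # ignoring any mid-session description drift.
            return [dict(t) for t in self._frozen_tools]
        tools = [{"name": f["name"], "description": f.get("description", "")}
                 for f in functions]
        self._frozen_key = key
        self._frozen_tools = tools
        return [dict(t) for t in tools]

freezer = ToolFreezer()
v1 = freezer.convert([{"name": "web_search", "description": "Search the web"}])
v2 = freezer.convert([{"name": "web_search", "description": "Search the web for facts"}])
print(v1[0]["description"] == v2[0]["description"])  # True: drift frozen out
```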
2. Deterministic Tool Ordering
Tools are sorted by name before serialization. Always. This makes the tool block in the prefix invariant to registration order, Python dict iteration, or any other source of non-determinism:
tools.sort(key=lambda t: t["name"])
One line. Gated by test_tools_sorted_by_name. A future contributor who adds tools in "logical grouping order" will break the test and have to justify why they're undoing the stability guarantee.
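A minimal demonstration of why the sort matters (the tool schemas here are invented for illustration): two different registration orders serialize to identical bytes once sorted by name:

```python
import json

# Hypothetical tool schemas registered in two different orders
order_a = [{"name": "web_search", "description": "Search"},
           {"name": "calculator", "description": "Math"}]
order_b = [{"name": "calculator", "description": "Math"},
           {"name": "web_search", "description": "Search"}]

def serialize(tools: list) -> str:
    tools = sorted(tools, key=lambda t: t["name"])  # the one-line stability guarantee
    return json.dumps(tools, sort_keys=True)

print(serialize(order_a) == serialize(order_b))  # True: registration order no longer leaks into the prefix
```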
3. Cache-Break Detector
Every provider instance carries a CacheBreakDetector that snapshots (system_text, tool_names, model) at each request and diffs consecutive calls. When it detects a break, it logs a structured event — {"changed": ["system"], "tools_added": [...], "tools_removed": [...]} — so the break is visible in telemetry before it's visible on the bill.
This is observability, not prevention. But it turns "we noticed the bill was high" into "we saw the break in the logs three hours after the deploy and rolled back."
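The detector itself isn't shown in the post; a minimal sketch of the snapshot-and-diff idea might look like this (class shape and field names are assumptions, not Shannon's actual code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheBreakDetector:
    """Snapshot-and-diff sketch of the idea; not Shannon's implementation."""
    _last: Optional[tuple] = None

    def check(self, system_text: str, tool_names: list, model: str) -> Optional[dict]:
        snapshot = (system_text, tuple(tool_names), model)
        event = None
        if self._last is not None and snapshot != self._last:
            prev_system, prev_tools, prev_model = self._last
            changed = [name for name, old, new in (
                ("system", prev_system, system_text),
                ("model", prev_model, model),
            ) if old != new]
            event = {
                "changed": changed,
                "tools_added": sorted(set(tool_names) - set(prev_tools)),
                "tools_removed": sorted(set(prev_tools) - set(tool_names)),
            }
        self._last = snapshot
        return event  # caller emits this as a structured log line

detector = CacheBreakDetector()
detector.check("v1 system prompt", ["web_search"], "claude-sonnet-4-6")
print(detector.check("v2 system prompt", ["web_search", "calculator"], "claude-sonnet-4-6"))
# {'changed': ['system'], 'tools_added': ['calculator'], 'tools_removed': []}
```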
4. TTL-Aware Serialization Tests
The cache-control blocks themselves need byte stability. Shannon's tests assert, at the wire level, that a request with the same inputs produces the same serialized payload — down to the cache_control placement and the ephemeral TTL marker.
This is the test that matters most. It's also the one that looks the most boring in a review: a golden-file comparison of a request body. Boring tests catch the interesting bugs.
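A hedged sketch of the invariant such a test pins down (the builder function and payload shape are placeholders, not Shannon's actual serializer; a real suite compares the digest against a checked-in golden value):

```python
import hashlib
import json

def build_request_body(system: str, tools: list) -> bytes:
    """Stand-in serialization boundary: deterministic key order, sorted tools."""
    payload = {
        "system": [{"type": "text", "text": system,
                    "cache_control": {"type": "ephemeral", "ttl": "5m"}}],
        "tools": sorted(tools, key=lambda t: t["name"]),
    }
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")

# Same inputs, same bytes -- regardless of tool registration order.
body = build_request_body("You are helpful.", [{"name": "web_search"}, {"name": "calculator"}])
permuted = build_request_body("You are helpful.", [{"name": "calculator"}, {"name": "web_search"}])
print(hashlib.sha256(body).hexdigest() == hashlib.sha256(permuted).hexdigest())  # True
```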
The Companion Problem: Not Every Cache Write Is Worth It
Byte stability keeps the cache alive. The second question is whether you want the write in the first place.
Anthropic's pricing: writing to the 5-minute cache costs 1.25× input tokens. Writing to the 1-hour cache costs 2× input tokens. Reads are 0.1× in both cases. The 1-hour option only pays off if the prefix will actually be re-read within the hour.
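The trade-off is pure arithmetic. A toy expected-cost model, using the multipliers above and assuming each subsequent turn re-sends the same prefix and hits the cache (`reads` is the number of hits before the entry expires):

```python
def cached_cost(write_mult: float, reads: int) -> float:
    """Relative input-token cost for one cache write plus `reads` cache hits."""
    return write_mult + 0.1 * reads

def uncached_cost(reads: int) -> float:
    """Without caching, every request pays full price."""
    return 1.0 * (1 + reads)

# One-shot webhook: the write premium is pure loss
print(cached_cost(1.25, 0), uncached_cost(0))   # 1.25 vs 1.0

# Interactive chat, 10 turns inside the window: the 1-hour write pays for itself
print(cached_cost(2.0, 10), uncached_cost(10))  # 3.0 vs 11.0
```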
For most AI-agent workloads, the answer is "usually not":
- A cron-triggered webhook processor: one shot, prefix discarded.
- An internal subagent spawn: runs once per parent turn, then gone.
- A single-use classification endpoint: cache write burns money with no reader.
For some paths, the answer is "absolutely":
- An interactive terminal (TUI) where a user is thinking between turns.
- A chat channel (Slack, LINE, Feishu, Telegram) with typical human pacing.
- A long-running CLI agent where the same base prompt amortizes across many turns.
Shannon routes this decision from a single table in anthropic_provider.py:
_LONG_CACHE_SOURCES = frozenset({
    "slack", "line", "feishu", "lark", "telegram",
    "tui", "shanclaw", "oneshot_interactive", "cache_bench",
})

def _ttl_block(request) -> Optional[Dict[str, str]]:
    # 1. Operator escape hatch
    force = os.environ.get("SHANNON_FORCE_TTL", "").strip().lower()
    if force == "off": return None
    if force == "5m": return CACHE_TTL_SHORT
    if force == "1h": return CACHE_TTL_LONG
    # 2. Source-based routing
    src = (getattr(request, "cache_source", None) or "").strip().lower()
    if src in _LONG_CACHE_SOURCES:
        return CACHE_TTL_LONG
    return CACHE_TTL_SHORT
Three routing outcomes, each with a principled rationale:
Known interactive source → 1-hour TTL. These are the paths from the "absolutely" list above: human pacing means the prefix is re-read well past the 5-minute window, so the 2× write premium amortizes across the session.
Unknown source → 5-minute TTL. This is the "fail cheap" default: if the call happens to be in a cache-friendly path the write gets amortized inside 5 minutes; if it doesn't, the premium is 1.25× instead of 2×. You lose a little on the upside, you lose much less on the downside. The default is aligned with the asymmetry of the actual cost structure.
The operator escape hatch (SHANNON_FORCE_TTL=off) exists for the case every infrastructure engineer eventually encounters: we need to turn this off globally, right now, from a flag, without a deploy.
The Lesson
Prompt caches are infrastructure, not optimization.
Treating them as an optimization is what produces the "surprise cost spike after a routine PR" story. The implicit mental model is: I write my code, the cache happens, if it doesn't happen the worst case is I pay full price. That model is wrong in both directions. First, the cache stops happening silently — there's no error, no exception, no test failure, just a slow degradation of the hit rate. Second, the worst case isn't paying full price; it's paying more than full price, because you're now paying the write premium (1.25× or 2×) on every miss.
Treating prompt caches as infrastructure means:
- The prompt prefix is an API contract. Changing it is a versioning event, not a code edit.
- Byte stability is a CI concern. Tests fail when drift happens, in the PR that caused it, before merge.
- TTL routing is a policy layer. The right TTL depends on the call pattern, not the engineer's optimism about reuse.
- The observability story is primary, not an afterthought. You want to see drift in structured logs, not on a finance dashboard.
None of this is exotic. It's the same discipline that applies to database schemas, public API surfaces, and serialization formats. Prompt caches just happen to be the newest entry in the "byte-stable contracts" category, and the field is still collectively learning to treat them that way.
If you're shipping prompt caching today, start with three primitives:
- Tool schema freeze — pin the serialized schema to the tool name set, immune to description drift.
- Deterministic tool ordering — sort by name, always, regardless of registration order.
- Byte-snapshot test of the wire payload — one golden-file test that compares the exact serialized request body. Fails in CI when anything shifts, before the bill does.
Everything else — TTL routing, break detectors, source tables — layers on top.
The test file worth copying first is tests/test_anthropic_cache.py in Shannon. Nearly 1,000 lines for a prompt-cache test suite feels absurd until the first time it catches a drift that would have cost four figures.