The AI Agent Harness: How ShanClaw Evolved After Claude Code Went Public

April 4, 2026

"Harness" is the word everyone in AI engineering is reaching for right now. It means the orchestration layer that wraps a foundation model — the tool dispatch loop, permission system, context management, loop detection — everything that isn't the model weights. Claude Code popularized the term. OpenAI's Codex CLI validated it. And the takeaway from both is the same: a thin, well-designed harness with a strong model outperforms a complex harness with a weaker one.

I didn't learn this from Claude Code. I learned it by building ShanClaw — a Go agent runtime that connects to Shannon Cloud for multi-agent delegation. ShanClaw's harness was already running in production — processing messages from Slack, LINE, and Telegram, executing Mac file ops and GUI automation — before Claude Code's source became public.

When the source leaked, I mapped every subsystem into a detailed diagram — the query loop, the 5-layer context compaction, the 4-way permission race, the tool runtime with 40+ tools. Some patterns matched what I'd already built. Others — especially around prompt caching and tool schema ordering — taught me things I immediately brought back to ShanClaw.

What follows is the comparison. Not which is "better" — they solve different problems. But for anyone building agent runtimes, understanding where two independent teams converged (and diverged) reveals which patterns are fundamental.


What a Harness Actually Does

Strip away the marketing and an agent harness has five jobs:

  1. System prompt assembly — Build context from static rules, session state, and per-turn volatiles
  2. Tool dispatch — Parse model output for tool calls, execute them, feed results back
  3. Permission enforcement — Decide what the agent can do without human approval
  4. Context management — Keep the conversation within the context window as it grows
  5. Loop detection — Catch the agent when it's stuck repeating itself

Claude Code and ShanClaw both implement all five. The differences are in the details.


The Query Loop: Message-Driven vs. Iteration-Capped

Claude Code's core is query.ts — a streaming loop that sends messages to the Anthropic API, detects tool_use blocks in the response, executes them, appends tool_result, and loops until the model returns pure text. The loop is message-driven: it continues as long as the model keeps requesting tool calls.

ShanClaw's loop.go takes a different approach: iteration-capped with a default limit of 25 (higher for GUI tasks). Each iteration does more work — draining injected mid-run messages, polling for temporal state changes (date rollover detection), filtering old screenshots, compressing stale tool results before the API call. The cap exists because ShanClaw runs as a daemon processing messages from Slack, LINE, and Telegram. An uncapped loop on a message from a Slack channel could burn through your API budget while you sleep.

The architectural lesson: if your harness serves interactive users, let the model drive. If it serves autonomous channels, cap the iterations.
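
The iteration-capped shape can be sketched in a few lines of Go. This is a minimal stand-in, not ShanClaw's actual loop.go: runLoop, modelTurn, and the injected callbacks are all hypothetical names, and the real loop also drains injected messages and compresses stale results each iteration.

```go
package main

import "fmt"

// modelTurn is a simplified stand-in for one API response.
type modelTurn struct {
	toolCalls []string // names of tools the model asked to run
	text      string   // final text when no tools are requested
}

const maxIterations = 25 // ShanClaw's default cap; GUI tasks raise it

// runLoop drives the agent until the model stops requesting tools or
// the iteration cap is hit. callModel and execTool are injected so the
// sketch stays self-contained.
func runLoop(callModel func(iter int) modelTurn, execTool func(name string) string) (string, bool) {
	for i := 0; i < maxIterations; i++ {
		turn := callModel(i)
		if len(turn.toolCalls) == 0 {
			return turn.text, true // model finished on its own
		}
		for _, name := range turn.toolCalls {
			_ = execTool(name) // results would be appended as tool_result messages
		}
	}
	return "", false // cap reached: force a final response instead of spinning
}

func main() {
	// Model requests a tool twice, then answers.
	text, done := runLoop(func(i int) modelTurn {
		if i < 2 {
			return modelTurn{toolCalls: []string{"read_file"}}
		}
		return modelTurn{text: "done"}
	}, func(string) string { return "ok" })
	fmt.Println(text, done)
}
```

The second return value is the difference between the two designs: a message-driven loop like Claude Code's only ever exits the first way.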


Tool Registration: Static vs. Layered

Claude Code registers tools statically in tools.ts — a flat list with feature gates and stable ordering for prompt cache friendliness. Simple, effective, cache-optimized.

ShanClaw uses layered registration in three stages:

  1. Local tools — 28 built-in tools (file ops, shell, GUI automation) registered at daemon start
  2. MCP tools — External capabilities connected with a 45-second timeout, with conflict resolution (Playwright MCP connecting removes legacy browser tools)
  3. Gateway tools — Allowlisted Shannon Cloud tools (web search, analytics, financial data)

Each agent run gets a CloneWithRuntimeConfig() deep copy so mutable tools (Bash, CloudDelegate) don't race across concurrent daemon sessions. Per-agent tool filters apply last.

Why the complexity? ShanClaw runs 5 concurrent agent sessions. A Slack message and a Telegram message hitting the daemon simultaneously need isolated tool state. Claude Code doesn't have this problem — it's one user, one session.
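
The per-run deep copy is the load-bearing piece. A rough sketch of the idea, borrowing the CloneWithRuntimeConfig name from above but assuming everything else about the types:

```go
package main

import "fmt"

// Tool carries per-run mutable state (think Bash's working directory),
// so concurrent daemon sessions must never share instances.
type Tool struct {
	Name string
	Cwd  string // mutable runtime state
}

type Registry struct {
	tools map[string]*Tool
}

// CloneWithRuntimeConfig returns a deep copy so one session's mutations
// never leak into another's. The real signature is an assumption.
func (r *Registry) CloneWithRuntimeConfig(cwd string) *Registry {
	c := &Registry{tools: make(map[string]*Tool, len(r.tools))}
	for name, t := range r.tools {
		cp := *t // copy the struct, not the pointer
		cp.Cwd = cwd
		c.tools[name] = &cp
	}
	return c
}

func main() {
	base := &Registry{tools: map[string]*Tool{"bash": {Name: "bash", Cwd: "/"}}}
	slack := base.CloneWithRuntimeConfig("/tmp/slack")
	telegram := base.CloneWithRuntimeConfig("/tmp/telegram")
	slack.tools["bash"].Cwd = "/home/slack" // mutation stays isolated
	fmt.Println(base.tools["bash"].Cwd, telegram.tools["bash"].Cwd)
}
```

Copying the struct values rather than the pointers is what prevents the race: two sessions can move their Bash working directories independently without touching the shared base registry.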


What Claude Code Taught Me: Prompt Cache Discipline

This is the biggest thing I brought back from studying Claude Code's source.

Claude Code treats prompt cache hit rate as a first-class architectural concern. Tool schemas are sorted in a stable, deterministic order. The system prompt is split at a DYNAMIC_BOUNDARY — everything before it is static and cacheable, everything after changes per turn. Deferred tools keep the schema payload small. Even the tool_result format is designed to avoid cache-busting.

ShanClaw already had the three-part prompt split — System (static persona + rules), StableContext (per-session sticky facts), VolatileContext (per-turn state) — separated by <!-- cache_break --> markers. But after studying Claude Code, I made three specific improvements:

  1. Deterministic tool ordering — Tools now sort by source (local → MCP → gateway), then alphabetically within each source. Previously the order could shift between sessions depending on MCP server connection timing.
  2. Cache miss streak logging — ShanClaw now logs to stderr after 3 consecutive prompt cache misses. This catches regressions where a code change accidentally destabilizes the static prefix.
  3. Deferred tool loading — When tool count exceeds 30, MCP and gateway tools become name+description summaries. A tool_search meta-tool loads full schemas on demand, persisted in a WorkingSet across turns. Claude Code does the same via ToolSearch with feature flags. The pattern is identical: lazy-load tool schemas to protect cache efficiency.

These aren't big architectural changes. They're discipline. And they cut ShanClaw's API costs measurably — prompt cache hits went from ~60% to ~85% after adopting stable ordering alone.
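
The deterministic ordering fix is small enough to show whole. A sketch under assumed type names (ToolDef, StableOrder), keyed on the source ranking described above:

```go
package main

import (
	"fmt"
	"sort"
)

type ToolDef struct {
	Name   string
	Source string // "local", "mcp", or "gateway"
}

// sourceRank fixes the source order: local, then MCP, then gateway.
var sourceRank = map[string]int{"local": 0, "mcp": 1, "gateway": 2}

// StableOrder sorts tools by source, then alphabetically, so the
// serialized schema list is byte-identical across sessions and the
// cacheable prompt prefix never shifts with MCP connection timing.
func StableOrder(tools []ToolDef) {
	sort.Slice(tools, func(i, j int) bool {
		if sourceRank[tools[i].Source] != sourceRank[tools[j].Source] {
			return sourceRank[tools[i].Source] < sourceRank[tools[j].Source]
		}
		return tools[i].Name < tools[j].Name
	})
}

func main() {
	tools := []ToolDef{
		{"web_search", "gateway"}, {"playwright_click", "mcp"},
		{"write_file", "local"}, {"read_file", "local"},
	}
	StableOrder(tools)
	for _, t := range tools {
		fmt.Println(t.Source, t.Name)
	}
}
```

Because the comparator is a total order (source rank, then unique name), the result is identical no matter what order MCP servers happened to connect in.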


Tool Execution: Parallel Batching

This is where both harnesses converge on the same answer independently — no borrowing, just the same problem forcing the same solution.

Claude Code's toolOrchestration.ts groups consecutive read-only tool calls into concurrent batches. Write operations run sequentially. ShanClaw's partition.go does the same thing — tools implementing ReadOnlyChecker get batched, everything else runs one at a time. Maximum concurrency: 10 (semaphore-capped).

When you let the model request multiple tool calls per turn, you need a partitioning strategy. The only safe partition is: reads in parallel, writes in sequence.
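
The partition can be sketched directly. This is an illustrative version, not partition.go itself: consecutive read-only calls fan out under a semaphore, writes run strictly in sequence.

```go
package main

import (
	"fmt"
	"sync"
)

type Call struct {
	Tool     string
	ReadOnly bool // in ShanClaw this comes from a ReadOnlyChecker interface
}

const maxConcurrent = 10 // semaphore cap from the article

// RunBatch executes runs of consecutive read-only calls concurrently
// and everything else one at a time, preserving result order.
func RunBatch(calls []Call, exec func(Call) string) []string {
	results := make([]string, len(calls))
	i := 0
	for i < len(calls) {
		if !calls[i].ReadOnly {
			results[i] = exec(calls[i]) // writes run sequentially
			i++
			continue
		}
		// Collect the run of consecutive read-only calls.
		j := i
		for j < len(calls) && calls[j].ReadOnly {
			j++
		}
		sem := make(chan struct{}, maxConcurrent)
		var wg sync.WaitGroup
		for k := i; k < j; k++ {
			wg.Add(1)
			go func(k int) {
				defer wg.Done()
				sem <- struct{}{} // acquire one of 10 slots
				defer func() { <-sem }()
				results[k] = exec(calls[k])
			}(k)
		}
		wg.Wait()
		i = j
	}
	return results
}

func main() {
	calls := []Call{{"read_file", true}, {"grep", true}, {"write_file", false}}
	fmt.Println(RunBatch(calls, func(c Call) string { return "ran " + c.Tool }))
}
```

Each goroutine writes to its own slice index, so no mutex is needed on results; the semaphore only bounds how many reads are in flight at once.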


Permission: 4-Way Race vs. 5-Layer Evaluation

Claude Code runs a "4-way race" for permission decisions — User approval, Hook verdict, Bash Classifier (LLM-based safety judgment), and Bridge control all compete, and the first claim wins via a ResolveOnce atomic.

ShanClaw uses a 5-layer waterfall (first match wins):

  1. Hard blocks — rm -rf /, curl | sh — cannot be overridden
  2. Denied commands — User-configured blocklist
  3. Shell AST parsing — Commands with &&, ||, |, ; reject auto-approval
  4. Allowed commands — User allowlist + default safe set
  5. Tool-level checks — RequiresApproval() + SafeChecker interface

No LLM in the permission path. This was a deliberate choice — I don't want a model deciding whether a command is safe while another model is requesting that command. Shell AST parsing catches the dangerous patterns deterministically.
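
A simplified sketch of the waterfall. The real layer 3 parses a shell AST; this version only scans for control operators, so a quoted | would be a false positive, and the command lists are made-up stand-ins for the configured sets:

```go
package main

import (
	"fmt"
	"strings"
)

// hardBlocks can never be auto-approved, even by user allowlists.
var hardBlocks = []string{"rm -rf /", "curl | sh"}

// allowed stands in for the user allowlist plus the default safe set.
// (The denied-commands layer is omitted from this sketch.)
var allowed = map[string]bool{"ls": true, "git": true, "cat": true}

// AutoApprove walks the waterfall, first match wins: hard blocks,
// then compound-command rejection, then the allowlist.
func AutoApprove(cmd string) (bool, string) {
	for _, b := range hardBlocks {
		if strings.Contains(cmd, b) {
			return false, "hard block"
		}
	}
	for _, op := range []string{"&&", "||", "|", ";"} {
		if strings.Contains(cmd, op) {
			return false, "compound command needs approval"
		}
	}
	fields := strings.Fields(cmd)
	if len(fields) > 0 && allowed[fields[0]] {
		return true, "allowlisted"
	}
	return false, "not allowlisted"
}

func main() {
	for _, cmd := range []string{"ls -la", "ls; rm -rf /", "git status | head"} {
		ok, why := AutoApprove(cmd)
		fmt.Printf("%-22s -> %v (%s)\n", cmd, ok, why)
	}
}
```

Note why the ordering matters: "ls; rm -rf /" hits the hard block before the compound check ever runs, so no later layer can rescue it.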

For daemon mode, there's an additional layer: ApprovalBroker sends approval requests over WebSocket to Shannon Cloud, which relays them to the desktop app. Users approve from their phone. 5-minute timeout, then auto-deny.


Context Compaction: 5 Layers vs. 3 Tiers

Claude Code implements five distinct compaction layers:

  1. Snip compact — Replace old tool results with [snipped] markers
  2. Microcompact — Strip old tool results entirely
  3. Context collapse — Fold completed sub-conversations
  4. Auto-compact — Threshold-triggered summarization
  5. Reactive — Emergency compaction on context-length API errors

ShanClaw takes a simpler approach with three tiers applied per-turn:

  • Tier 1 (>10 messages old): Metadata only
  • Tier 2 (3-10 messages old): Head + tail of results
  • Tier 3 (0-2 messages old): Full content

Plus proactive compaction at 85% of context window — a two-phase LLM call that extracts learnings into MEMORY.md before generating a summary. Emergency compaction (100 chars/result) fires on context-length errors as a last resort.
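
The three tiers amount to a lookup on message age. A sketch with assumed helper names (compactTier, compact) and an arbitrary head-and-tail width:

```go
package main

import (
	"fmt"
	"strings"
)

// compactTier maps a tool result's age in messages to the tiers above.
func compactTier(age int) int {
	switch {
	case age > 10:
		return 1 // metadata only
	case age >= 3:
		return 2 // head + tail of the result
	default:
		return 3 // full content
	}
}

// compact applies the tier to a stored tool result.
func compact(result string, age int) string {
	switch compactTier(age) {
	case 1:
		return fmt.Sprintf("[tool result, %d chars]", len(result))
	case 2:
		if len(result) > 40 {
			return result[:20] + " ... " + result[len(result)-20:]
		}
		return result
	default:
		return result
	}
}

func main() {
	result := strings.Repeat("x", 60)
	for _, age := range []int{0, 5, 12} {
		fmt.Printf("age %2d -> %s\n", age, compact(result, age))
	}
}
```

Because the tier is recomputed per turn, a result that was full content three messages ago degrades to head-and-tail, then to metadata, without any separate compaction pass.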

Claude Code's approach is more granular. ShanClaw's is more aggressive. The trade-off: ShanClaw's daemon sessions can run for hours across dozens of messages without hitting context limits, but early context gets lossy faster. For a coding assistant where you're actively watching, Claude Code's gradual degradation makes more sense. For a daemon processing async messages, aggressive compaction is the right call.


Loop Detection: The Hardest Problem

Both harnesses recognize that LLMs get stuck. The solutions differ significantly.

Claude Code handles this implicitly through its loop structure — the model drives continuation, and the harness trusts it to stop. Hooks and stop conditions provide guardrails.

ShanClaw implements 9 explicit detectors running on a sliding window of 20 recent tool calls:

  1. Tool mode switch — Visual tool after successful GUI tool (unnecessary verification)
  2. Success after error — Visual tool after error recovery
  3. Consecutive duplicate — Back-to-back identical calls (threshold: 2)
  4. Exact duplicate — Same call spread across window (threshold: 3)
  5. Same tool error — Same tool erroring repeatedly (threshold: 4)
  6. Family no-progress — Same-topic web calls (3→nudge, 5→stronger, 7→force stop)
  7. Search escalation — Unproductive search streak (5→nudge, 8→force stop)
  8. No progress — Same tool too many times (threshold: 8)
  9. Sleep detection — Bash sleep commands (2→nudge, 4→force stop)

Two response levels: LoopNudge injects a correction message. LoopForceStop forces a final response.
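
One of the nine can be sketched to show the shape they all share: a sliding window plus a threshold. This is the consecutive-duplicate detector only, with the assumption that the threshold means the detector fires once a call repeats past two in a row.

```go
package main

import "fmt"

const (
	windowSize         = 20 // sliding window of recent tool calls
	consecutiveDupeMax = 2  // threshold from detector #3
)

type call struct {
	Tool string
	Args string
}

type LoopDetector struct {
	window []call
}

// Observe records a call and reports whether the consecutive-duplicate
// detector fired. The other eight detectors would scan the same window.
func (d *LoopDetector) Observe(c call) bool {
	d.window = append(d.window, c)
	if len(d.window) > windowSize {
		d.window = d.window[1:]
	}
	streak := 0
	for i := len(d.window) - 1; i >= 0 && d.window[i] == c; i-- {
		streak++
	}
	return streak > consecutiveDupeMax
}

func main() {
	var d LoopDetector
	c := call{"web_search", "golang agents"}
	for i := 1; i <= 3; i++ {
		fmt.Println(i, d.Observe(c)) // fires on the third identical call
	}
}
```

A different call resets the streak, which is what keeps legitimate retry-then-move-on behavior from tripping the detector.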

Why 9 detectors? Because ShanClaw processes messages from channels where no human is watching in real time. A Slack bot stuck in a search loop at 3 AM is burning money until someone notices. The detection has to be proactive.


The Cloud Delegation Pattern

This is where ShanClaw fundamentally diverges from Claude Code. Claude Code is self-contained — it runs locally, calls the API, executes tools. ShanClaw adds a cloud_delegate tool that bridges to Shannon Cloud's multi-agent system:

  1. Submit a task request via REST
  2. Get back a workflow ID and SSE stream URL
  3. Consume events: agent starts/completions, research plans, task progress, streaming deltas
  4. For research workflows, the cloud result bypasses local LLM summarization entirely — it's the final deliverable

Three workflow types: research (multi-source deep research), swarm (dynamic agent coordination), auto (fixed DAG with parallel subtasks).
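
Step 3 is ordinary SSE consumption. A sketch of the parsing side, fed a string instead of a live HTTP body; the field names follow the SSE wire format, while the event types here are invented examples, not Shannon Cloud's actual schema:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Event is one server-sent event from the workflow stream.
type Event struct {
	Type string
	Data string
}

// parseSSE consumes a text/event-stream body and returns the events.
func parseSSE(body string) []Event {
	var events []Event
	var cur Event
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := sc.Text()
		switch {
		case line == "": // blank line terminates an event
			if cur.Type != "" || cur.Data != "" {
				events = append(events, cur)
			}
			cur = Event{}
		case strings.HasPrefix(line, "event: "):
			cur.Type = strings.TrimPrefix(line, "event: ")
		case strings.HasPrefix(line, "data: "):
			cur.Data += strings.TrimPrefix(line, "data: ")
		}
	}
	return events
}

func main() {
	stream := "event: agent_started\ndata: researcher-1\n\nevent: delta\ndata: partial text\n\n"
	for _, e := range parseSSE(stream) {
		fmt.Println(e.Type, "->", e.Data)
	}
}
```

In the real client this loop would run against the stream URL returned in step 2, dispatching on the event type to update progress or accumulate deltas.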

The local harness handles what needs local access — file operations, shell execution, GUI automation. Complex reasoning and multi-agent coordination delegate up to the cloud. This is the pattern I think matters: local execution + cloud intelligence, not one or the other.


GUI Automation: Where Go Shines

Claude Code doesn't do native GUI automation. ShanClaw does — it's a core design goal.

The AXClient manages a persistent ax_server child process over JSON-RPC. Two transport modes: bundled (inside .app bundle, Unix domain socket, required for macOS TCC permissions) and fallback (bare binary, stdin/stdout pipes for development).

The AccessibilityTool provides ref-based GUI interaction — read_tree, click, press, set_value, find, scroll, annotate. Ref-based means stable element identifiers instead of coordinates. The ComputerTool handles coordinate-based fallback. AppleScriptTool covers operations with no accessibility equivalent.

When Playwright MCP connects, the harness removes legacy browser tools automatically — no duplicate capabilities, no confused model.


What I Think Matters Most

After building ShanClaw's harness from scratch and then studying Claude Code's implementation in detail, three things stand out:

1. The harness is the product. The model is a commodity — you can swap Claude for GPT for Gemini. The harness determines what the agent can actually do, how safely it does it, and how it recovers from failures. Anyone still focused on "which model is best" is asking the wrong question.

2. Deployment model drives architecture. Claude Code's simplicity comes from serving one user interactively. ShanClaw's complexity comes from serving multiple async channels concurrently. Neither is wrong. The harness must match the deployment model.

3. Independent convergence reveals fundamentals. Read-write partitioning for parallel tool execution. Deferred tool loading past a threshold. Tiered context compaction. Progressive disclosure over monolithic prompts. When teams that never talked to each other build the same patterns, those patterns are probably load-bearing truths of agent design. And when you find a practice the other team did better — like Claude Code's cache discipline — adopt it immediately.

The harness pattern is still young. We're all figuring it out. But the shape is becoming clear: a thin orchestration layer, a robust permission system, aggressive context management, and explicit loop detection. Everything else is details.

Kocoro-lab/ShanClaw

AI agent runtime powered by Shannon — Mac file ops, shell, GUI automation — with complex task delegation via Shannon Cloud workflows

View on GitHub
Kocoro-lab/Shannon

Production-Grade Multi-Agent Platform — Built with Rust, Go, and Python

View on GitHub

I also made a video walkthrough of the Claude Code architecture analysis (in Chinese): Bilibili.