Chapter 22: Observability
Observability makes your system transparent: metrics quantify performance, traces pinpoint bottlenecks, and logs record the details. But observing too much has its own costs: metric explosion, storage bills, and runtime overhead. Choosing what to observe matters more than observing for its own sake.
Quick Path (5 minutes to grasp the core)
- Three pillars have different strengths: Metrics for trends, Traces for call chains, Logs for details
- Prometheus metric naming follows the {system}_{module}_{action}_{unit} format
- OpenTelemetry traces chain cross-service calls together, using Spans to record each hop
- Structured logs must include trace_id for correlating all three
- Sampling strategy: 10-20% in production, up to 50% for critical paths
10-minute path: 22.1-22.3 -> 22.5 -> Shannon Lab
Monday morning, user complains: "Agent response is too slow, was fine last week."
You open logs, search for an hour:
- Did LLM API slow down, or is our code slow?
- Are all requests slow, or just certain task types?
- Is token consumption normal or abnormal?
- Which Agent takes the most time? What's the failure rate?
The answer to every one of these questions: you don't know, you can only guess.
This is why you need observability.
22.1 Three Pillars of Observability
Each has different strengths, complementing each other:
| Pillar | Strengths | Weaknesses | Typical Question |
|---|---|---|---|
| Metrics | Aggregation, trends, alerting | No detail on individual requests | "What's the failure rate?" |
| Traces | Full chain for single request | High storage cost | "Where is this request stuck?" |
| Logs | Detailed context | Hard to aggregate and analyze | "What were the parameters when it errored?" |
22.2 Prometheus Metrics
Why Choose Prometheus?
- Time series database, naturally suited for performance metrics
- Pull model: services only expose a /metrics endpoint and don't need to know the monitoring system exists
- Powerful query language PromQL
- Mature ecosystem: Grafana, Alertmanager
Shannon's Metrics System
Shannon defines Pattern-level metrics in go/orchestrator/internal/workflows/metrics/pattern_metrics.go:
package metrics
import (
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
)
var (
// Pattern execution counter
PatternExecutions = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "shannon_pattern_executions_total",
Help: "Total number of pattern executions by type",
},
[]string{"pattern", "workflow_version"},
)
// Pattern execution duration
PatternDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "shannon_pattern_duration_seconds",
Help: "Duration of pattern executions in seconds",
Buckets: prometheus.DefBuckets,
},
[]string{"pattern", "workflow_version"},
)
// Agent executions per Pattern
AgentExecutionsByPattern = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "shannon_agents_by_pattern_total",
Help: "Total number of agents executed by pattern type",
},
[]string{"pattern", "workflow_version"},
)
// Token usage
TokenUsageByPattern = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "shannon_tokens_by_pattern_total",
Help: "Total tokens used by pattern type",
},
[]string{"pattern", "workflow_version"},
)
// Reflection improvement count
ReflectionImprovements = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "shannon_reflection_improvements_total",
Help: "Number of times reflection improved quality",
},
[]string{"workflow_version"},
)
)
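Driving these vectors at a call site is a one-liner per metric: pick the label values, then Inc, Add, or Observe. A minimal sketch (the helper name observePatternRun is illustrative, not Shannon's; it assumes the time package is imported alongside the Prometheus clients):
// observePatternRun records one pattern execution against the vectors above.
func observePatternRun(pattern, version string, d time.Duration, tokens int) {
    PatternExecutions.WithLabelValues(pattern, version).Inc()
    PatternDuration.WithLabelValues(pattern, version).Observe(d.Seconds())
    TokenUsageByPattern.WithLabelValues(pattern, version).Add(float64(tokens))
}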
Metric Categories
What metrics does a complete Agent system need?
1. Workflow-Level Metrics
// Workflow completion count
WorkflowsCompleted = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "shannon_workflows_completed_total",
Help: "Total number of workflows completed",
},
[]string{"workflow_type", "mode", "status"}, // research/react/dag, sync/async, success/failed
)
// Workflow latency distribution
WorkflowDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "shannon_workflow_duration_seconds",
Help: "Workflow execution duration in seconds",
Buckets: []float64{1, 5, 10, 30, 60, 120, 300}, // 1s to 5min
},
[]string{"workflow_type", "mode"},
)
2. Token/Cost Metrics
// Token consumption per task
TaskTokensUsed = promauto.NewHistogram(
prometheus.HistogramOpts{
Name: "shannon_task_tokens_used",
Help: "Number of tokens used per task",
Buckets: []float64{100, 500, 1000, 5000, 10000, 50000},
},
)
// Cost (USD)
TaskCostUSD = promauto.NewHistogram(
prometheus.HistogramOpts{
Name: "shannon_task_cost_usd",
Help: "Cost in USD per task",
Buckets: []float64{0.001, 0.01, 0.05, 0.1, 0.5, 1, 5},
},
)
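A brief sketch of feeding these two histograms once a task finishes; the flat per-token price below is a made-up constant for illustration, not Shannon's pricing logic:
// Assumed flat rate purely for illustration (USD per 1K tokens).
const illustrativeCostPer1KTokens = 0.002

func observeTaskUsage(tokensUsed int) {
    TaskTokensUsed.Observe(float64(tokensUsed))
    TaskCostUSD.Observe(float64(tokensUsed) / 1000 * illustrativeCostPer1KTokens)
}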
3. Memory System Metrics
// Memory fetch operations
MemoryFetches = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "shannon_memory_fetches_total",
Help: "Total number of memory fetch operations",
},
[]string{"type", "source", "result"}, // session/semantic, qdrant/redis, hit/miss
)
// Compression ratio
CompressionRatio = promauto.NewHistogram(
prometheus.HistogramOpts{
Name: "shannon_compression_ratio",
Help: "Compression ratio (original/compressed)",
Buckets: []float64{1.5, 2, 3, 5, 10, 20},
},
)
4. Vector Search Metrics
// Vector search count
VectorSearches = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "shannon_vector_search_total",
Help: "Total number of vector searches",
},
[]string{"collection", "status"},
)
// Vector search latency
VectorSearchLatency = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "shannon_vector_search_latency_seconds",
Help: "Vector search latency in seconds",
Buckets: prometheus.DefBuckets,
},
[]string{"collection"},
)
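A convenient way to feed the latency histogram is prometheus.NewTimer, which observes the elapsed time when the deferred call fires. A sketch, where doSearch and the collection name are placeholders:
func searchVectors(ctx context.Context, query []float32) error {
    timer := prometheus.NewTimer(VectorSearchLatency.WithLabelValues("task_embeddings"))
    defer timer.ObserveDuration()

    err := doSearch(ctx, query) // placeholder for the actual vector store call
    status := "success"
    if err != nil {
        status = "error"
    }
    VectorSearches.WithLabelValues("task_embeddings", status).Inc()
    return err
}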
Metric Recording
Shannon encapsulates metric recording in activities/metrics.go:
// PatternMetricsInput contains pattern execution metrics
type PatternMetricsInput struct {
Pattern string
Version string
AgentCount int
TokensUsed int
Duration time.Duration
Improved bool // For reflection pattern
WorkflowType string
}
// RecordPatternMetrics records pattern execution metrics
func RecordPatternMetrics(ctx context.Context, input PatternMetricsInput) error {
// Record pattern execution
metrics.RecordPatternExecution(input.Pattern, input.Version)
// Record duration if provided
if input.Duration > 0 {
metrics.RecordPatternDuration(input.Pattern, input.Version, input.Duration.Seconds())
}
// Record agent executions
if input.AgentCount > 0 {
metrics.RecordAgentExecution(input.Pattern, input.Version, input.AgentCount)
}
// Record token usage
if input.TokensUsed > 0 {
metrics.RecordTokenUsage(input.Pattern, input.Version, input.TokensUsed)
}
// Record reflection improvement
if input.Pattern == "reflection" && input.Improved {
metrics.RecordReflectionImprovement(input.Version)
}
return nil
}
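The metrics.Record* helpers called above are thin wrappers around the vectors defined in pattern_metrics.go. Their exact bodies aren't reproduced here; a plausible shape looks like this:
// RecordPatternDuration observes one pattern execution's duration in seconds.
func RecordPatternDuration(pattern, version string, seconds float64) {
    PatternDuration.WithLabelValues(pattern, version).Observe(seconds)
}

// RecordTokenUsage adds the tokens consumed by one pattern execution.
func RecordTokenUsage(pattern, version string, tokens int) {
    TokenUsageByPattern.WithLabelValues(pattern, version).Add(float64(tokens))
}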
Metrics Server
go func() {
http.Handle("/metrics", promhttp.Handler())
port := cfg.MetricsPort(2112)
addr := ":" + fmt.Sprintf("%d", port)
logger.Info("Metrics server listening", zap.String("address", addr))
if err := http.ListenAndServe(addr, nil); err != nil {
logger.Error("Failed to start metrics server", zap.Error(err))
}
}()
22.3 OpenTelemetry Tracing
Why Need Tracing?
Metrics tell you "the system is slow overall"; they don't tell you "where this particular request is stuck".
Tracing fills that gap: it records the complete path of a request from the moment it enters the system until the response returns.
Initialization
Shannon initializes tracing in go/orchestrator/internal/tracing/tracing.go:
package tracing
import (
"context"
"fmt"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/sdk/resource"
"go.opentelemetry.io/otel/sdk/trace"
"go.opentelemetry.io/otel/semconv/v1.27.0"
oteltrace "go.opentelemetry.io/otel/trace"
"go.uber.org/zap"
)
var tracer oteltrace.Tracer
// Config holds tracing configuration
type Config struct {
Enabled bool `mapstructure:"enabled"`
ServiceName string `mapstructure:"service_name"`
OTLPEndpoint string `mapstructure:"otlp_endpoint"`
}
// Initialize sets up minimal OTLP tracing
func Initialize(cfg Config, logger *zap.Logger) error {
// Always initialize a tracer handle
if cfg.ServiceName == "" {
cfg.ServiceName = "shannon-orchestrator"
}
tracer = otel.Tracer(cfg.ServiceName)
if !cfg.Enabled {
logger.Info("Tracing disabled")
return nil
}
if cfg.OTLPEndpoint == "" {
cfg.OTLPEndpoint = "localhost:4317"
}
// Create OTLP exporter
exporter, err := otlptracegrpc.New(
context.Background(),
otlptracegrpc.WithEndpoint(cfg.OTLPEndpoint),
otlptracegrpc.WithInsecure(),
)
if err != nil {
return fmt.Errorf("failed to create OTLP exporter: %w", err)
}
// Create resource
res, _ := resource.New(context.Background(),
resource.WithAttributes(
semconv.ServiceName(cfg.ServiceName),
semconv.ServiceVersion("1.0.0"),
),
)
// Create tracer provider
tp := trace.NewTracerProvider(
trace.WithBatcher(exporter),
trace.WithResource(res),
)
otel.SetTracerProvider(tp)
tracer = otel.Tracer(cfg.ServiceName)
logger.Info("Tracing initialized", zap.String("endpoint", cfg.OTLPEndpoint))
return nil
}
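One detail the excerpt omits: on shutdown, the batch processor should be flushed so the last spans aren't dropped. A minimal sketch, assuming the provider is kept in a package-level variable (which the code above doesn't show):
var tracerProvider *trace.TracerProvider // assumed to be set inside Initialize

// Shutdown flushes any buffered spans; call it from the service's shutdown path.
func Shutdown(ctx context.Context) error {
    if tracerProvider == nil {
        return nil
    }
    return tracerProvider.Shutdown(ctx)
}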
Span Creation
// StartSpan creates a new span with the given name
func StartSpan(ctx context.Context, spanName string) (context.Context, oteltrace.Span) {
return tracer.Start(ctx, spanName)
}
// StartHTTPSpan creates a span for HTTP operations with method and URL
func StartHTTPSpan(ctx context.Context, method, url string) (context.Context, oteltrace.Span) {
if tracer == nil {
tracer = otel.Tracer("shannon-orchestrator")
}
spanName := fmt.Sprintf("HTTP %s", method)
ctx, span := tracer.Start(ctx, spanName)
span.SetAttributes(
semconv.HTTPRequestMethodKey.String(method),
semconv.URLFull(url),
)
return ctx, span
}
W3C Traceparent Propagation
Cross-service calls need to propagate Trace Context:
// W3CTraceparent generates a W3C traceparent header value
func W3CTraceparent(ctx context.Context) string {
span := oteltrace.SpanFromContext(ctx)
if !span.SpanContext().IsValid() {
return ""
}
sc := span.SpanContext()
return fmt.Sprintf("00-%s-%s-%02x",
sc.TraceID().String(),
sc.SpanID().String(),
sc.TraceFlags(),
)
}
// InjectTraceparent adds W3C traceparent header to HTTP request
func InjectTraceparent(ctx context.Context, req *http.Request) {
if traceparent := W3CTraceparent(ctx); traceparent != "" {
req.Header.Set("traceparent", traceparent)
}
}
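On the receiving side, the standard OpenTelemetry propagator reconstructs the same trace from the traceparent header. A sketch of what a downstream Go HTTP handler could do (illustrative only; Shannon's downstream services are not necessarily written this way):
import (
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

func handleRequest(w http.ResponseWriter, r *http.Request) {
    // Rebuild the caller's trace context from the traceparent header.
    ctx := propagation.TraceContext{}.Extract(r.Context(), propagation.HeaderCarrier(r.Header))

    // Spans started from ctx become children of the caller's span.
    ctx, span := otel.Tracer("downstream-service").Start(ctx, "handle_request")
    defer span.End()

    _ = ctx // pass ctx onward so further calls join the same trace
}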
Cross-Service Tracing Example
func (a *Activities) ExecuteAgent(ctx context.Context, input AgentExecutionInput) (...) {
// Create HTTP Span
ctx, span := tracing.StartHTTPSpan(ctx, "POST", llmServiceURL)
defer span.End()
// Build request
req, _ := http.NewRequestWithContext(ctx, "POST", llmServiceURL, body)
// Inject Trace Context
tracing.InjectTraceparent(ctx, req)
// Execute request
resp, err := client.Do(req)
// Record the outcome on the span
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
return nil, err
}
// ... decode resp into result ...
return result, nil
}
This way, the entire chain from Orchestrator to LLM Service can be viewed in Jaeger or other tracing systems.
22.4 Structured Logging
Zap Logger Configuration
Shannon uses Zap as the logging library:
logger, err := zap.NewProduction()
if err != nil {
log.Fatalf("Failed to initialize logger: %v", err)
}
defer logger.Sync()
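zap.NewProduction() is a sensible default (JSON output, Info level, built-in log sampling). If you need a different level or want sampling off, build from the production config instead; a brief sketch (requires the zapcore and log imports):
cfg := zap.NewProductionConfig()
cfg.Level = zap.NewAtomicLevelAt(zapcore.DebugLevel) // e.g. more verbose in staging
cfg.Sampling = nil                                   // disable zap's own log sampling

logger, err := cfg.Build()
if err != nil {
    log.Fatalf("Failed to initialize logger: %v", err)
}
defer logger.Sync()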
Logging Best Practices
1. Include Context
// Good logging: includes sufficient context
logger.Info("Workflow started",
zap.String("workflow_id", workflowID),
zap.String("workflow_type", workflowType),
zap.String("session_id", sessionID),
zap.Int("subtasks", len(subtasks)),
)
// Bad logging: lacks context
logger.Info("Workflow started") // Which workflow?
2. Error Logs Should Be Detailed
logger.Error("Agent execution failed",
zap.Error(err),
zap.String("agent_id", agentID),
zap.String("workflow_id", workflowID),
zap.Int("attempt", attempt),
zap.Duration("duration", duration),
)
3. Use Warn for Performance Issues
logger.Warn("Slow LLM response",
zap.Duration("duration", duration),
zap.String("model", modelUsed),
zap.Int("tokens", tokensUsed),
)
4. Control Debug Log Volume
// Summarize rather than log each one
logger.Debug("Processing items",
zap.Int("count", len(items)),
zap.Duration("total_duration", totalDuration),
)
22.5 Health Checks
Health Manager
Shannon's health check system is in go/orchestrator/internal/health/manager.go:
// Manager implements the HealthManager interface
type Manager struct {
checkers map[string]*CheckerState
lastResults map[string]CheckResult
config *HealthConfiguration
started bool
checkInterval time.Duration
stopCh chan struct{}
logger *zap.Logger
mu sync.RWMutex
}
// CheckerState represents the runtime state of a health checker
type CheckerState struct {
checker Checker
enabled bool
interval time.Duration
timeout time.Duration
critical bool // Critical dependency vs non-critical dependency
lastCheck time.Time
}
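The Checker referenced by CheckerState is the per-dependency probe. Its definition isn't reproduced here; an assumed minimal shape, for orientation only:
// Assumed interface shape, not the verbatim Shannon definition.
type Checker interface {
    Name() string                          // component name, e.g. "database"
    Check(ctx context.Context) CheckResult // one probe, bounded by the configured timeout
}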
Key design: distinguish between Critical and Non-Critical checks.
// Calculate overall health status
func (m *Manager) calculateOverallStatus(components map[string]CheckResult, summary HealthSummary) OverallHealth {
criticalFailures := 0
nonCriticalFailures := 0
degradedComponents := 0
for _, result := range components {
if result.Status == StatusUnhealthy {
if result.Critical {
criticalFailures++
} else {
nonCriticalFailures++
}
}
}
// Only critical dependency failures are judged as Unhealthy
if criticalFailures > 0 {
return OverallHealth{
Status: StatusUnhealthy,
Message: fmt.Sprintf("%d critical component(s) failing", criticalFailures),
Ready: false,
Live: true, // Still alive, just not ready
}
} else if nonCriticalFailures > 0 {
return OverallHealth{
Status: StatusDegraded, // Degraded but usable
Message: fmt.Sprintf("%d non-critical component(s) failing", nonCriticalFailures),
Ready: true,
Live: true,
}
}
// ...
}
Registering Checkers
// Create health manager
hm := health.NewManager(logger)
// Register various checkers
if dbClient != nil {
dbChecker := health.NewDatabaseHealthChecker(dbClient.GetDB(), dbClient.Wrapper(), logger)
_ = hm.RegisterChecker(dbChecker)
}
// Redis check
if rw := orchestratorService.SessionManager().RedisWrapper(); rw != nil {
rc := health.NewRedisHealthChecker(rw.GetClient(), rw, logger)
_ = hm.RegisterChecker(rc)
}
// Agent Core check
if agentAddr != "" {
conn, err := grpc.Dial(agentAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
if err == nil {
client := agentpb.NewAgentServiceClient(conn)
ac := health.NewAgentCoreHealthChecker(client, conn, logger)
_ = hm.RegisterChecker(ac)
}
}
// LLM Service check
lc := health.NewLLMServiceHealthChecker(llmBase, logger)
_ = hm.RegisterChecker(lc)
Health Endpoints
# Liveness check (for Kubernetes liveness probe)
GET /health/live
# Readiness check (for Kubernetes readiness probe)
GET /health/ready
# Detailed status
GET /health
Response example:
{
"status": "healthy",
"ready": true,
"live": true,
"checks": {
"database": {"status": "healthy", "critical": true, "latency_ms": 5},
"redis": {"status": "healthy", "critical": false, "latency_ms": 2},
"llm_service": {"status": "healthy", "critical": true, "latency_ms": 150},
"agent_core": {"status": "healthy", "critical": true, "latency_ms": 10}
}
}
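A hedged sketch of wiring these endpoints onto an HTTP mux; the IsReady accessor on the health manager is assumed for illustration:
mux := http.NewServeMux()

// Liveness: the process is up. Failing this makes Kubernetes restart the pod.
mux.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
})

// Readiness: critical dependencies are healthy. Failing this removes the pod from load balancing.
mux.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
    if hm.IsReady() { // assumed accessor on the health manager
        w.WriteHeader(http.StatusOK)
        return
    }
    w.WriteHeader(http.StatusServiceUnavailable)
})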
22.6 Alerting Strategy
Prometheus Alertmanager Rules
groups:
- name: shannon-alerts
rules:
# High workflow failure rate
- alert: HighWorkflowFailureRate
expr: |
sum(rate(shannon_workflows_completed_total{status="failed"}[5m]))
/ sum(rate(shannon_workflows_completed_total[5m])) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "Workflow failure rate > 10%"
description: "{{ $value | humanizePercentage }} of workflows failed in last 5m"
# High LLM latency
- alert: HighLLMLatency
expr: |
histogram_quantile(0.95,
sum(rate(shannon_pattern_duration_seconds_bucket{pattern="react"}[5m])) by (le)
) > 30
for: 10m
labels:
severity: warning
annotations:
summary: "P95 pattern execution > 30s"
# Abnormal token consumption
- alert: AbnormalTokenUsage
expr: |
sum(rate(shannon_tokens_by_pattern_total[1h])) > 1000000
for: 1h
labels:
severity: warning
annotations:
summary: "Token consumption > 1M/hour"
# Critical dependency unhealthy
- alert: CriticalDependencyDown
expr: |
shannon_health_check_status{critical="true"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Critical dependency {{ $labels.component }} is down"
Alert Severity Levels
| Level | Trigger Condition | Response Time | Notification Channel |
|---|---|---|---|
| Critical | Critical dependency down, failure rate >10% | Immediate | PagerDuty/Phone |
| Warning | Latency increase, non-critical dependency issues | Within 1 hour | Slack |
| Info | Resource usage approaching threshold | Next business day | |
22.7 Dashboard Design
Grafana Panel Layout
Overview Row
- Active workflow count (Gauge)
- Requests per minute (Counter rate)
- Success rate (Percentage)
- Total token consumption (Counter)
Performance Row
- Workflow latency distribution (Heatmap)
- Pattern execution time (Histogram)
- P50/P95/P99 trends (Graph)
Resource Row
- Token consumption trends (Graph)
- Cost trends (Graph)
- Pattern usage distribution (Pie)
Error Row
- Error rate time series (Graph)
- Error type distribution (Pie)
- Recent error list (Table)
PromQL Examples
# Workflow success rate
sum(rate(shannon_workflows_completed_total{status="success"}[5m]))
/ sum(rate(shannon_workflows_completed_total[5m]))
# P99 latency
histogram_quantile(0.99,
sum(rate(shannon_workflow_duration_seconds_bucket[5m])) by (le, workflow_type)
)
# Token consumption rate (per minute)
sum(rate(shannon_tokens_by_pattern_total[5m])) * 60
# Pattern usage proportion
sum by (pattern) (rate(shannon_pattern_executions_total[1h]))
/ sum(rate(shannon_pattern_executions_total[1h]))
# Reflection improvement rate
sum(rate(shannon_reflection_improvements_total[1h]))
/ sum(rate(shannon_pattern_executions_total{pattern="reflection"}[1h]))
22.8 Common Pitfalls
Pitfall 1: Cardinality Explosion
// Wrong: High cardinality labels
AgentExecutions.WithLabelValues(userID, taskID, timestamp)
// userID * taskID * timestamp = infinite combinations, Prometheus will OOM
// Correct: Limited cardinality
AgentExecutions.WithLabelValues(agentType, mode)
// agentType * mode = limited combinations (e.g., 5 * 3 = 15)
Rule of thumb: Label combinations per metric should not exceed 1000.
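A practical way to stay within that budget is to normalize free-form values into a small, known set before they ever become labels. A small illustrative helper (the prefixes are examples, not an exhaustive mapping):
// normalizeModelLabel maps arbitrary model identifiers onto a bounded label set,
// so new model versions don't silently create new time series.
func normalizeModelLabel(model string) string {
    switch {
    case strings.HasPrefix(model, "gpt-4"):
        return "gpt-4"
    case strings.HasPrefix(model, "gpt-3.5"):
        return "gpt-3.5"
    case strings.HasPrefix(model, "claude"):
        return "claude"
    default:
        return "other"
    }
}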
Pitfall 2: Improper Trace Sampling
// Wrong: Sample all requests (production storage cost too high)
tp := trace.NewTracerProvider(
trace.WithSampler(trace.AlwaysSample()),
)
// Correct: Probabilistic sampling
tp := trace.NewTracerProvider(
trace.WithSampler(trace.TraceIDRatioBased(0.1)), // 10% sampling
)
// Better: ParentBased keeps the upstream sampling decision and applies the ratio only to root spans.
// Capturing 100% of errored traces additionally requires tail-based sampling (e.g. in the OTel Collector).
tp := trace.NewTracerProvider(
trace.WithSampler(trace.ParentBased(
trace.TraceIDRatioBased(0.1),
)),
)
Pitfall 3: Too Many Logs
// Wrong: Logging in loops
for _, item := range items {
logger.Info("Processing item", zap.String("id", item.ID))
}
// If items has 10000 entries, that's 10000 log lines
// Correct: Batch summary
logger.Info("Processing items",
zap.Int("count", len(items)),
zap.Duration("total_duration", totalDuration),
)
Pitfall 4: Overly Strict Health Checks
// Wrong: Any dependency failure reports unhealthy
if !db.Ping() || !redis.Ping() || !llm.Ping() {
return unhealthy
}
// A transient Redis blip marks the whole service unhealthy, and Kubernetes restarts it
// Correct: Distinguish critical and non-critical
type CheckerState struct {
Critical bool // Only Critical=true affects overall health
}
// Redis is a cache and doesn't affect core functionality, so mark it Non-Critical
Pitfall 5: Missing Trace ID Correlation
// Wrong: Log and trace not correlated
logger.Error("Request failed", zap.Error(err))
// Correct: Log includes Trace ID
traceID := oteltrace.SpanFromContext(ctx).SpanContext().TraceID().String()
logger.Error("Request failed",
zap.Error(err),
zap.String("trace_id", traceID),
)
// This allows jumping from log to trace details
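A small helper makes the correlation hard to forget (illustrative, not an existing Shannon API):
// loggerWithTrace returns a child logger carrying the current trace_id, so every
// subsequent log line from this request can be joined to its trace.
func loggerWithTrace(ctx context.Context, logger *zap.Logger) *zap.Logger {
    sc := oteltrace.SpanFromContext(ctx).SpanContext()
    if !sc.IsValid() {
        return logger
    }
    return logger.With(zap.String("trace_id", sc.TraceID().String()))
}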
What This Chapter Covered
- Three Pillars Division: Metrics quantify trends, Traces locate single requests, Logs record details
- Metrics Design: Categorize by layer (workflow/Agent/memory/vector), control label cardinality
- Trace Propagation: W3C Traceparent for cross-service propagation
- Health Checks: Distinguish Critical and Non-Critical dependencies
- Alert Severity: Critical for immediate response, Warning within 1 hour, Info next business day
Shannon Lab (10-Minute Quickstart)
This section helps you map this chapter's concepts to Shannon source code in 10 minutes.
Required Reading (1 file)
- go/orchestrator/internal/workflows/metrics/pattern_metrics.go: Pattern-level metric definitions; study the metric naming and label design
Optional Deep Dive (pick 2 based on interest)
- go/orchestrator/internal/tracing/tracing.go: Tracing initialization and W3C Traceparent propagation
- go/orchestrator/internal/health/manager.go: Health check manager; note the Critical/Non-Critical distinction
Exercises
Exercise 1: Design Metrics System
Design a metrics system for a "code review Agent", including:
- Workflow-level metrics
- Agent-level metrics
- LLM call metrics
- Cost metrics
List metric names, labels, types (Counter/Histogram/Gauge).
Exercise 2: Alert Rules
Based on Exercise 1's metrics, design 3 alert rules:
- 1 Critical level
- 1 Warning level
- 1 Info level
Write PromQL expressions and trigger conditions.
Exercise 3 (Advanced): Debugging Scenario
User reports: "Yesterday from 3pm to 4pm, Agent was especially slow."
Describe:
- Which metrics you would look at first
- If the metrics show increased LLM latency, what you would check next
- How you would find the specific slow requests
- How you would confirm the root cause
Further Reading
- Prometheus Best Practices: Naming conventions, label design
- OpenTelemetry Documentation: Sampling strategies, Span design
- SRE Book - Monitoring: Google's monitoring experience
Next Chapter Preview
This is the last chapter of Part 7, Production Architecture. We covered the three-layer architecture, Temporal workflows, and observability: the foundations for making an Agent system run reliably.
From Part 8, we enter Enterprise Features.
Chapter 23 covers Token Budget Control: how to implement hard cost controls at task/session/tenant three levels, preventing runaway Agents from blowing up your bills.