Chapter 24: Policy Governance (OPA)

OPA takes Agent system access control from hardcoded to declarative—policy updates don't require redeploying code, but it's not omnipotent; real security comes from defense in depth.


Quick Path (5 minutes to grasp the core)

  1. OPA extracts security rules from code, change policies without redeployment
  2. Default deny + deny takes priority: Unmatched requests are denied, deny rules have highest priority
  3. Canary release: First dry-run to observe, then gradually enforce by percentage
  4. Cache decision results, note cache key must include all fields affecting the decision
  5. FailOpen is more suitable for production than FailClosed (combined with alerting)

10-minute path: 24.1-24.3 -> 24.6 -> Shannon Lab


Your Agent system needs a new rule: "Block queries containing 'password'."

How to do it? Modify code, add an if strings.Contains(req.Query, "password"), then deploy.

A week later, security team wants ten more rules. Modify code again, deploy again.

A month later, code is full of if-else everywhere, no one can explain "why this request was rejected". Every rule change needs the full CI/CD process, urgent security fixes take 2 hours to go live.

The most extreme case I've seen: A system had 200+ security checkpoints scattered across 30+ files. A new engineer wanted to add one rule, took three days just to find all the places that needed changing.

This is the predicament of hardcoded policies.


24.1 Why Do We Need a Policy Engine?

The Predicament of Hardcoding

Comparing hardcoded vs OPA declarative policies:

// ========== Traditional Approach: Hardcoded Security Checks ==========
func SubmitTask(ctx context.Context, req *TaskRequest) error {
    if strings.Contains(req.Query, "password") { return errors.New("forbidden") }
    if req.TokenBudget > 10000 { return errors.New("budget too high") }
    if req.UserID == "blocked_user" { return errors.New("user blocked") }
    // More if-else... 50 lines later you've forgotten why you're checking
}
# ========== OPA Approach: Declarative Policy File ==========
package shannon.task

deny["dangerous pattern"] { contains(lower(input.query), "password") }
deny["budget too high"]   { input.token_budget > 10000 }
deny["user blocked"]      { blocked_users[input.user_id] }
ComparisonHardcodedOPA
Policy ChangesModify code -> Test -> DeployModify .rego -> Hot reload
Rule LocationScattered everywhereCentralized in policies/
Audit TrailNoneEvery decision logged
Release MethodFull rolloutSupports canary/Dry-run
Version ManagementNoneHash versioned

24.2 OPA Integration Architecture

Policy Execution Flow

In Shannon, OPA serves as an independent policy engine, evaluating requests before they enter the Workflow:

OPA Policy Engine Architecture

Core Components

ComponentResponsibility
OPAEngineMain entry for policy loading, compilation, evaluation
PolicyInputContext data for evaluation (user, query, budget, etc.)
DecisionEvaluation result (allow/deny + reason)
base.regoBase security policies (default deny, deny priority)

24.3 OPAEngine Implementation

Engine Interface

Shannon's OPA Engine design references go/orchestrator/internal/policy/ directory:

type Engine interface {
    Evaluate(ctx context.Context, input *PolicyInput) (*Decision, error)
    LoadPolicies() error
    IsEnabled() bool
    Environment() string
    Mode() Mode
}

Behind the simple interface are several key design decisions:

  1. Pre-compilation: Policies compiled at startup, directly executed during evaluation
  2. Caching: Decision results for same input are cached
  3. Mode switching: Supports off/dry-run/enforce three modes

Core Data Structures

// ========== PolicyInput: Evaluation Context ==========
type PolicyInput struct {
    SessionID   string    `json:"session_id"`        // Session identifier
    UserID      string    `json:"user_id"`           // User identifier
    AgentID     string    `json:"agent_id"`          // Agent identifier
    Query       string    `json:"query"`             // Request content
    Mode        string    `json:"mode"`              // simple/standard/complex
    Environment string    `json:"environment"`       // dev/staging/prod (key: differentiate environment strictness)
    TokenBudget int       `json:"token_budget"`      // Budget limit
    Timestamp   time.Time `json:"timestamp"`
}

// ========== Decision: Evaluation Result ==========
type Decision struct {
    Allow           bool              `json:"allow"`            // Whether allowed
    Reason          string            `json:"reason"`           // Decision reason
    RequireApproval bool              `json:"require_approval"` // Needs human confirmation (allow but confirm)
    PolicyVersion   string            `json:"policy_version"`   // Policy version hash (for audit)
    AuditTags       map[string]string `json:"audit_tags"`       // Audit tags
}

The Environment field lets you write policies like "loose in development, strict in production". RequireApproval supports "allowed but needs human confirmation" scenarios.


24.4 Policy Loading and Compilation

The following code shows the core policy loading flow, referencing Shannon's implementation:

func (e *OPAEngine) LoadPolicies() error {
    if !e.config.Enabled {
        return nil
    }

    policies := make(map[string]string)

    // Recursively load all .rego files
    err := filepath.Walk(e.config.Path, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }

        if !info.IsDir() && strings.HasSuffix(info.Name(), ".rego") {
            content, err := os.ReadFile(path)
            if err != nil {
                return fmt.Errorf("failed to read policy file %s: %w", path, err)
            }

            relPath, _ := filepath.Rel(e.config.Path, path)
            moduleName := strings.TrimSuffix(relPath, ".rego")
            policies[moduleName] = string(content)

            e.logger.Debug("Loaded policy file",
                zap.String("path", path),
                zap.String("module", moduleName),
            )
        }
        return nil
    })

    if err != nil {
        return fmt.Errorf("failed to walk policy directory: %w", err)
    }

    // Pre-compile policies
    regoOptions := []func(*rego.Rego){
        rego.Query("data.shannon.task.decision"),
    }

    for moduleName, content := range policies {
        regoOptions = append(regoOptions, rego.Module(moduleName, content))
    }

    regoBuilder := rego.New(regoOptions...)
    compiled, err := regoBuilder.PrepareForEval(context.Background())
    if err != nil {
        return fmt.Errorf("failed to compile policies: %w", err)
    }

    e.compiled = &compiled

    // Record policy version for audit
    versionHash := e.calculatePolicyVersion(policies)
    RecordPolicyVersion(e.config.Path, versionHash, loadTimestamp)

    return nil
}

Key design points:

  1. Recursive loading: Supports subdirectories under policies/, convenient for organizing policies by domain
  2. Pre-compilation: Uses PrepareForEval instead of compiling on every evaluation, 10x+ performance improvement
  3. Version tracking: Calculates hash of policy content, convenient for auditing "which version of policy was used for this decision"

24.5 Policy Evaluation Flow

The evaluation flow is the core of the OPA Engine, below shows the complete evaluation logic:

func (e *OPAEngine) Evaluate(ctx context.Context, input *PolicyInput) (*Decision, error) {
    startTime := time.Now()

    defaultDecision := &Decision{
        Allow:  !e.config.FailClosed,
        Reason: "policy engine disabled or no policies loaded",
    }

    if !e.enabled || e.compiled == nil {
        return defaultDecision, nil
    }

    // Check cache first
    if d, ok := e.cache.Get(input); ok {
        RecordCacheHit(string(e.config.Mode))
        return d, nil
    }

    RecordCacheMiss(string(e.config.Mode))

    // Convert input to map
    inputMap, err := e.inputToMap(input)
    if err != nil {
        if e.config.FailClosed {
            return &Decision{Allow: false, Reason: "input conversion failed"}, err
        }
        return defaultDecision, nil
    }

    // Evaluate policy
    results, err := e.compiled.Eval(ctx, rego.EvalInput(inputMap))
    if err != nil {
        RecordError("policy_evaluation", string(e.config.Mode))
        if e.config.FailClosed {
            return &Decision{Allow: false, Reason: "policy evaluation error"}, err
        }
        return defaultDecision, nil
    }

    // Parse results
    decision := e.parseResults(results, input)

    // Apply canary mode
    effectiveMode := e.determineEffectiveMode(input)
    decision = e.applyModeToDecision(decision, effectiveMode, input)

    // Record metrics and audit log
    duration := time.Since(startTime)
    e.recordComprehensiveMetrics(input, decision, effectiveMode, duration)

    // Write to cache
    e.cache.Set(input, decision)
    return decision, nil
}

There's a key concept here: FailClosed vs FailOpen.

ModeBehavior When OPA ErrorsUse Case
FailClosedReject requestSecurity-sensitive systems
FailOpenAllow requestAvailability-priority systems

For production I recommend FailOpen + alerting, rather than FailClosed. The reason: Policy engine being down causing entire system unavailability is worse than letting through a few requests. Of course, this depends on your business scenario.

Decision Cache

Decision results for same input should be cached. Key is designing the cache key well:

type decisionCache struct {
    cap    int
    ttl    time.Duration
    mu     sync.Mutex
    list   *list.List
    m      map[string]*list.Element
    hits   int64
    misses int64
}

func (c *decisionCache) makeKey(input *PolicyInput) string {
    h := fnv.New64a()
    h.Write([]byte(strings.ToLower(input.Query)))
    qh := h.Sum64()
    comp := fmt.Sprintf("%.2f", input.ComplexityScore)
    return fmt.Sprintf("%s|%s|%s|%s|%d|%s|%x",
        input.Environment, input.Mode, input.UserID,
        input.AgentID, input.TokenBudget, comp, qh,
    )
}

Cache key design considerations:

  • Include environment: Same request in dev and prod may have different decisions
  • Hash the query: Avoid key being too long
  • Keep 2 decimal places for complexity: Reduce key changes due to floating point precision

24.6 Writing Rego Policies

Shannon's base policy references config/opa/policies/base.rego. Core design principle: default deny, deny takes priority.

Basic Structure and Environment Differentiation

package shannon.task
import future.keywords.in

# ========== Default Deny (Most Important) ==========
default decision := {"allow": false, "reason": "default deny - no matching rule"}

# Deny rules take priority over all allow rules
decision := {"allow": false, "reason": reason} { some reason; deny[reason] }

# ========== Environment Differentiation ==========
# Development environment: Loose (but still has budget limits)
decision := {"allow": true, "reason": "dev environment"} {
    input.environment == "dev"
    input.token_budget <= 10000
}
# Production environment: Strict (requires user whitelist + no suspicious queries)
decision := {"allow": true, "reason": "authorized user"} {
    input.environment == "prod"
    allowed_users[input.user_id]
    input.token_budget <= 5000
    not suspicious_query
}

Why default deny? Default allow means you must foresee all dangerous situations; default deny only needs to list known safe situations, unknown situations are automatically denied.

User Management and Query Pattern Matching

# ========== User Whitelist/Blacklist ==========
allowed_users := {"admin", "orchestrator", "shannon_system", "api_user"}
privileged_users := {"admin", "shannon_system", "security_admin"}
blocked_users := {"blocked_user", "suspended_account"}

# ========== Query Pattern Levels ==========
safe_patterns := {"what is", "how to", "explain", "summarize"}  # Safe
suspicious_patterns := {"delete", "hack", "bypass", "admin", "sudo"}  # Suspicious
dangerous_patterns := {"rm -rf", "drop table", "/etc/passwd", "api key"}  # Dangerous

suspicious_query { count([p | suspicious_patterns[p]; contains(lower(input.query), p)]) > 0 }
dangerous_query { count([p | dangerous_patterns[p]; contains(lower(input.query), p)]) > 0 }

Deny Rules (Highest Priority)

# Dangerous query patterns
deny[sprintf("dangerous: %s", [p])] { dangerous_patterns[p]; contains(lower(input.query), p) }
# Exceeds budget limit
deny[sprintf("budget %d exceeds max %d", [input.token_budget, 50000])] { input.token_budget > 50000 }
# Blocked user
deny[sprintf("user %s blocked", [input.user_id])] { blocked_users[input.user_id] }
# Unauthorized user in production
deny["unauthorized user in prod"] {
    input.environment == "prod"; input.user_id != ""
    not allowed_users[input.user_id]; not privileged_users[input.user_id]
}

Budget Limits

max_budgets := {"simple": 1000, "standard": 5000, "complex": 15000}
system_limits := {"max_tokens": 50000, "max_concurrent_requests": 20}

decision := {"allow": false, "reason": sprintf("budget %d exceeds max %d for %s",
    [input.token_budget, max_budgets[input.mode], input.mode])} {
    max_budgets[input.mode] < input.token_budget
}

24.7 Execution Modes and Canary Release

One of OPA's most valuable features: Safely releasing new policies.

ModePolicy EvaluationActual BlockingLog RecordingPurpose
offNoNoNoMaintenance mode
dry-runYesNoYesTesting new policies
enforceYesYesYesOfficial execution

Mode and Canary Configuration

// ========== Execution Modes ==========
type Mode string
const (
    ModeOff     Mode = "off"      // Policy disabled
    ModeDryRun  Mode = "dry-run"  // Log only, don't execute
    ModeEnforce Mode = "enforce"  // Enforce
)

// ========== Canary Configuration ==========
type CanaryConfig struct {
    Enabled           bool     `yaml:"enabled"`
    EnforcePercentage int      `yaml:"enforce_percentage"` // 0-100%
    EnforceUsers      []string `yaml:"enforce_users"`      // Whitelist users
    DryRunUsers       []string `yaml:"dry_run_users"`      // Force dry-run users
}

// ========== Determine Effective Mode (Priority: Emergency Switch > Explicit User > Percentage) ==========
func (e *OPAEngine) determineEffectiveMode(input *PolicyInput) Mode {
    if e.config.EmergencyKillSwitch { return ModeDryRun }  // Emergency switch overrides everything
    if !e.config.Canary.Enabled { return e.config.Mode }
    // Explicit dry-run/enforce user check
    for _, u := range e.config.Canary.DryRunUsers { if input.UserID == u { return ModeDryRun } }
    for _, u := range e.config.Canary.EnforceUsers { if input.UserID == u { return ModeEnforce } }
    // Percentage-based canary
    if e.config.Canary.EnforcePercentage > 0 {
        if int(hash(input.UserID) % 100) < e.config.Canary.EnforcePercentage { return ModeEnforce }
    }
    return ModeDryRun
}

// ========== Dry-Run Handling: Evaluate But Don't Block ==========
func (e *OPAEngine) applyModeToDecision(decision *Decision, mode Mode) *Decision {
    if mode == ModeDryRun && !decision.Allow {
        decision.Allow = true  // Force allow
        decision.Reason = "DRY-RUN: would have denied - " + decision.Reason
    }
    return decision
}

Canary release flow: Write new policy -> Deploy as dry-run -> Observe logs for a week -> Gradually increase enforce percentage -> Full enforce


24.8 Configuration and Deployment

Complete Configuration

# config/shannon.yaml
policy:
  enabled: true
  path: "/app/config/opa/policies"
  mode: "dry-run"  # off, dry-run, enforce
  environment: "prod"

  fail_closed: false  # On failure: true=reject, false=allow

  emergency_kill_switch: false  # Force dry-run

  cache:
    enabled: true
    size: 1000
    ttl: "5m"

  canary:
    enabled: true
    enforce_percentage: 10  # 10% of requests will be enforced
    enforce_users:
      - "admin"
      - "senior_engineer"
    enforce_agents:
      - "synthesis-agent"
    dry_run_users:
      - "test_user"

Policy Hot Update

Shannon supports updating policies without restarting services:

# 1. Modify policy file
vim config/opa/policies/custom.rego

# 2. Trigger reload (send SIGHUP)
docker compose exec orchestrator kill -HUP 1

# 3. Verify loading
docker compose logs orchestrator | grep "Policies loaded"

Policy Testing

Always test before going live. OPA provides command line tools:

# Test using OPA CLI
opa eval --bundle config/opa/policies \
  --input test/policy_input.json \
  --data config/opa/policies \
  'data.shannon.task.decision'

Test input sample:

{
  "user_id": "test_user",
  "query": "help me understand machine learning",
  "mode": "simple",
  "token_budget": 500,
  "environment": "dev"
}

24.9 Monitoring and Audit

Key Metrics

MetricTypeDescription
policy_evaluations_totalCounterPolicy evaluation count
policy_evaluation_duration_secondsHistogramEvaluation duration
policy_decisions_total{decision}CounterCount by decision type
policy_cache_hits_totalCounterCache hits
policy_deny_reasons_total{reason}CounterCount by deny reason

Audit Log

Every policy evaluation should be logged:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "event": "policy_evaluation",
  "user_id": "developer_1",
  "session_id": "sess-abc123",
  "query_hash": "a1b2c3d4",
  "decision": "deny",
  "reason": "dangerous pattern detected: drop table",
  "effective_mode": "enforce",
  "policy_version": "abc123ef",
  "duration_ms": 2.5
}

Key fields:

  • query_hash: Hash of query content, don't log original text (privacy consideration)
  • policy_version: Hash of policy files, convenient for tracing "which version was used for this decision"
  • effective_mode: Actual execution mode (after considering canary configuration)

Alert Rules

- alert: HighPolicyDenyRate
  expr: |
    sum(rate(policy_decisions_total{decision="deny"}[5m])) /
    sum(rate(policy_decisions_total[5m])) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High policy deny rate (> 10%)"

- alert: PolicyEvaluationSlow
  expr: histogram_quantile(0.95, policy_evaluation_duration_seconds) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Policy evaluation P95 latency > 50ms"

24.10 Common Pitfalls

PitfallProblemSolution
Forgetting Default DenyUnmatched rules return undefined (interpreted as allow)Explicitly define default decision
Deny PriorityAllow rules might bypass denyUse deny set + count(deny)==0 check
Incomplete Cache KeyField changes return wrong cached resultInclude all fields affecting decision
Ignoring Compile ErrorsPolicy syntax errors cause undefined behaviorExplicitly handle errors, decide fail-open/closed
# ========== Pitfall 1: Forgetting Default Deny ==========
decision := {"allow": true} { input.mode == "simple" }  # Wrong: no default
default decision := {"allow": false, "reason": "no match"}  # Correct

# ========== Pitfall 2: Deny Priority ==========
decision := {"allow": true} { ... }                        # Wrong: might bypass deny
decision := {"allow": false} { dangerous_query }
# Correct: deny set checked first
deny["dangerous"] { dangerous_query }
decision := {"allow": false, "reason": r} { some r; deny[r] }
decision := {"allow": true, ...} { count(deny) == 0; ... }
// ========== Pitfall 3: Incomplete Cache Key ==========
return fmt.Sprintf("%s|%s", input.UserID, input.Mode)  // Wrong: missing Environment etc.
return fmt.Sprintf("%s|%s|%s|%s|%d|%s",                // Correct: include all decision-affecting fields
    input.Environment, input.Mode, input.UserID, input.AgentID, input.TokenBudget, queryHash)

// ========== Pitfall 4: Ignoring Compile Errors ==========
compiled, _ := regoBuilder.PrepareForEval(ctx)         // Wrong: ignoring error
compiled, err := regoBuilder.PrepareForEval(ctx)       // Correct: explicit handling
if err != nil {
    if failClosed { return nil, err }                  // FailClosed: reject all
    e.enabled = false                                  // FailOpen: disable policy engine
}

24.11 Framework Comparison

OPA isn't the only policy engine. How do other frameworks handle access control?

FrameworkPolicy MechanismHot UpdateCanary ReleaseAudit Log
Shannon + OPADeclarative RegoSupportedBuilt-in CanaryComplete
LangChainNo built-inN/AImplement yourselfImplement yourself
LangGraphCallbackRequires restartImplement yourselfImplement yourself
CrewAINo built-inN/AN/AN/A
KubernetesOPA GatekeeperSupportedSupportedComplete

OPA's advantage is that it's a general-purpose policy engine, not just designed for Agent systems. You can use the same tech stack to manage Kubernetes admission control, API gateway authorization, microservices access control.


What This Chapter Covered

  1. Default Deny: Always define default decision as deny, safety first
  2. Deny Priority: Use deny set to implement deny rule priority, avoid bypasses
  3. Canary Release: First dry-run to validate, then canary rollout by percentage
  4. Decision Cache: LRU cache reduces evaluation overhead, note cache key design
  5. Complete Audit: Record every decision's reason and policy version

Shannon Lab (10-Minute Quickstart)

This section helps you map this chapter's concepts to Shannon source code in 10 minutes.

Required Reading (1 file)

  • config/opa/policies/base.rego: See where default decision is defined to understand "default deny", usage of deny set, query pattern matching design with dangerous_patterns set

Optional Deep Dive (pick 2 based on interest)

  • go/orchestrator/internal/policy/engine.go (if exists): Understand OPA Engine initialization and evaluation flow, see LoadPolicies and Evaluate functions
  • OPA Playground (https://play.openpolicyagent.org/): Test Rego policies online, copy base.rego there and test with different inputs

Exercises

Exercise 1: Design a New Policy

Scenario: Product requires "no complex tasks during off-hours (22:00-08:00)". Write the corresponding Rego rule, requirements:

  • Use input.timestamp field
  • Only affects tasks where mode == "complex"
  • Provide clear deny reason

Exercise 2: Source Code Understanding

Read Shannon's config/opa/policies/base.rego:

  1. If a request matches both allowed_users and dangerous_query, will it ultimately be allowed or denied? Why?
  2. What's the purpose of the safe_query_check rule? What happens if you delete it?

Exercise 3 (Advanced): Design Canary Release Plan

You wrote a new deny rule and need to roll it out to production. Design a canary release plan including:

  • What to do first week (hint: dry-run)
  • How to confirm new rule won't false-positive normal requests
  • How to rollback if problems occur

Further Reading


Next Chapter Preview

OPA solved the "who can do what" problem. But there's another question: Is tool execution safe?

An Agent called a Python script. Will this script:

  • Read sensitive files?
  • Make network requests to leak data?
  • Consume lots of CPU/memory dragging down the system?
  • Execute malicious code?

These questions, OPA can't handle. You need sandboxing.

Next chapter we'll discuss WASI Sandbox Secure Execution—using WebAssembly to isolate tool execution, preventing malicious code from escaping.

See you in Chapter 25.

Cite this article
Zhang, W. (2026). Chapter 24: Policy Governance. In AI Agent Architecture: From Single Agent to Enterprise Multi-Agent Systems. https://waylandz.com/ai-agent-book-en/chapter-24-policy-governance
@incollection{zhang2026aiagent_en_chapter_24_policy_governance,
  author = {Zhang, Wayland},
  title = {Chapter 24: Policy Governance},
  booktitle = {AI Agent Architecture: From Single Agent to Enterprise Multi-Agent Systems},
  year = {2026},
  url = {https://waylandz.com/ai-agent-book-en/chapter-24-policy-governance}
}