Lesson 20: Production Operations

Goal: Make the system run correctly even when you're not there, and notify you promptly when problems occur.


A Typical Scenario (Illustrative)

Note: The following is a synthetic example to illustrate common phenomena; numbers are illustrative and do not correspond to any specific individual/account.

In 2022, an independent quantitative trader launched an auto-trading system he had carefully refined for six months. Backtests looked perfect, and paper trading had run stably for three months. He confidently put $500,000 into live trading.

For the first two weeks, the system performed normally. On Monday of the third week, he was woken by a text message: insufficient account balance.

Investigation revealed that after Friday's close, the data source API had been updated and its response format had changed. The system failed to parse the new format and threw an exception, but the exception was silently swallowed. Over the weekend the system produced no log output (he assumed weekends had no trading, so he didn't check). At Monday's open, the system reconnected, but its internal state was corrupted, and it executed a series of absurd orders, including a massive short position.

By the time he discovered the problem, the account had lost 40%.

Where did things go wrong?

  1. No health checks: Data source failure wasn't detected
  2. Exceptions were swallowed: Errors weren't reported
  3. No monitoring alerts: Two days of zero logs and nobody knew
  4. No circuit breaker: Didn't stop trading when state was abnormal

This is why "code runs" and "system is reliable" are two different things. Operations isn't an afterthought - it's part of system design.


20.1 Monitoring System

Four Levels of Monitoring

[Diagram: Four Levels of Monitoring System]

Key Monitoring Metrics

| Level | Metric | Threshold Example | Alert Level |
|---|---|---|---|
| Infrastructure | Server alive | Heartbeat timeout 30s | Critical |
| Infrastructure | Network latency | > 200ms | Warning |
| Service | CPU usage | > 80% for 5 minutes | Warning |
| Service | Memory usage | > 90% | Critical |
| Service | Disk space | < 10GB | Warning |
| Application | Data source connection | Disconnected > 1 minute | Critical |
| Application | API success rate | < 95% | Warning |
| Application | Task queue backlog | > 100 items | Warning |
| Application | Agent response latency | > 5s | Warning |
| Business | Daily drawdown | > 3% | Warning |
| Business | Weekly drawdown | > 5% | Critical |
| Business | Abnormal trade | Single trade > 10% of position | Critical |
| Business | Trading frequency | 3x deviation from mean | Warning |

Health Check Design

[Diagram: Health Check Checklist]
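As a minimal sketch of a health-check harness (class and check names are illustrative, not from the lesson): register one callable per subsystem, run them all on a schedule, and never let an exception disappear silently — the exact failure mode from the opening story.

```python
class HealthChecker:
    """Runs registered checks and reports overall system health.
    Check names and the boolean contract are illustrative assumptions."""

    def __init__(self):
        self.checks = {}  # name -> callable returning True if healthy

    def register(self, name, check_fn):
        self.checks[name] = check_fn

    def run(self):
        """Return (healthy, results) where results maps name -> status string."""
        results = {}
        for name, fn in self.checks.items():
            try:
                results[name] = "ok" if fn() else "fail"
            except Exception as exc:
                # Never swallow exceptions: record them as explicit failures
                results[name] = f"error: {exc}"
        healthy = all(status == "ok" for status in results.values())
        return healthy, results

checker = HealthChecker()
checker.register("data_feed", lambda: True)   # e.g. last tick fresher than 60s
checker.register("broker_api", lambda: True)  # e.g. heartbeat round-trip succeeded
```

A real system would run this from a scheduler and feed failures into the alerting tier described in Section 20.3.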

20.2 Logging System

Golden Rules of Logging

Logs are a product, not a byproduct.

| Principle | Explanation | Anti-Pattern |
|---|---|---|
| Structured | JSON format for easy querying and analysis | Free text "buy successful" |
| Traceable | Every log contains a trace_id | Can't correlate logs for the same trade |
| Leveled | DEBUG/INFO/WARN/ERROR | Everything uses print() |
| Contextual | Include time, symbol, price, quantity | Just "Error occurred" |
| Immutable | Cannot be modified after writing | Logs stored locally and editable |

Log Levels

| Level | Usage | Example |
|---|---|---|
| DEBUG | Development debugging, disabled in production | "Calculating RSI for AAPL..." |
| INFO | Normal business flow | "Order submitted: BUY AAPL 100@$180" |
| WARN | Abnormal but recoverable | "API rate limited, retrying in 5s" |
| ERROR | Abnormal and needs attention | "Order rejected: insufficient margin" |
| CRITICAL | System-level failure | "Database connection lost" |

Standard Trading Log Format

{
  "timestamp": "2024-01-15T09:30:00.123Z",
  "level": "INFO",
  "service": "execution_agent",
  "trace_id": "tx_20240115_001",
  "event": "order_submitted",
  "data": {
    "symbol": "AAPL",
    "side": "BUY",
    "quantity": 100,
    "price": 180.50,
    "order_type": "LIMIT",
    "order_id": "ORD_12345"
  },
  "context": {
    "signal_id": "sig_20240115_001",
    "signal_strength": 0.75,
    "regime": "trending",
    "portfolio_value": 1000000
  }
}

Key Fields:

  • trace_id: Connects the complete chain from signal to execution
  • event: Standardized event name for easy statistical analysis
  • context: Decision basis for post-hoc attribution
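A log entry in this format can be built with a small helper; this is one possible serialization, assuming UTC timestamps and the field names from the template above (the function name is ours):

```python
import json
from datetime import datetime, timezone

def make_log_entry(level, service, trace_id, event, data, context=None):
    """Serialize one structured log line matching the lesson's template.
    Field names follow the template; the Z-suffix timestamp is our choice."""
    entry = {
        "timestamp": datetime.now(timezone.utc)
            .isoformat(timespec="milliseconds")
            .replace("+00:00", "Z"),
        "level": level,
        "service": service,
        "trace_id": trace_id,   # ties signal -> risk -> execution together
        "event": event,
        "data": data,
        "context": context or {},
    }
    return json.dumps(entry)

line = make_log_entry(
    "INFO", "execution_agent", "tx_20240115_001", "order_submitted",
    {"symbol": "AAPL", "side": "BUY", "quantity": 100},
)
```

Because every line is valid JSON with a fixed schema, downstream tools can filter by `trace_id` or aggregate by `event` without regex parsing.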

20.3 Alerting System

Alert Design Principles

| Principle | Explanation |
|---|---|
| Alerts must be actionable | You know what to do when you receive the alert |
| Avoid alert fatigue | Too many alerts = no alerts |
| Tiered handling | Different levels use different channels |
| Include context | Alert info is sufficient to diagnose the problem |

Alert Channels

| Level | Channel | Response Time Requirement |
|---|---|---|
| Low | Email, daily summary | Next day |
| Medium | Slack/Teams message | Within 1 hour |
| High | SMS + phone call | Within 5 minutes |
| Critical | Auto circuit breaker + phone | Immediate |

Alert Template

[CRITICAL] Trading System Alert

Time: 2024-01-15 10:30:15 EST
Service: Risk Agent
Event: Drawdown triggered control line

Details:
- Current drawdown: 10.2%
- Trigger threshold: 10%
- Today's P&L: -$52,000
- Account equity: $948,000

Actions taken:
- Stopped new positions
- Started deleveraging process

You need to:
- Check for abnormal trades
- Confirm if manual intervention needed

Dashboard: https://dashboard.example.com/alerts/12345

Alert Suppression

| Problem | Solution |
|---|---|
| Same issue alerts repeatedly | Aggregation: same-type alerts fire at most once per 5 minutes |
| Momentary jitter triggers alerts | Duration: threshold must persist N minutes before alerting |
| Nighttime non-trading hours | Quiet period: downgrade to low priority outside trading hours |
| Known maintenance windows | Suppression: no alerts during preset maintenance periods |
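The aggregation rule in the table can be sketched as a small suppressor keyed by alert type; the 5-minute default mirrors the table, everything else is an illustrative assumption:

```python
import time

class AlertSuppressor:
    """Suppress duplicate alerts of the same type within a time window.
    A sketch of the 'aggregation' rule above; names are ours."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.last_sent = {}  # alert_type -> timestamp of last sent alert

    def should_send(self, alert_type, now=None):
        now = time.time() if now is None else now
        last = self.last_sent.get(alert_type)
        if last is not None and now - last < self.window:
            return False  # same-type alert inside the window: suppress
        self.last_sent[alert_type] = now
        return True
```

Duration-based rules (threshold must persist N minutes) would sit in front of this, so that a metric spike only becomes an alert candidate after it has held long enough.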

20.4 Disaster Recovery

Failure Classification

| Failure Type | Example | Impact | Recovery Strategy |
|---|---|---|---|
| Data source failure | API unavailable | Cannot get quotes | Switch to backup source |
| Trading interface failure | Broker maintenance | Cannot place orders | Pause trading, log pending orders |
| Local service failure | Agent process crash | Partial function loss | Auto-restart |
| Network failure | Network outage | Complete failure | Wait for recovery, verify state manually |
| Data error | Abnormal price jumps | Wrong decisions | Detect anomaly, pause processing |

Recovery Flow

[Diagram: Disaster Recovery Flow]

State Consistency Checks

Must verify after recovery:

| Check Item | Method | Inconsistency Handling |
|---|---|---|
| Position consistency | Compare system records with broker | Use broker as source of truth, update local |
| Order status | Query all pending orders | Cancel or confirm |
| Cash balance | Compare calculated vs. actual | Recalculate |
| Data completeness | Check for missing market data | Backfill missing data |
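The position-consistency check reduces to a set comparison that treats the broker as the source of truth; a minimal sketch (function name and dict shapes are ours):

```python
def reconcile_positions(local, broker):
    """Compare local position records (symbol -> quantity) against the
    broker's, returning the corrections needed. Broker wins on conflict."""
    corrections = {}
    for symbol in set(local) | set(broker):
        local_qty = local.get(symbol, 0)
        broker_qty = broker.get(symbol, 0)
        if local_qty != broker_qty:
            # Record both sides so the discrepancy can be audited later
            corrections[symbol] = {"local": local_qty, "broker": broker_qty}
    return corrections
```

Iterating over the union of both symbol sets matters: a position that exists only at the broker (a fill the system missed) is exactly the kind of state drift that must be caught after recovery.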

20.5 Daily Operations Checklists

Pre-Market Checks (Daily Must-Do)

[Diagram: Pre-Market Checklist]

Market-Hours Automation

One of the most overlooked reliability requirements is automated scheduling around market hours. Production systems should not depend on a human remembering to start the system each morning.

Automation Schedule (US Equities, Eastern Time):

09:15 ET  Auto-start: data connections, health checks, position reconciliation
09:25 ET  Pre-market validation: confirm data feeds live, risk parameters loaded
09:30 ET  Market open: trading logic active
15:50 ET  End-of-day: stop new position entries, begin close-only mode
16:00 ET  Market close: final position snapshot
16:05 ET  Post-close: P&L calculation, log aggregation, alert summary
16:30 ET  Auto-stop: graceful shutdown of non-essential processes

Weekday-only: Use cron or systemd timers with market calendar awareness.
Skip holidays (Martin Luther King Day, Thanksgiving, etc.).

Why this matters: A system that requires manual startup introduces human error as a daily risk. One forgotten morning during a volatile open can cause the kind of state-confusion disaster described in this lesson's opening story. Automate the lifecycle, and treat manual intervention as the exception.
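The schedule above can be expressed as a pure function from clock time to lifecycle stage, which the scheduler then polls; this is a sketch with a hypothetical one-entry holiday set (a real system would use a proper market-calendar library):

```python
from datetime import datetime, time as dtime

# Hypothetical holiday list for illustration; use a market calendar in practice
HOLIDAYS = {"2024-01-15"}  # Martin Luther King Day

def system_state(now: datetime) -> str:
    """Map an Eastern-Time timestamp to a lifecycle stage from the schedule."""
    if now.weekday() >= 5 or now.strftime("%Y-%m-%d") in HOLIDAYS:
        return "off"            # weekends and holidays: system stays down
    t = now.time()
    if t < dtime(9, 15):
        return "off"
    if t < dtime(9, 30):
        return "startup"        # health checks, reconciliation, validation
    if t < dtime(15, 50):
        return "trading"
    if t < dtime(16, 0):
        return "close_only"     # no new position entries
    if t < dtime(16, 30):
        return "post_close"     # snapshot, P&L, log aggregation
    return "off"
```

Because the function is deterministic, it is trivially testable against known dates — including the holiday case that a naive weekday-only cron rule would get wrong.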

Post-Market Checks (Daily Must-Do)

[Diagram: Post-Market Checklist]

Weekly/Monthly Checks

| Frequency | Check Item | Purpose |
|---|---|---|
| Weekly | Strategy performance review | Identify abnormal trends |
| Weekly | System resource usage trends | Prevent resource exhaustion |
| Weekly | Alert summary analysis | Discover systemic issues |
| Monthly | Complete P&L attribution | Strategy evaluation |
| Monthly | Model drift detection | Identify failing signals |
| Monthly | Infrastructure assessment | Cost and performance optimization |
| Monthly | Disaster recovery drill | Verify recovery process |
20.6 Provider Abstraction

As your system matures, you will inevitably switch data providers, add execution venues, or run across multiple brokers. If your core logic is tightly coupled to a specific vendor's API, each change becomes a rewrite.

The solution is a clean provider interface that isolates vendor-specific details behind a stable contract:

Data Provider Interface:
  Connect(config)     -> connection
  Subscribe(symbols)  -> stream
  Stream()            -> MarketDataEvent

Execution Venue Interface:
  SubmitOrder(order)   -> OrderAck
  CancelOrder(id)      -> CancelAck
  GetPositions()       -> []Position

Each vendor gets a thin adapter that implements this interface. Your Signal Agent, Risk Agent, and Monitor Agent never know or care which broker or data source is behind the interface. This pattern enables multi-venue trading, failover to backup data sources, and clean testing with mock providers -- all without touching core logic.

Design Principle: Treat every external dependency (data, execution, alerts) as a replaceable module behind an interface. The first vendor switch will repay the abstraction cost many times over.
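The execution-venue contract from the sketch above could be written as an abstract base class plus a mock adapter for testing; the class and method names below are one possible rendering, not a prescribed API:

```python
from abc import ABC, abstractmethod

class ExecutionVenue(ABC):
    """Stable contract for execution venues; vendor adapters implement it."""

    @abstractmethod
    def submit_order(self, order: dict) -> dict: ...

    @abstractmethod
    def cancel_order(self, order_id: str) -> dict: ...

    @abstractmethod
    def get_positions(self) -> list: ...

class MockVenue(ExecutionVenue):
    """In-memory adapter: useful for unit tests and dry runs."""

    def __init__(self):
        self.orders = {}

    def submit_order(self, order):
        order_id = f"MOCK_{len(self.orders) + 1}"
        self.orders[order_id] = dict(order, status="accepted")
        return {"order_id": order_id, "status": "accepted"}

    def cancel_order(self, order_id):
        self.orders[order_id]["status"] = "cancelled"
        return {"order_id": order_id, "status": "cancelled"}

    def get_positions(self):
        return []  # the mock holds no real positions
```

A broker-specific adapter would implement the same three methods against the vendor SDK; the agents above the interface never change when the vendor does.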


20.7 Model Deployment and CI/CD

20.7.1 Why Trading Systems Need CI/CD

Every deployment is a risk event. CI/CD makes this risk controllable.

Problems with traditional approach:

| Problem | Consequence |
|---|---|
| Manual deployment | Easy to miss steps and misconfigure |
| No version tracking | Can't roll back to a known good version when problems occur |
| Inconsistent environments | "Works on my machine," but fails in production |
| Insufficient testing | New code introduces bugs discovered only after deployment |
| No model update records | Don't know when or why models were updated |

20.7.2 CI/CD Pipeline for Trading Systems

[Diagram: Model Deployment Pipeline]

20.7.3 Stage Details

Stage 1: Build

# Example: GitHub Actions configuration
build:
  steps:
    - name: Checkout code
      uses: actions/checkout@v3

    - name: Install dependencies
      run: pip install -r requirements.txt

    - name: Type checking
      run: mypy src/ --strict

    - name: Lint
      run: ruff check src/

    - name: Package model
      run: |
        python -m src.models.package \
          --model-path models/signal_v3.pkl \
          --output artifacts/

Stage 2: Test

| Test Type | Content | Pass Criteria |
|---|---|---|
| Unit tests | Function logic is correct | 100% pass |
| Integration tests | Agent collaboration works | 100% pass |
| Backtest validation | Performance on historical data | Sharpe > threshold, drawdown < threshold |
| Sanity checks | No obvious overfitting | Out-of-sample Sharpe > in-sample x 0.7 |

# Backtest validation example (run_backtest is the project's own backtest entry point)
def test_model_performance():
    """Ensure the new model meets minimum standards in backtest"""
    results = run_backtest(
        model='models/signal_v3.pkl',
        start_date='2022-01-01',
        end_date='2023-12-31'
    )

    assert results['sharpe'] >= 1.0, f"Sharpe ratio insufficient: {results['sharpe']}"
    assert results['max_drawdown'] <= 0.15, f"Drawdown too large: {results['max_drawdown']}"
    assert results['win_rate'] >= 0.45, f"Win rate insufficient: {results['win_rate']}"

    # Overfitting check: out-of-sample Sharpe should retain at least 70% of in-sample
    in_sample_sr = results['in_sample_sharpe']
    out_sample_sr = results['out_sample_sharpe']
    assert out_sample_sr >= in_sample_sr * 0.7, "Out-of-sample performance decay too large, possible overfitting"

Quality Gate: Dual-Layer Backtest Validation

A robust CI/CD pipeline should require two layers of backtest validation before any deployment:

  1. Fast vectorized backtest -- Tests pure signal logic (factor returns, IC, turnover) in seconds. Catches regressions in alpha generation quickly.
  2. OMS-integrated backtest -- Routes signals through the full order management path (risk checks, order splitting, simulated fills with slippage). Catches execution-reality gaps that vectorized tests miss.

Both layers must pass. A signal that looks profitable in vectorized testing but fails after realistic execution modeling is not deployable. Making this a required CI gate prevents the most common class of production failures: strategies that work in theory but lose money to friction.

Stage 3: Stage (Pre-Release)

Shadow Mode:

[Diagram: Shadow Mode Deployment]

Shadow Mode Running Standards:

| Metric | Pass Criteria | Check Frequency |
|---|---|---|
| Signal consistency | New/old signal correlation > 0.9, or a clearly explained improvement | Real-time |
| Latency | New model latency < production model x 1.2 | Hourly |
| Resource usage | CPU/memory within budget | Every 5 minutes |
| Abnormal signals | No extreme signals (> 3 standard deviations) | Real-time |
| Run duration | At least 5 trading days | - |
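The signal-consistency criterion is a plain Pearson correlation between the old and new models' signal series; a dependency-free sketch (function names are ours, the 0.9 threshold comes from the table):

```python
import math

def signal_correlation(old, new):
    """Pearson correlation between old- and new-model signal series."""
    n = len(old)
    mean_o = sum(old) / n
    mean_n = sum(new) / n
    cov = sum((o - mean_o) * (v - mean_n) for o, v in zip(old, new))
    var_o = sum((o - mean_o) ** 2 for o in old)
    var_n = sum((v - mean_n) ** 2 for v in new)
    return cov / math.sqrt(var_o * var_n)

def shadow_check(old, new, threshold=0.9):
    """True if the shadow model's signals track production closely enough."""
    return signal_correlation(old, new) >= threshold
```

Note that a correlation below threshold is not automatically a failure — the table allows a "clearly explained improvement" — so in practice a failed check should open a human review rather than reject the model outright.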

Stage 4: Prod (Production)

Gradual Release Strategies:

| Strategy | Applicable Scenario | Risk Control |
|---|---|---|
| Canary | Route a small portion of capital to the new model first | 5% -> 25% -> 50% -> 100% |
| Blue-Green | Instant switch while keeping rollback capability | Can switch back to the old version anytime |
| Rolling | Gradually replace Agent instances | Replace one at a time |

import random
import logging

log = logging.getLogger(__name__)

class CanaryDeployment:
    """Canary deployment controller"""

    def __init__(self, old_model, new_model):
        self.old_model = old_model
        self.new_model = new_model
        self.canary_weight = 0.05  # Start at 5%

    def get_signal(self, market_data: dict) -> dict:
        # Distribute traffic between models by weight
        if random.random() < self.canary_weight:
            signal = self.new_model.predict(market_data)
            signal['model_version'] = 'canary'
        else:
            signal = self.old_model.predict(market_data)
            signal['model_version'] = 'stable'
        return signal

    def promote_canary(self, new_weight: float):
        """Increase canary weight, gradually ramping up"""
        if new_weight > self.canary_weight:
            log.info(f"Promoting canary weight: {self.canary_weight:.0%} -> {new_weight:.0%}")
            self.canary_weight = new_weight

    def rollback(self):
        """Roll back to the stable version"""
        log.warning("Canary rollback! Switching to stable version")
        self.canary_weight = 0.0

20.7.4 Model Version Management

Version Naming Convention:

Model version format: v{major}.{minor}.{patch}-{timestamp}

Example:
  v2.3.1-20240115  # Version v2.3.1 from January 15, 2024

Version rules:
  major: Model architecture change (e.g., from XGBoost to neural network)
  minor: Feature or parameter adjustment
  patch: Bug fix

Model Registry:

| Field | Example | Description |
|---|---|---|
| model_id | signal_v2.3.1 | Unique identifier |
| created_at | 2024-01-15 10:30:00 | Creation time |
| created_by | auto_train_pipeline | Creator (person or pipeline) |
| metrics | {"sharpe": 1.5, "drawdown": 0.08} | Validation metrics |
| status | staging / production / retired | Current status |
| artifact_path | s3://models/signal_v2.3.1.pkl | Storage location |
| config_hash | a1b2c3d4... | Configuration file hash |
| data_hash | e5f6g7h8... | Training data hash (reproducibility) |
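A minimal in-memory registry holding these fields might look like the following sketch (the dataclass and `promote` semantics are illustrative assumptions; production registries are typically backed by a database or MLflow-style service):

```python
from dataclasses import dataclass

@dataclass
class ModelRecord:
    """One registry row with the fields from the table above."""
    model_id: str
    created_at: str
    created_by: str
    metrics: dict
    status: str          # staging / production / retired
    artifact_path: str
    config_hash: str
    data_hash: str

class ModelRegistry:
    def __init__(self):
        self.records = {}

    def register(self, record: ModelRecord):
        self.records[record.model_id] = record

    def promote(self, model_id: str):
        """Promote a model to production, retiring the current production model."""
        for rec in self.records.values():
            if rec.status == "production":
                rec.status = "retired"
        self.records[model_id].status = "production"
```

Keeping `config_hash` and `data_hash` alongside the artifact is what makes a model reproducible: the same code, config, and data snapshot should regenerate the same model.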

20.7.5 Auto Rollback

import numpy as np

# Assumed module-level services provided elsewhere in the system:
# log, model_registry, alert_system, audit_log

class AutoRollback:
    """Auto rollback controller"""

    def __init__(self, rollback_thresholds: dict):
        self.thresholds = rollback_thresholds
        self.metrics_buffer = []  # rolling window of recent metrics

    def check_and_rollback(self, current_metrics: dict) -> bool:
        """Check whether a rollback is needed"""

        # Immediate rollback conditions (single trigger)
        if current_metrics.get('error_rate', 0) > self.thresholds['max_error_rate']:
            self.trigger_rollback("Error rate too high")
            return True

        if current_metrics.get('latency_p99', 0) > self.thresholds['max_latency']:
            self.trigger_rollback("Latency too high")
            return True

        # Cumulative rollback conditions (trend judgment)
        self.metrics_buffer.append(current_metrics)
        if len(self.metrics_buffer) >= 10:  # enough samples for a trend
            recent_sharpe = np.mean([m['sharpe'] for m in self.metrics_buffer[-10:]])
            if recent_sharpe < self.thresholds['min_sharpe']:
                self.trigger_rollback(f"Sharpe declined: {recent_sharpe:.2f}")
                return True

        return False

    def trigger_rollback(self, reason: str):
        """Execute the rollback"""
        log.error(f"Triggering auto rollback: {reason}")

        # 1. Switch to the last stable version
        model_registry.activate('last_stable')

        # 2. Send alert
        alert_system.send(
            level='critical',
            title='Model Auto Rollback',
            message=f'Reason: {reason}'
        )

        # 3. Record rollback event
        audit_log.record('rollback', {'reason': reason})

20.7.6 CI/CD Implementation Checklist

+-------------------------------------------------------------+
|               CI/CD Implementation Checklist                 |
+-------------------------------------------------------------+
|                                                             |
|  [ ] Version Control                                        |
|    +-- All code in Git                                      |
|    +-- Model files have version numbers                     |
|    +-- Configuration separated from code                    |
|    +-- Training data has snapshots                          |
|                                                             |
|  [ ] Automated Testing                                      |
|    +-- Unit tests cover core logic                          |
|    +-- Backtest as part of CI                               |
|    +-- Performance benchmark tests                          |
|    +-- Overfitting detection                                |
|                                                             |
|  [ ] Deployment Process                                     |
|    +-- One-click deployment (no manual steps)               |
|    +-- Shadow mode validation                               |
|    +-- Gradual release mechanism                            |
|    +-- One-click rollback                                   |
|                                                             |
|  [ ] Observability                                          |
|    +-- Deployment event logging                             |
|    +-- Version and performance correlation                  |
|    +-- Auto-triggered rollback                              |
|    +-- Change audit trail                                   |
|                                                             |
+-------------------------------------------------------------+

20.8 Multi-Agent Perspective

Monitor Agent's Role

[Diagram: Monitor Agent Role]

Health Collaboration Between Agents

| Agent | Reports to Monitor | Receives from Monitor |
|---|---|---|
| Signal Agent | Signal generation latency, success rate | Pause/resume |
| Risk Agent | Risk trigger count, status | Forced circuit breaker |
| Execution Agent | Order status, execution quality | Cancel pending orders |
| Regime Agent | Detection latency, confidence | Switch to default state |
| Data Agent | Data update time, quality | Switch data source |

Acceptance Criteria

After completing this lesson, use the following criteria to verify learning:

| Criterion | Standard | Self-Test Method |
|---|---|---|
| Understand monitoring levels | Can describe the four monitoring levels | List key metrics for each level |
| Design log format | Can design structured trading logs | Write a complete trading log JSON |
| Design alert rules | Can distinguish alert levels and channels | Design an alert matrix for your system |
| Understand recovery flow | Can describe failure recovery steps | Draw a recovery flow diagram |
| Master daily checklists | Can list pre/post-market check items | Create your own checklists |

Comprehensive Exercise

Design Your Operations System:

  1. List key metrics you need to monitor (at least 10)
  2. Design log format (include necessary fields)
  3. Design alert rules (metric, threshold, level, channel)
  4. Write pre-market checklist
  5. Design recovery flow for a failure scenario

Lesson Deliverables

After completing this lesson, you will have:

  1. Four-Level Monitoring Framework - Infrastructure/Service/Application/Business
  2. Log Standard Template - Structured trading log format
  3. Alert Design Guidelines - Levels, channels, suppression strategies
  4. Operations Checklists - Pre/post-market check items
  5. CI/CD Deployment Pipeline - Build, test, shadow mode, gradual release

Key Takeaways

  • Understand the four levels of monitoring and their key metrics
  • Master structured log design principles
  • Understand alerting system design essentials
  • Master failure recovery flow and state consistency checks
  • Establish habit of daily operations checks
  • Design CI/CD pipeline for safe model deployment

Next Lesson Preview

Lesson 21: Project Implementation

Theory ends here. In the next lesson, we'll build a complete multi-agent trading system from scratch - integrating all knowledge from the first 20 lessons, from architecture design to code implementation, from backtest validation to production deployment.

Cite this chapter
Zhang, Wayland (2026). Lesson 20: Production Operations. In AI Quantitative Trading: From Zero to One. https://waylandz.com/quant-book-en/Lesson-20-Production-Operations
@incollection{zhang2026quant_Lesson_20_Production_Operations,
  author = {Zhang, Wayland},
  title = {Lesson 20: Production Operations},
  booktitle = {AI Quantitative Trading: From Zero to One},
  year = {2026},
  url = {https://waylandz.com/quant-book-en/Lesson-20-Production-Operations}
}