Lesson 03: Math and Statistics Fundamentals
The essence of quantitative trading is describing markets in mathematical language. If your mathematical assumptions are wrong, the more precise your model, the faster you lose money.
LTCM: The Nobel Laureates' Fatal Assumption
In 1998, Long-Term Capital Management (LTCM) had two Nobel Prize winners in Economics, Wall Street's top mathematicians, and $125 billion in assets.
Their models were exquisite, based on one core assumption: market returns follow a normal distribution.
According to the normal distribution, the probability of their strategy losing more than 1% in a single day was 1 in 100,000. With 252 trading days per year, theoretically this would happen once every 400 years.
Then Russia defaulted on its debt.
On August 21, 1998, LTCM lost $550 million in a single day. Over the next month, it lost $4.6 billion - nearly all of its capital. Under the normal distribution, the probability of this outcome was about 10^-27: an event you would not expect to see even once if you re-ran the bet every second since the birth of the universe.
But it happened.
What went wrong?
- Markets don't follow normal distributions: Real markets have "fat-tailed distributions" - extreme events occur far more frequently than normal distribution predicts
- Correlations can spike suddenly: During normal times, asset correlations are low; in a crisis, correlations across almost all assets jump toward 1
- Volatility clusters: After big drops, more big drops often follow - it's not independent "coin flipping"
LTCM's lesson: The model wasn't imprecise - the mathematical assumptions were fundamentally wrong.
This lesson's goal: Help you understand the true characteristics of financial data, and avoid using "coin flip" math to analyze markets.
3.1 Time Series Fundamentals
Sequence vs IID Assumption
Traditional statistics assumes data is IID (Independent and Identically Distributed):
- Independent: Today's data is unrelated to yesterday's
- Identically Distributed: Each day's data comes from the same distribution
Financial data almost never satisfies the IID assumption:
```python
import numpy as np

# IID data: coin flips - each flip independent, probability constant
coin_flips = np.random.choice([0, 1], size=100)

# Financial data: today's price depends on yesterday's
prices = [100]
for _ in range(99):
    # Today's price = yesterday's price * (1 + random change)
    prices.append(prices[-1] * (1 + np.random.normal(0, 0.02)))
```
Why does this matter?
- If you model with IID assumptions, you'll underestimate the probability of consecutive losses
- Real markets have "momentum": rises tend to continue rising, falls tend to continue falling
Lag and Autocorrelation
Autocorrelation: The correlation between current values and past values.
Paper Calculation Example:
Assume 5 consecutive days of returns: +2%, -1%, +3%, +2%, -2%
| Today (t) | Yesterday (t-1) |
|---|---|
| -1% | +2% |
| +3% | -1% |
| +2% | +3% |
| -2% | +2% |
Observed patterns:
- Yesterday up, today sometimes up sometimes down -> Autocorrelation near 0 (random)
- If yesterday up usually means today up -> Positive autocorrelation (momentum)
- If yesterday up usually means today down -> Negative autocorrelation (mean reversion)
SPY (S&P 500 ETF) daily return autocorrelation is typically near zero (e.g., roughly 0.02 over 2010-2020), indicating that US daily stock returns are close to a random walk and hard to predict over short horizons. Exact values vary with the sample period and methodology.
Interpretation:
- Autocorrelation near 0 -> Past has low predictive value for future (weak-form efficient market)
- Significantly positive autocorrelation -> Momentum effect (trend following may work)
- Significantly negative autocorrelation -> Mean reversion effect
Code Implementation (for engineers)
```python
import pandas as pd

def calculate_autocorrelation(series, lag=1):
    """Calculate the autocorrelation coefficient at the given lag."""
    return series.autocorr(lag=lag)

# Example: SPY (S&P 500 ETF) daily return autocorrelation
# (assumes prices_series is a pandas Series of daily closing prices)
returns = prices_series.pct_change().dropna()
print(f"Lag-1 autocorrelation: {calculate_autocorrelation(returns, 1):.3f}")
print(f"Lag-5 autocorrelation: {calculate_autocorrelation(returns, 5):.3f}")
```
Stationarity Testing
Stationarity: Statistical properties (mean, variance) don't change over time.
```python
from statsmodels.tsa.stattools import adfuller

def check_stationarity(series, name="Series"):
    """ADF test: determine whether a series is stationary."""
    result = adfuller(series.dropna())
    print(f"{name}:")
    print(f"  ADF Statistic: {result[0]:.4f}")
    print(f"  p-value: {result[1]:.4f}")
    print(f"  Conclusion: {'Stationary' if result[1] < 0.05 else 'Non-stationary'}")

# Price series are typically non-stationary
check_stationarity(prices_series, "Price Series")

# Return series are typically stationary
check_stationarity(returns_series, "Returns Series")
```
Why test for stationarity?
- Most statistical models assume stationary input
- Modeling non-stationary series produces "spurious regression"
- Solution: Use returns (differences) instead of prices
3.2 Mathematical Definition of Returns
Simple Returns vs Log Returns
Paper Calculation:
Assume AAPL goes from $100 to $110:
Simple return = (110 - 100) / 100 = 10%
Log return = ln(110/100) = ln(1.1) ~ 9.53%
Why do quants often use log returns? See this example:
Additivity Problem (verify with calculator):
|  | Day 1 | Day 2 | Two-Day Total |
|---|---|---|---|
| Price | $100 -> $110 | $110 -> $100 | $100 -> $100 |
| Simple Return | +10% | -9.09% | 0% |
| Log Return | +9.53% | -9.53% | 0% |

Direct sum of simple returns: 10% + (-9.09%) = 0.91% X (the true two-day return is 0%)
Direct sum of log returns: 9.53% + (-9.53%) = 0% ✓
Conclusion: Log returns can be directly summed; simple returns must be multiplied. When calculating 100-day cumulative returns, log returns need only 100 additions, while simple returns need 100 multiplications.
| Property | Simple Return | Log Return |
|---|---|---|
| Additivity | X Multi-period returns can't be directly summed | O Multi-period returns can be directly summed |
| Symmetry | X Up 10% then down 10% != 0 | O More symmetric |
| Normality Assumption | Harder to satisfy | Closer to normal |
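A quick numpy check of the $100 -> $110 -> $100 example above (a minimal sketch):

```python
import numpy as np

# Day 1: $100 -> $110; Day 2: $110 -> $100
simple_returns = np.array([110 / 100 - 1, 100 / 110 - 1])  # +10.00%, -9.09%
log_returns = np.log([110 / 100, 100 / 110])               # +9.53%, -9.53%

print(f"Sum of simple returns: {simple_returns.sum():+.2%}")              # +0.91% (wrong)
print(f"Sum of log returns:    {log_returns.sum():+.2%}")                 # +0.00% (correct)
print(f"True two-day return:   {(1 + simple_returns).prod() - 1:+.2%}")   # +0.00%
```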
Cumulative Return Calculation
Paper Calculation:
5 consecutive days of daily returns: +1%, -2%, +3%, +1%, -1%
Simple return cumulative (multiply):
(1+0.01) x (1-0.02) x (1+0.03) x (1+0.01) x (1-0.01) - 1
= 1.01 x 0.98 x 1.03 x 1.01 x 0.99 - 1
= 1.0194 - 1 = 1.94%
Log return cumulative (direct sum):
0.01 + (-0.02) + 0.03 + 0.01 + (-0.01) = 2%
The two methods give similar answers. The small gap exists because the direct sum treats the ±1%-scale simple returns as if they were log returns; summing the exact log returns ln(1.01) + ln(0.98) + ln(1.03) + ln(1.01) + ln(0.99) = 1.92% reproduces the compounded result exactly, as the check below confirms.
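A minimal numpy check of the five-day calculation:

```python
import numpy as np

daily = np.array([0.01, -0.02, 0.03, 0.01, -0.01])  # simple daily returns

compounded = np.prod(1 + daily) - 1      # multiply simple returns: ~1.94%
direct_sum = daily.sum()                 # sum as if they were log returns: 2.00%
exact_log_sum = np.log(1 + daily).sum()  # sum of the exact log returns: ~1.92%

print(f"Compounded:            {compounded:.4%}")
print(f"Direct sum (approx.):  {direct_sum:.4%}")
print(f"Back from log returns: {np.expm1(exact_log_sum):.4%}")  # matches compounded
```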
Annualized Returns
Paper Calculation:
Assume you earned 5% over 60 trading days, what's the annualized return?
Annualized Return = (1 + 5%)^(252/60) - 1
= 1.05^4.2 - 1
= 1.227 - 1
= 22.7%
Why 252? US stocks have approximately 252 trading days per year (365 days - weekends - holidays).
Common Mistakes:
- X Directly multiplying monthly return by 12 as annualized (ignoring compounding)
- X Using calendar days instead of trading days (should use 252, not 365)
- X Annualizing before subtracting fees
Code Implementation (for engineers)
```python
def cumulative_return(returns, method='simple'):
    """Calculate cumulative return from a series of per-period returns."""
    if method == 'simple':
        return (1 + returns).prod() - 1  # compound by multiplying
    else:
        return returns.sum()             # log returns: direct sum

def annualize_return(total_return, days, trading_days_per_year=252):
    """Annualize a total return earned over `days` trading days."""
    years = days / trading_days_per_year
    return (1 + total_return) ** (1 / years) - 1

# Example: earned 5% over 60 trading days
ann_return = annualize_return(0.05, 60)
print(f"Annualized return: {ann_return:.2%}")  # ~22.7%
```
3.3 Mathematical Definition of Risk
Variance and Standard Deviation
Paper Calculation:
Assume 5 days of daily returns: +1%, -2%, +3%, 0%, -1%
Step 1: Calculate mean
Mean = (1 - 2 + 3 + 0 - 1) / 5 = 0.2%
Step 2: Calculate variance (average of squared deviations from the mean - here we divide by n = 5; sample statistics such as pandas' .std() divide by n - 1)
Variance = [(1-0.2)^2 + (-2-0.2)^2 + (3-0.2)^2 + (0-0.2)^2 + (-1-0.2)^2] / 5
= [0.64 + 4.84 + 7.84 + 0.04 + 1.44] / 5
= 14.8 / 5 = 2.96 (%^2)
Step 3: Calculate standard deviation (volatility)
Daily volatility = sqrt(2.96) ~ 1.72%
Step 4: Annualize
Annualized volatility = Daily volatility x sqrt(252) ~ 1.72% x 15.87 ~ 27.3%
Intuitive meaning of volatility:
| Annualized Volatility | Typical Asset | Possible 1-Year Range (68% probability) |
|---|---|---|
| 15% | Large-cap stocks (SPY) | -15% to +15% |
| 30% | Tech stocks (TSLA) | -30% to +30% |
| 80% | Cryptocurrency (BTC) | -80% to +80% |
Remember this formula: Annualized volatility = Daily volatility x sqrt(252) ~ Daily volatility x 16
Code Implementation (for engineers)
```python
def calculate_volatility(returns, annualize=True, trading_days=252):
    """Calculate volatility (standard deviation of returns)."""
    daily_vol = returns.std()  # note: pandas divides by n - 1; the hand calc above divides by n
    if annualize:
        return daily_vol * np.sqrt(trading_days)
    return daily_vol

# Example
daily_returns = pd.Series([0.01, -0.02, 0.03, 0, -0.01])
annual_vol = calculate_volatility(daily_returns)
print(f"Annualized volatility: {annual_vol:.2%}")
```
Covariance and Correlation
Intuitive Understanding:
Correlation coefficient measures whether two assets "move together":
| Correlation | Meaning | Example |
|---|---|---|
| +1 | Perfectly synchronized: A up means B up | Same-sector stocks (AAPL vs MSFT) |
| 0 | Unrelated: A's movement unrelated to B | Gold vs tech stocks |
| -1 | Perfectly opposite: A up means B down | Stocks vs VIX (fear index) |
Actual market correlations (reference values):
| Asset Pair | Normal Period Correlation | Crisis Period Correlation |
|---|---|---|
| AAPL vs MSFT | +0.7 | +0.9 |
| SPY vs TLT (bonds) | -0.3 | -0.5 or +0.8 |
| SPY vs GLD (gold) | +0.1 | -0.2 |
"Correlation spike" during crises: Assets uncorrelated during normal times can suddenly spike to 0.9 correlation during crises. This is what tripped up LTCM - they assumed stable correlations.
Multi-Agent Perspective: Portfolio Agent's core job is finding low-correlation assets to build portfolios, but Risk Agent must consider the risk of "correlation spikes during crises."
Code Implementation (for engineers)
```python
def analyze_correlation(returns_a, returns_b):
    """Analyze the correlation between two assets' return series."""
    corr = returns_a.corr(returns_b)
    print(f"Correlation coefficient: {corr:.3f}")
    if corr > 0.7:
        print("-> Highly positively correlated: move together, poor diversification")
    elif corr < -0.3:
        print("-> Negatively correlated: natural hedge, good for a portfolio")
    else:
        print("-> Low correlation: valuable for risk diversification")
```
The Danger of Distribution Assumptions
print("-> Low correlation: Valuable for risk diversification")The Danger of Distribution Assumptions
The normal distribution assumption will kill you. This is the core lesson from LTCM.
Paper Calculation:
Assume SPY has 20% annualized volatility, daily volatility is about 1.26% (= 20% / sqrt(252)).
Under normal distribution:
- Probability of a daily drop > 1 sigma (1.26%) = 16% (routine)
- Probability of a daily drop > 2 sigma (2.52%) = 2.3% (about once every two months)
- Probability of a daily drop > 3 sigma (3.78%) = 0.13% (about once every 3 years)
- Probability of a daily drop > 4 sigma (5.04%) = 0.003% (about once every 125 years)
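These normal-curve tail probabilities can be reproduced with scipy (a minimal sketch):

```python
from scipy.stats import norm

for k in [2, 3, 4, 9.5]:
    p = norm.sf(k)       # one-sided tail probability P(Z > k sigma)
    years = 1 / p / 252  # expected years between such daily events
    print(f"> {k} sigma: p = {p:.3e}, expected roughly once every {years:.3g} years")
```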
But what actually happened?
| Date | Event | SPY Single-Day Drop | "How many sigma" under the normal distribution | Theoretical frequency (approx.) |
|---|---|---|---|---|
| 2020-03-16 | COVID crash | -12.0% | 9.5 sigma | ~10^18 years |
| 2008-10-15 | Financial crisis | -9.0% | 7.1 sigma | ~10^10 years |
| 2011-08-08 | US downgrade | -6.7% | 5.3 sigma | ~70,000 years |
| 2018-02-05 | Volatility crash | -4.1% | 3.3 sigma | ~8 years |
Conclusion: Events the normal distribution rates as once-in-10^18-years occurrences have actually happened several times within two decades.
This is the power of "fat-tailed distributions" - extreme events occur far more frequently than normal distribution predicts. If your risk model assumes normal distribution, you will inevitably blow up during some "black swan" event.
3.4 Special Characteristics of Financial Time Series
Fat Tails
Two key metrics:
1. Skewness: is the distribution symmetric?
   - Skewness = 0 -> symmetric (ups and downs of similar magnitude)
   - Skewness < 0 -> left-skewed (more severe crashes than rallies)
   - Stock market skewness is typically -0.5 to -1, indicating crash risk exceeds rally opportunity
2. Kurtosis: how "fat" are the tails?
   - Kurtosis = 3 -> normal distribution
   - Kurtosis > 3 -> fat tails (more extreme events than the normal distribution predicts)
   - Stock market kurtosis is typically 5-10, far exceeding the normal distribution's 3
| Metric | Normal Distribution | SPY Actual | Meaning |
|---|---|---|---|
| Skewness | 0 | -0.7 | Crashes are more severe |
| Kurtosis | 3 | 8 | Frequent extreme events |
Impact of fat tails on strategies:
- Stop-losses must consider gap risk (overnight crashes may jump past your stop price)
- VaR models will severely underestimate risk
- Need to use Expected Shortfall (CVaR) instead of VaR, as in the sketch below
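A minimal sketch of the historical versions of both measures (the 95% confidence level is an illustrative assumption):

```python
import numpy as np

def var_cvar(returns: np.ndarray, level: float = 0.95):
    """Historical VaR and CVaR (Expected Shortfall) at the given confidence level."""
    losses = -np.asarray(returns)        # express losses as positive numbers
    var = np.quantile(losses, level)     # loss threshold exceeded (1 - level) of days
    cvar = losses[losses >= var].mean()  # average loss on those worst days
    return var, cvar

# Hypothetical usage on a daily-return array `rets`:
# var95, cvar95 = var_cvar(rets)
# print(f"95% VaR: {var95:.2%}, 95% CVaR: {cvar95:.2%}")
```

CVaR answers "how bad is it when it's bad" rather than "how often is it bad," which is why it handles fat tails better than VaR.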
Code Implementation (for engineers)
```python
def analyze_tail_risk(returns):
    """Analyze tail risk via skewness and kurtosis."""
    skew = returns.skew()      # skewness
    kurt = returns.kurtosis()  # excess kurtosis (pandas subtracts 3)
    print(f"Skewness: {skew:.3f} {'(negative skew, beware of crashes)' if skew < -0.5 else ''}")
    print(f"Kurtosis: {kurt + 3:.3f} {'(fat tails!)' if kurt > 0 else ''}")
```
Volatility Clustering
GARCH Effect: Large volatility tends to be followed by large volatility, small volatility tends to be followed by small volatility.
Intuitive Understanding:
Recall the COVID crash in March 2020:
- March 9: SPY down 7.6%
- March 12: SPY down 9.5%
- March 16: SPY down 12.0%
Crashes are not independent "coin flips" - once crashing begins, it often continues. This is volatility clustering.
Volatility autocorrelation is as high as 0.7-0.9, far higher than return autocorrelation (about 0). This means:
- Returns are nearly unpredictable (random walk)
- Volatility is predictable (tomorrow's volatility is likely similar to today's)
Practical Implications:
- Volatility predictability > Return predictability
- Trend Agent should build positions during low volatility, reduce during high volatility
- Risk Agent should dynamically adjust stop-losses based on recent volatility
Code Implementation (for engineers)
```python
def detect_volatility_clustering(returns, window=20):
    """Detect volatility clustering via the autocorrelation of rolling volatility."""
    rolling_vol = returns.rolling(window).std()
    vol_autocorr = rolling_vol.autocorr(lag=1)
    print(f"Volatility lag-1 autocorrelation: {vol_autocorr:.3f}")
```
Non-Stationarity and Regime Shifts
Markets switch between different "regimes," each with different statistical properties:
| Regime | Characteristics | Suitable Strategy |
|---|---|---|
| Bull Market | Low volatility, positive returns, low correlation | Momentum, buy and hold |
| Sideways | Medium volatility, returns near 0 | Mean reversion |
| Crisis | High volatility, negative returns, high correlation | Reduce positions, hedge |
Typical structural break cases (approximate values):
Note: The table below shows typical magnitudes; specific values vary with methodology, window, and data source.
| Time | Event | Volatility Change | Correlation Change |
|---|---|---|---|
| 2008.09 | Lehman collapse | 15% -> 80% | 0.3 -> 0.95 |
| 2020.03 | COVID outbreak | 12% -> 85% | 0.4 -> 0.90 |
| 2022.01 | Fed rate hikes | 15% -> 30% | 0.5 -> 0.70 |
Impact of Regime Shifts on strategies:
- Models trained on old regimes will fail in new regimes
- Correlation spikes mean diversification suddenly fails
- Volatility spikes mean stop-loss positions need recalculation
This is why multi-agent is needed - different regimes need different strategies; one model can't handle it all. Meta Agent's core task is identifying the current regime and dispatching to the appropriate specialist Agent.
Code Implementation (for engineers)
```python
def detect_regime_change(returns, window=60):
    """Simple regime change detection via jumps in rolling volatility."""
    rolling_vol = returns.rolling(window).std()
    vol_change = rolling_vol.diff().abs()
    threshold = vol_change.quantile(0.95)
    regime_changes = vol_change > threshold
    print(f"Detected {regime_changes.sum()} potential regime change points")
    return regime_changes
```
3.5 Common Misconceptions
Misconception 1: Normal distribution adequately describes markets
Wrong. Real markets have fat-tailed distributions; extreme events occur far more frequently than normal distribution predicts. LTCM assumed normal distribution, and "once in a million years" events happened within weeks.
Misconception 2: Lower volatility is always safer
Not entirely. Low volatility can be the calm before the storm. More dangerously, low volatility periods often coincide with increased leverage; once volatility suddenly spikes, losses are amplified.
Misconception 3: Correlations are stable and predictable
Dangerous assumption. Assets with 0.3 correlation during normal times can spike to 0.9 during crises. Diversification may fail exactly when you need it most.
Misconception 4: Historical data can predict the future
Only partially. Markets experience Regime Shifts (structural breaks), and past statistical patterns can suddenly become invalid. Markets in 2008 and 2020 were completely different from before.
3.6 Multi-Agent Perspective
Different Agents can use different statistical assumptions:
| Agent | Statistical Assumption | Applicable Scenario |
|---|---|---|
| Trend Agent | Momentum exists (positive autocorrelation) | Trending market |
| Mean Reversion Agent | Mean reversion (negative autocorrelation) | Sideways market |
| Risk Agent | Fat-tailed distribution + volatility clustering | All scenarios |
| Regime Agent | Non-stationary + structural breaks | Regime identification |
Key Insight:
- Don't let all Agents use the same statistical assumptions
- Risk Agent must use the most conservative assumptions (fat tails, frequent extreme events)
- Meta Agent's responsibility is identifying the current regime and dispatching to the appropriate Agent
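To make the dispatch idea concrete, here is a toy sketch; the thresholds, regime labels, and agent names are illustrative assumptions, not a production design:

```python
import pandas as pd

DISPATCH = {"bull": "Trend Agent", "sideways": "Mean Reversion Agent", "crisis": "Risk Agent"}

def classify_regime(returns: pd.Series, window: int = 60) -> str:
    """Toy regime classifier from rolling annualized volatility and mean return."""
    ann_vol = returns.rolling(window).std().iloc[-1] * 252 ** 0.5
    ann_mean = returns.rolling(window).mean().iloc[-1] * 252
    if ann_vol > 0.30:  # crisis: the volatility regime dominates everything else
        return "crisis"
    return "bull" if ann_mean > 0 else "sideways"

# Hypothetical usage: agent = DISPATCH[classify_regime(daily_returns)]
```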
Lesson Deliverables
After completing this lesson, you will have:
- Correct understanding of financial data characteristics - Know why "coin flip" models don't work in markets
- Return calculation ability - Master correct calculation of log returns and annualized returns
- Intuition for risk measurement - Understand practical meanings of volatility, correlation, fat-tailed distributions
- Vigilance about statistical assumptions - Know how normal distribution assumptions lead to fatal errors
Verification Checklist
| Checkpoint | Verification Standard | Self-Test Method |
|---|---|---|
| Return calculation | Can hand-calculate log returns and annualized returns | Given $100->$110->$99, calculate 2-day log returns and cumulative |
| Volatility calculation | Can hand-calculate standard deviation and annualize | Given 5 daily returns, calculate annualized volatility |
| Fat tail understanding | Can explain why normal distribution assumption is dangerous | Without notes, explain LTCM's statistical failure reason |
| Regime awareness | Can list 3 market regimes and corresponding strategies | Draw regime switching diagram, label each regime's characteristics |
Comprehensive Exercise (complete with calculator):
An intraday strategy:
- 10-day returns: +1%, -0.5%, +2%, -1%, +0.5%, -2%, +1.5%, 0%, -0.5%, +1%
- Calculate: (1) Cumulative return (2) Daily volatility (3) Annualized volatility (4) Is this a high or low volatility strategy?
Answer:
- Cumulative return (simple method) ~ 2.0%
- Daily volatility ~ 1.2%
- Annualized volatility = 1.2% x sqrt(252) ~ 19%
- Annualized volatility of 19% is medium volatility, close to SPY's volatility level
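A minimal numpy check of the answer (using the divide-by-n convention from Section 3.3; .std(ddof=1) would give ~1.23% daily and ~19.5% annualized - both round to the figures above):

```python
import numpy as np

r = np.array([0.01, -0.005, 0.02, -0.01, 0.005, -0.02, 0.015, 0.0, -0.005, 0.01])

cum = np.prod(1 + r) - 1            # (1) cumulative return: ~1.95%
daily_vol = r.std(ddof=0)           # (2) daily volatility:  ~1.17%
ann_vol = daily_vol * np.sqrt(252)  # (3) annualized:        ~18.5%

print(f"Cumulative: {cum:.2%}, daily vol: {daily_vol:.2%}, annualized vol: {ann_vol:.2%}")
```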
Key Takeaways
- Understand financial data doesn't satisfy IID assumption; autocorrelation and non-stationarity exist
- Master the difference between log returns vs simple returns, and annualization methods
- Recognize dangers of normal distribution assumption: fat tails, volatility clustering, extreme events
- Understand impact of Regime Shifts on strategies, and how multi-agent systems respond
Extended Reading
- Background: Alpha and Beta - Mathematical foundation of return decomposition
- Background: Famous Quant Disasters - The cost of ignoring fat tails
- Background: Statistical Traps of Sharpe Ratio - Estimation error, multiple testing, and Deflated Sharpe
Next Lesson Preview
Lesson 04: The Real Role of Technical Indicators
MACD, RSI, Bollinger Bands... Are these indicators actually useful? The answer: they're not "buy/sell signals," but feature engineering. Next lesson we reveal the true nature of technical indicators.