Lesson 09: Supervised Learning in Quantitative Trading

Machine learning is not magic, but a magnifying glass for statistical patterns. If there's no pattern in the data, even the strongest model cannot extract Alpha.


From "Holy Grail" to "Tool"

In 2017, a hedge fund recruited a team of top AI researchers, promising to "revolutionize" quantitative trading with deep learning.

They built a 10-layer LSTM network, trained on 20 years of minute-level data, running on a GPU cluster for three months.

Backtest result: 150% annualized return, Sharpe ratio of 4.0.

The founder excitedly announced: "We've found the Holy Grail!"

Three months after going live:

  • Month 1: +5% (as expected)
  • Month 2: -8% (starting to worry)
  • Month 3: -15% (panic)

Cumulative loss of 18%, while the S&P 500 gained 8% in the same period.

What happened?

  1. Overfitting to historical noise: The 10-layer LSTM had millions of parameters, perfectly "memorizing" historical data, but those patterns were just noise
  2. Prediction does not equal profit: The model's accuracy was 52%, which sounds decent, but after trading costs, it was a net loss
  3. Distribution drift: The market structure in 2017 was already different from the training data (1997-2016)

Lesson: The correct role of machine learning in quantitative trading is not "predicting price movements," but extracting weak but robust signals from noise. This lesson teaches you how to use supervised learning correctly.


9.1 The Quantitative Perspective on Supervised Learning

What is Supervised Learning?

Using data with known answers to "train" a model so it can predict answers for unknown data.

Training Data:
  Input X (Features)          ->  Output Y (Labels)
  [Yesterday's return, Volume, RSI]  ->  [Tomorrow up/down]
  [0.02, 1.5M, 65]            ->  [Up]
  [-0.01, 2.0M, 35]           ->  [Down]

What the model learns:
  "When RSI > 60 and volume increases, probability of rising tomorrow is higher"

Prediction phase:
  New input [0.01, 1.8M, 70]  ->  Model predicts [Up?]

The "Labeling" Dilemma in Quantitative Trading

Traditional machine learning: Labels are clear (cat/dog, spam/not spam)

Quantitative problem: How should labels be defined?

| Labeling Approach | Problem |
|---|---|
| "Tomorrow up = 1, down = 0" | A 0.01% gain and a 5% gain are both "up"? |
| "5-day return > 1%" | During those 5 days, the stock might drop 10% first and then recover |
| "Return itself" | Too much noise; the model struggles to find patterns |
| "Positions with Sharpe > 1" | Requires looking across the entire holding period; look-ahead bias risk |

The correct approach: Labels should reflect executable trading decisions, not abstract prediction targets.
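
For example, a label tied to an executable decision ("enter now, hold for 5 days") can be built as follows. This is a minimal sketch; the horizon and threshold values are illustrative assumptions, not recommendations:

import pandas as pd

def make_label(close: pd.Series, horizon: int = 5, threshold: float = 0.01) -> pd.Series:
    """Label = 1 if the forward `horizon`-day return exceeds `threshold`.

    Encodes an executable decision (enter now, hold `horizon` days),
    not an abstract prediction target.
    """
    forward_ret = close.pct_change(horizon).shift(-horizon)  # return from T to T+horizon
    return (forward_ret > threshold).astype(int)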

Common Misconceptions

| Misconception | Reality |
|---|---|
| "Higher accuracy is better" | 52% accuracy with a 3:1 win/loss ratio is far better than 70% accuracy with a 1:1 ratio |
| "Complex models are stronger" | Financial data has a low signal-to-noise ratio; simple models are often more robust |
| "More features are better" | Too many features lead to the curse of dimensionality and overfitting |
| "Deep learning is omnipotent" | Deep learning requires massive data, which quantitative trading usually lacks |

9.2 Feature Engineering: The Core Battleground of Quantitative Trading

80% of Alpha comes from feature engineering, not model selection.

Types of Features

| Feature Type | Examples | Information Source |
|---|---|---|
| Price Features | Returns, volatility, momentum | OHLCV |
| Technical Indicators | RSI, MACD, Bollinger Bands | Price-derived |
| Statistical Features | Skewness, kurtosis, autocorrelation | Distribution characteristics |
| Cross-Asset Features | Sector momentum, market sentiment | Related assets |
| Alternative Data | Satellite imagery, social media | External data |

Feature Construction Example

Suppose we have daily data for AAPL; we construct the following features:

Base Data (5 days):
Date      Open    High    Low     Close   Volume
Day 1    $180   $182   $178   $181   10M
Day 2    $181   $185   $180   $184   12M
Day 3    $184   $186   $183   $183   11M
Day 4    $183   $184   $180   $181   15M
Day 5    $181   $183   $179   $182   13M

Feature Calculation (after Day 5 close):

1. Momentum Features
   5-day return = (182 - 181) / 181 = 0.55%
   3-day return = (182 - 184) / 184 = -1.09%

2. Volatility Features
   Daily returns over the 5 days: [1.66%, -0.54%, -1.09%, 0.55%]
   Daily volatility = sample std = 1.22%
   Annualized volatility = 1.22% x sqrt(252) = 19.4%

3. Volume Features
   5-day average volume = (10+12+11+15+13)/5 = 12.2M
   Today's volume / average = 13/12.2 = 1.07 (slightly above average)

4. Price Position
   5-day high = $186, low = $178
   Current position = (182-178)/(186-178) = 50% (middle position)
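
The same four features in pandas, as a minimal sketch (the column names close, volume, high, low are assumptions about your data layout; window=5 reproduces the walkthrough, and pandas' rolling std is the sample standard deviation):

import pandas as pd

def basic_features(df: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """Compute the four features from the walkthrough above."""
    f = pd.DataFrame(index=df.index)
    f['momentum'] = df['close'].pct_change(window - 1)  # Day 1 -> Day 5 return
    f['volatility'] = df['close'].pct_change().rolling(window - 1).std()  # std of daily returns
    f['volume_ratio'] = df['volume'] / df['volume'].rolling(window).mean()
    lo = df['low'].rolling(window).min()
    hi = df['high'].rolling(window).max()
    f['price_pos'] = (df['close'] - lo) / (hi - lo)
    return f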

Standards for Good Features

| Standard | Verification Method | Consequence if Not Met |
|---|---|---|
| Predictive power | Univariate test: IC > 0.03 | Wasted computational resources |
| Stability | IC varies little across periods | Overfitting to specific periods |
| Low correlation | Correlation with existing features < 0.7 | Information redundancy |
| Interpretability | The logic can be explained clearly | Hard to debug |

Practical Methods for Feature Selection

Method 1: Univariate Screening

Calculate the correlation between each feature and the label (IC, Information Coefficient):

IC = corr(Feature ranking, Return ranking)

Good features: IC mean > 0.03, IC stability (IC mean / IC std, i.e., the IR) > 0.5
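
A screening sketch under these two cutoffs. The 60-day window and the use of non-overlapping windows are illustrative assumptions:

import pandas as pd
from scipy.stats import spearmanr

def screen_features(features: pd.DataFrame,
                    forward_returns: pd.Series,
                    window: int = 60,
                    min_ic: float = 0.03,
                    min_ir: float = 0.5) -> list:
    """Keep features whose windowed rank IC is both strong and stable."""
    keep = []
    for col in features.columns:
        ics = []
        # Non-overlapping windows give roughly independent IC samples
        for start in range(0, len(features) - window + 1, window):
            ic, _ = spearmanr(features[col].iloc[start:start + window],
                              forward_returns.iloc[start:start + window],
                              nan_policy='omit')
            ics.append(ic)
        ics = pd.Series(ics, dtype=float).dropna()
        if len(ics) > 1 and ics.std() > 0 and ics.mean() > min_ic \
                and ics.mean() / ics.std() > min_ir:
            keep.append(col)
    return keep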

Method 2: Importance Pruning

Train a simple model (like Random Forest), examine feature importance:

If top 5 features contribute > 80% importance:
  -> Keep only top 5-10 features
  -> Other features may be noise
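
A sketch of this check; the hyperparameters are illustrative, not tuned:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def top_features_by_importance(X: pd.DataFrame, y: pd.Series, keep: int = 10) -> list:
    """Rank features by Random Forest importance and keep the top `keep`."""
    model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42)
    model.fit(X, y)
    importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
    print(importance.head(keep))  # inspect how concentrated the importance is
    return importance.head(keep).index.tolist()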

Method 3: Recursive Elimination

Gradually remove the least important features, observe model performance:

Starting: 50 features, Sharpe 1.2
Reduced to 30: Sharpe 1.3 (actually improved!)
Reduced to 10: Sharpe 1.4 (continues to improve)
Reduced to 5: Sharpe 1.1 (starts declining)

-> Optimal feature count is around 10
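
A backward-elimination sketch. Here `evaluate` is a hypothetical user-supplied scorer (e.g., a walk-forward Sharpe from your backtester), and the stopping rule is deliberately simple:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def recursive_elimination(X: pd.DataFrame, y: pd.Series, evaluate, min_features: int = 5):
    """Drop the least important feature while the evaluation score does not degrade."""
    cols = list(X.columns)
    best_score = evaluate(X[cols], y)
    while len(cols) > min_features:
        model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
        model.fit(X[cols], y)
        importance = pd.Series(model.feature_importances_, index=cols)
        candidate = [c for c in cols if c != importance.idxmin()]
        score = evaluate(X[candidate], y)
        if score < best_score:  # removing the feature hurt performance -> stop
            break
        cols, best_score = candidate, score
    return cols, best_score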

9.3 Common Models and Their Use Cases

Model Comparison

| Model | Pros | Cons | Use Cases |
|---|---|---|---|
| Linear Regression | Simple, interpretable, resists overfitting | Only captures linear relationships | Factor investing, risk models |
| Random Forest | Non-linear, resists overfitting, feature importance | Slow, poor at extrapolation | Classification, feature selection |
| XGBoost/LightGBM | Powerful, fast, handles missing values | Easy to overfit, black box | General classification/regression |
| LSTM | Captures temporal dependencies | Needs large data, slow | Only data-rich scenarios |
| Transformer | Powerful attention mechanism | Even larger data needs, hard to train | Research frontier, rare in production |

Model Failure Conditions

Linear Model Failures:

  • Feature-return relationship is non-linear
  • Strong interaction effects exist (e.g., "low valuation + high momentum")

Tree Model Failures:

  • Need to extrapolate (predict values outside training range)
  • Features change continuously (tree models make step-wise predictions)

Deep Learning Failures:

  • Data size < 100,000 samples (usually overfits)
  • Poor feature quality (garbage in, garbage out)
  • Severe distribution drift (history does not represent future)

Practical Choices in Quantitative Trading

Ask yourself: How much data do I have?

Data < 5,000 samples (e.g., 20 years of daily data)
  -> Linear model, Ridge regression
  -> Features < 20

Data 5,000-50,000 samples (e.g., 1 year of minute data)
  -> Random Forest, XGBoost
  -> Features 20-50

Data > 50,000 samples (e.g., tick data)
  -> Can try LSTM/Transformer
  -> But still need to verify if they outperform simple models

9.4 Special Challenges of Financial Data

Low Signal-to-Noise Ratio

Financial signals are extremely weak:

| Data Type | Signal-to-Noise Ratio | Achievable Predictive Power |
|---|---|---|
| Image recognition | High | 95%+ accuracy |
| Natural language | Medium | 80%+ accuracy |
| Financial prediction | Extremely low | 52-55% accuracy is already top-tier |

Why is financial signal-to-noise so low?

  • Markets are nearly efficient: Obvious patterns are quickly arbitraged away
  • Many participants: the patterns you find, others are using too
  • Noise dominates: 90% of short-term price movement is random fluctuation

Non-Stationary Distribution

Training data and prediction data have different distributions:

Training Set (2015-2019):
  Average volatility 15%
  Trend-dominated

Test Set (2020):
  Volatility spikes to 80%
  Extreme rallies and crashes

-> Model fails in 2020

Coping Methods:

  • Use rolling windows, retrain constantly
  • Feature normalization using rolling statistics (see the sketch after this list)
  • Model different regimes separately
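
A rolling z-score is the workhorse for the second item. A minimal sketch; the 252-day window is an illustrative assumption:

import pandas as pd

def rolling_zscore(feature: pd.Series, window: int = 252) -> pd.Series:
    """Normalize a feature against its own trailing distribution.

    Only past data inside the window is used, so there is no look-ahead,
    and the feature's scale adapts as the regime changes.
    """
    mean = feature.rolling(window).mean()
    std = feature.rolling(window).std()
    return (feature - mean) / std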

Class Imbalance

If using "Big gain (>3%) = 1, otherwise = 0":

Big gain days: ~5%
Normal days: ~95%

The model learns to always predict 0: 95% accuracy, but completely useless

Coping Methods:

  • Adjust class weights (see the sketch after this list)
  • Use AUC instead of accuracy for evaluation
  • Stratified sampling to ensure training set class balance
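
A sketch of the first two items together. X_train, y_train, X_test, y_test are hypothetical placeholders, assumed prepared upstream with a time-aware split:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# X_train, y_train, X_test, y_test: hypothetical, prepared upstream
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=5,
    class_weight='balanced',  # upweight mistakes on the rare "big gain" class
    random_state=42,
)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)  # 0.5 = random guessing, regardless of class ratio
print(f"AUC: {auc:.3f}")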

9.5 Model Evaluation: Not About Accuracy

Quantitative-Specific Evaluation Metrics

| Metric | Calculation | Good Standard |
|---|---|---|
| IC (Information Coefficient) | corr(prediction ranking, actual return ranking) | > 0.03 |
| IR (Information Ratio) | IC mean / IC standard deviation | > 0.5 |
| Long-Short Return | Top-quintile return minus bottom-quintile return | Significantly positive |
| Turnover-Adjusted Return | Return minus trading costs | Still positive |
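
The long-short spread in the table can be computed one rebalance date at a time. A minimal cross-sectional sketch:

import pandas as pd

def long_short_return(predictions: pd.Series, returns: pd.Series, quantiles: int = 5) -> float:
    """Top-quintile mean return minus bottom-quintile mean return
    for a single cross-section (one rebalance date)."""
    buckets = pd.qcut(predictions.rank(method='first'), quantiles, labels=False)
    return returns[buckets == quantiles - 1].mean() - returns[buckets == 0].mean()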

Model Evaluation Example

Suppose the model predicts tomorrow's return ranking for 100 stocks:

Model Predicted Ranking (high->low):
  Stock A: 1 (predicted to rise most)
  Stock B: 2
  ...
  Stock Z: 100 (predicted to fall most)

Actual Return Ranking (high->low):
  Stock A: 5
  Stock B: 10
  ...
  Stock Z: 95

IC = corr([1,2,...,100], [5,10,...,95])
   = 0.85 (predicted ranking highly correlated with actual ranking)

Interpretation:

  • IC = 0.85 means predicted ranking is highly consistent with actual ranking
  • If you go long Top 10 and short Bottom 10, expect significant excess returns
  • But in real markets, IC is usually only 0.02-0.05

Real-World Expectations

Top quantitative funds' IC: 0.03-0.05
Average quantitative strategy IC: 0.01-0.03
Random guessing IC: 0

Don't aim for IC > 0.1; that is almost certainly overfitting

Production-Grade IC Calculation

The simple IC calculation shown earlier is fine for quick analysis, but production systems need robust handling of edge cases. Here is a production-ready implementation:

import numpy as np
from scipy.stats import spearmanr

def calculate_ic(
    signals: np.ndarray,
    returns: np.ndarray,
    method: str = "spearman"
) -> float:
    """Calculate Information Coefficient (IC) - correlation between signals and returns

    Args:
        signals: Prediction signal array
        returns: Actual returns array
        method: "spearman" (recommended) or "pearson"

    Returns:
        IC value, range [-1, 1]
    """
    if len(signals) != len(returns):
        raise ValueError("signals and returns must have same length")
    if len(signals) < 2:
        return 0.0

    # Handle missing values - essential for production
    mask = ~(np.isnan(signals) | np.isnan(returns))
    signals, returns = signals[mask], returns[mask]

    if len(signals) < 2:
        return 0.0

    if method == "spearman":
        ic, _ = spearmanr(signals, returns)
    else:
        ic = np.corrcoef(signals, returns)[0, 1]

    return float(ic) if not np.isnan(ic) else 0.0

Why Spearman is Preferred Over Pearson

| Aspect | Spearman | Pearson |
|---|---|---|
| Outlier sensitivity | Robust (uses ranks) | Highly sensitive; a single outlier can distort it |
| Relationship type | Captures any monotonic (non-linear) relationship | Only captures linear relationships |
| Distribution assumption | None required | Assumes normality |
| Fit for financial data | Better: returns have fat tails | Poor: fat tails violate the normality assumption |

Key insight: Financial returns are notorious for fat tails and outliers. A single extreme day (like a market crash) can completely distort Pearson correlation. Spearman converts values to ranks first, making it immune to outlier magnitude.

Usage Examples

Basic IC Calculation:

# Daily signals and next-day returns
signals = np.array([0.02, -0.01, 0.03, 0.01, -0.02])
returns = np.array([0.015, -0.005, 0.02, 0.008, -0.01])

ic = calculate_ic(signals, returns)
print(f"IC: {ic:.4f}")  # IC: 0.9000 (strong correlation)

Rolling IC (Time Series):

import pandas as pd

def rolling_ic(
    signals: pd.Series,
    returns: pd.Series,
    window: int = 20
) -> pd.Series:
    """Calculate rolling IC over a sliding window"""
    ic_series = pd.Series(index=signals.index, dtype=float)

    for i in range(window, len(signals)):
        sig_window = signals.iloc[i-window:i].values
        ret_window = returns.iloc[i-window:i].values
        ic_series.iloc[i] = calculate_ic(sig_window, ret_window)

    return ic_series

# Example: 20-day rolling IC
rolling_ic_values = rolling_ic(signal_series, return_series, window=20)

Information Ratio (IR) Calculation:

def calculate_ir(ic_series: pd.Series) -> float:
    """Calculate Information Ratio = mean(IC) / std(IC)

    IR > 0.5 indicates a robust signal
    """
    ic_clean = ic_series.dropna()
    if len(ic_clean) < 2 or ic_clean.std() == 0:
        return 0.0
    return ic_clean.mean() / ic_clean.std()

# Example
ir = calculate_ir(rolling_ic_values)
print(f"IR: {ir:.2f}")  # IR > 0.5 is good

Production Notes

  1. Data Alignment: Ensure signals and returns are properly aligned. Signals at time T should correspond to returns from T to T+1 (or your prediction horizon). Off-by-one errors are the most common IC calculation bug (see the alignment sketch after these notes).

  2. Minimum Sample Size: Use at least 30 observations for meaningful IC. With fewer samples, the correlation estimate has high variance and is unreliable.

  3. IC Decay Monitoring: Track IC over time. A declining IC trend indicates:

    • Alpha decay (your signal is becoming crowded)
    • Regime change (market structure shifted)
    • Model drift (features no longer predictive)
# IC decay detection
recent_ic = rolling_ic_values.tail(20).mean()
historical_ic = rolling_ic_values.tail(60).head(40).mean()

if recent_ic < historical_ic * 0.5:
    print("WARNING: IC has decayed by 50%+")
  4. Cross-Sectional vs Time-Series IC:

    • Cross-sectional IC: Correlation across stocks at one point in time (typical for stock selection)
    • Time-series IC: Correlation over time for one asset (typical for timing signals)

    The implementation above works for both; just ensure your data is structured correctly.
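
For note 1, a minimal alignment sketch; prices and signals are hypothetical date-indexed series:

import pandas as pd

# Pair the signal at time T with the return earned from T to T+1
forward_returns = prices.pct_change().shift(-1)  # prices, signals: hypothetical series
aligned = pd.concat({'signal': signals, 'fwd_ret': forward_returns}, axis=1).dropna()
ic = calculate_ic(aligned['signal'].values, aligned['fwd_ret'].values)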


9.6 Multi-Agent Perspective

Model Selection for Signal Agent

In multi-agent systems, supervised learning models are the core of the Signal Agent:

[Figure: Signal Agent Architecture]

What to Do When Models Fail?

Signal Agent needs to detect whether its own model is failing:

| Detection Metric | Threshold | Triggered Action |
|---|---|---|
| Rolling IC | Negative for 20 consecutive days | Reduce signal weight by 50% |
| Prediction distribution anomaly | Skewness > 2 | Pause signal output |
| Model confidence declining | Prediction variance increasing | Notify Meta Agent |

Key Design: Signal Agent is not always correct; it needs a "self-doubt" mechanism.
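
A minimal sketch of that self-doubt mechanism; the thresholds mirror the table above and are illustrative:

import pandas as pd

def self_check(ic_series: pd.Series, predictions: pd.Series) -> str:
    """Return an action based on simple failure signals."""
    if (ic_series.tail(20) < 0).all():
        return "REDUCE_WEIGHT"      # rolling IC negative for 20 consecutive days
    if abs(predictions.skew()) > 2:
        return "PAUSE_OUTPUT"       # prediction distribution anomaly
    if predictions.tail(20).std() > 2 * predictions.std():
        return "NOTIFY_META_AGENT"  # prediction variance rising -> confidence declining
    return "OK"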


Code Implementation (Optional)

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit

def create_features(df: pd.DataFrame, lookback: int = 20) -> pd.DataFrame:
    """Build quantitative features"""
    features = pd.DataFrame(index=df.index)

    # Momentum features
    features['ret_1d'] = df['close'].pct_change(1)
    features['ret_5d'] = df['close'].pct_change(5)
    features['ret_20d'] = df['close'].pct_change(20)

    # Volatility features
    features['vol_20d'] = df['close'].pct_change().rolling(lookback).std()

    # Volume features
    features['vol_ratio'] = df['volume'] / df['volume'].rolling(lookback).mean()

    # Price position
    features['price_pos'] = (
        (df['close'] - df['low'].rolling(lookback).min()) /
        (df['high'].rolling(lookback).max() - df['low'].rolling(lookback).min())
    )

    return features.shift(1)  # Avoid look-ahead bias

def create_label(df: pd.DataFrame, horizon: int = 5, threshold: float = 0.02):
    """Create label: whether future n-day return exceeds threshold"""
    future_ret = df['close'].pct_change(horizon).shift(-horizon)
    label = (future_ret > threshold).astype(int)
    return label

def calculate_ic(predictions: pd.Series, returns: pd.Series) -> float:
    """Calculate Information Coefficient"""
    return predictions.corr(returns, method='spearman')

def walk_forward_train(df: pd.DataFrame, n_splits: int = 5):
    """Walk-Forward training and evaluation"""
    features = create_features(df)
    labels = create_label(df)

    # Align data
    valid_idx = features.dropna().index.intersection(labels.dropna().index)
    X = features.loc[valid_idx]
    y = labels.loc[valid_idx]

    tscv = TimeSeriesSplit(n_splits=n_splits)
    results = []

    for train_idx, test_idx in tscv.split(X):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
        model.fit(X_train, y_train)

        predictions = model.predict_proba(X_test)[:, 1]
        # Calculate forward returns (next 5 days), aligned with prediction horizon
        forward_returns = df.loc[X_test.index, 'close'].pct_change(5).shift(-5)
        ic = calculate_ic(pd.Series(predictions, index=X_test.index),
                         forward_returns)

        results.append({
            'test_start': X_test.index[0],
            'test_end': X_test.index[-1],
            'ic': ic,
            'accuracy': (model.predict(X_test) == y_test).mean()
        })

    return pd.DataFrame(results)

Lesson Deliverables

After completing this lesson, you will have:

  1. Understanding of supervised learning's role in quantitative trading - Know what it can and cannot do
  2. Practical feature engineering skills - Able to construct and screen effective quantitative features
  3. Model selection framework - Choose appropriate models based on data size
  4. Correct evaluation methods - Use IC/IR instead of accuracy to evaluate models

Acceptance Criteria

| Check Item | Acceptance Standard | Self-Test Method |
|---|---|---|
| Feature construction | Can build 5 features from OHLCV data | Given price data, manually calculate feature values |
| Model selection | Can recommend a model based on data size | Given a "3 years of daily data" scenario, state a recommendation |
| IC calculation | Can explain what IC = 0.03 means | Without notes, explain IC's business meaning |
| Failure detection | Can list 3 model failure signals | Design the Signal Agent's self-check logic |

Lesson Summary

  • Understand supervised learning's correct role in quantitative trading: extracting weak signals, not predicting price movements
  • Master core feature engineering methods: construction, screening, evaluation
  • Recognize special challenges of financial data: low signal-to-noise, non-stationarity, class imbalance
  • Learn to use IC/IR instead of accuracy to evaluate models
  • Understand how Signal Agent integrates supervised learning models

Next Lesson Preview

Lesson 10: From Models to Agents

Models only output predictions, but trading requires decisions. How do you package a "prediction model" into an "Agent capable of making decisions"? In the next lesson, we bridge the gap from models to Agents.
