Lesson 06: The Harsh Reality of Data Engineering
Data problems kill more strategies than model problems.
A Nightmare Weekend
Friday night at 11 PM, Xiao Wang deployed his carefully developed US stock quantitative strategy. Backtested 80% annual return, Sharpe ratio 2.5, everything looked perfect.
Monday morning at 9 AM, he was woken by an alert message: "Strategy error, unable to fetch data."
He got up to check and found the Alpha Vantage API returning 429 Too Many Requests: his program was making 10 requests per minute, exceeding the free account's rate limit of 5 requests per minute.
After fixing it, the strategy finally started running.
Monday afternoon at 2 PM, another alert: "Strategy position anomaly."
This time it was a data problem: The API returned a missing minute bar, causing MACD to calculate NaN, and the program treated NaN as a sell signal, clearing all positions.
After fixing that, he decided to do a comprehensive data check. Then he discovered more problems:
- Some timestamps were UTC, others were ET (Eastern Time)
- Some bars had 0 volume (pre-market/after-hours sessions)
- Historical data and real-time data had different field names (`close` vs `Close`)
For an entire week he didn't write a single line of strategy code - all of his time went into fixing data problems.
This is the harsh reality of data engineering.
6.1 Data Source Selection
Free Data Sources
| Data Source | Market Coverage | Pros | Cons |
|---|---|---|---|
| Yahoo Finance (yfinance) | US stocks, ETFs | Free, long history | Recent stability issues reported (API changes/rate limits/missing data); build robust fallbacks |
| Alpaca Markets | US stocks | Free IEX data, developer-friendly API | Free tier has limited data coverage |
| Binance API | Cryptocurrency | Real-time, free | Only Binance data |
| Alpha Vantage | Stocks, forex, crypto | Free tier available | Free-tier quotas are often very low (limits change; verify on the official site) |
| CCXT | Cryptocurrency | Unified interface | Depends on exchange APIs |
Note: Data vendors may change products, APIs, quotas, pricing, or even discontinue services. Treat data sources as having “delisting risk” and always have backups.
Paid Data Sources (2024-2025 Reference Pricing)
| Data Source | Market Coverage | Monthly Cost | Features |
|---|---|---|---|
| Bloomberg | Global | $2,500+ | Institutional standard, highest quality |
| LSEG Workspace (formerly Refinitiv) | Global | Quote-based pricing | LSEG subsidiary, former Reuters data |
| Polygon.io (now rebranded to Massive) | US stocks | $99+ | Real-time + historical, developer-friendly |
| Nasdaq Data Link (formerly Quandl) | Alternative data | Per dataset pricing | Satellite, consumer data, etc. |
How to Choose?
Personal learning/research:
-> Free data sources + tolerate data issues
Small fund (< $1M AUM):
-> Paid data (basic tier) + build data validation
Institutional (> $10M AUM):
-> Bloomberg/LSEG (formerly Refinitiv) + multi-source cross-validation
Core principle: Data quality determines strategy ceiling. Garbage in, garbage out.
Data Source Evolution Path
| Stage | Recommended Source | Notes |
|---|---|---|
| Learning/Prototyping | Yahoo Finance (yfinance) | Free daily bars, sufficient for exercises |
| Small-Scale Live | Polygon.io / Alpaca | Real-time quotes, developer-friendly APIs |
| Institutional Production | Databento / Exchange Direct Feeds | L1/L2 real-time data, low latency |

This lesson's exercises work perfectly with Yahoo Finance. When you're ready for live trading, you'll need to upgrade to professional-grade data sources; see Lesson 19 for details.
6.2 The Painful Reality of APIs
The Code You Think You'll Write vs Reality
You think:

```python
data = api.get_history("AAPL", days=365)
```

Reality:
Pseudocode reference (replace with a specific SDK such as yfinance / alpaca / ccxt):

```python
# Note: This is pseudocode showing robust data-fetching design patterns.
# api, log, EmptyDataError, RateLimitError need to be replaced with your specific library.
import time
from typing import Optional

import pandas as pd


def get_history_robust(
    symbol: str,
    days: int,
    max_retries: int = 5,
    backoff_base: float = 2.0
) -> Optional[pd.DataFrame]:
    """
    Robust data fetching function.

    Problems handled:
    - Rate limiting (exponential backoff retry)
    - Empty data (validation + alerting)
    - Connection errors (auto-reconnect)
    - Timeouts (reasonable timeout settings)
    """
    for attempt in range(max_retries):
        try:
            # Set a timeout (replace with your API call)
            data = api.get_history(
                symbol,
                days=days,
                timeout=30
            )

            # Validate data (generic logic)
            if data is None:
                raise EmptyDataError(f"No data for {symbol}")
            if len(data) == 0:
                raise EmptyDataError(f"Empty data for {symbol}")
            if len(data) < days * 0.9:  # Less than 90% of expected data
                log.warning(f"Data incomplete: got {len(data)}, expected ~{days}")

            # Check for anomalies
            if data['close'].isnull().any():
                log.warning("NaN values detected, filling...")
                data['close'] = data['close'].ffill()

            return data

        except RateLimitError:
            wait_time = backoff_base ** attempt
            log.info(f"Rate limited, waiting {wait_time}s...")
            time.sleep(wait_time)
        except ConnectionError:
            log.warning("Connection lost, reconnecting...")
            api.reconnect()
            time.sleep(1)
        except TimeoutError:
            log.warning("Request timeout, retrying...")
        except Exception as e:
            log.error(f"Unexpected error: {e}")
            if attempt == max_retries - 1:
                raise

    return None
```

Common API Problems
| Problem | Symptom | Solution |
|---|---|---|
| Rate Limiting | 429 errors | Exponential backoff + request queue |
| Missing Data | Empty response or partial data | Multi-source backup + missing detection |
| Data Delay | Timestamps lagging | Record delay + adjust strategy |
| Inconsistent Format | Field names change | Unified data adapter layer |
| Unstable Connection | Disconnects, timeouts | Heartbeat detection + auto-reconnect |
Production Note: Canonical Data Model

Different markets and data sources use different symbol formats (`600519.SH`, `AAPL`, `0700.HK`, `BTC/USDT`). Production systems need a unified normalization layer (Normalizer) that converts external formats to an internal standard (e.g., `SYMBOL.EXCHANGE`: `AAPL.NASDAQ`, `0700.HKEX`). This is a prerequisite for multi-market trading. Watch out for common symbol-resolution pitfalls: delisted tickers failing silently, corporate actions changing symbols, multi-part tickers (e.g., `BRK.B`), and more. See Lesson 19 for details.
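A minimal sketch of such a normalizer is shown below. The suffix-to-exchange mapping, the crypto handling, and the default exchange are illustrative assumptions, not a fixed standard:

```python
# Minimal symbol-normalizer sketch. The internal SYMBOL.EXCHANGE target format follows
# the note above; the mapping and crypto handling below are illustrative assumptions.
VENDOR_SUFFIX_TO_EXCHANGE = {
    "SH": "SSE",     # 600519.SH -> 600519.SSE
    "SZ": "SZSE",
    "HK": "HKEX",    # 0700.HK   -> 0700.HKEX
}

def normalize_symbol(raw: str, default_exchange: str = "NASDAQ") -> str:
    """Convert a vendor-specific symbol into the internal SYMBOL.EXCHANGE format."""
    raw = raw.strip().upper()
    if "/" in raw:                       # crypto pairs such as BTC/USDT
        return raw.replace("/", "") + ".BINANCE"
    if "." in raw:
        # Careful: this naive split mishandles multi-part tickers like BRK.B;
        # a real resolver needs explicit per-vendor rules for those.
        body, suffix = raw.rsplit(".", 1)
        exchange = VENDOR_SUFFIX_TO_EXCHANGE.get(suffix, suffix)
        return f"{body}.{exchange}"
    return f"{raw}.{default_exchange}"   # bare US tickers such as AAPL

# normalize_symbol("600519.SH") -> "600519.SSE"
# normalize_symbol("AAPL")      -> "AAPL.NASDAQ"
```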
The Real Cost of Rate Limiting
Suppose you need minute bars for 100 symbols:
| API Limit | Completion Time |
|---|---|
| 1 request/second | 100 seconds |
| 10 requests/second | 10 seconds |
| Unlimited | < 1 second |
High-frequency strategies need high API quotas - this is the core value of paid data sources.
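Besides reacting to 429 responses with backoff (as in the code above), a client-side throttle keeps you under the quota proactively. A minimal sliding-window sketch, using an example quota of 5 requests per minute (match it to your provider's documented limit):

```python
import time
from collections import deque

class RateLimiter:
    """Simple sliding-window throttle: at most max_calls per `period` seconds."""

    def __init__(self, max_calls: int = 5, period: float = 60.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()  # timestamps of recent calls

    def wait(self) -> None:
        while True:
            now = time.monotonic()
            # Drop timestamps that have left the sliding window
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
            if len(self.calls) < self.max_calls:
                break
            # Sleep until the oldest call exits the window
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())

# limiter = RateLimiter(max_calls=5, period=60)
# for symbol in symbols:
#     limiter.wait()
#     data = get_history_robust(symbol, days=365)
```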
6.3 Time Alignment Issues
Timezone Chaos
Different data sources use different timezones:
| Data Source | Default Timezone |
|---|---|
| Binance | UTC |
| Yahoo Finance | Exchange local time |
| China A-shares | UTC+8 |
| US stocks | ET (Eastern Time) |
If not handled:
- 9:30 A-share data + 9:30 US stock data -> actually 12-13 hours apart (depending on US daylight saving time)
- Strategy uses "same moment" data, but they're actually different moments
Solution: Convert everything to UTC, store and compute in UTC.
```python
import pandas as pd

def normalize_timezone(df: pd.DataFrame, source_tz: str = 'UTC') -> pd.DataFrame:
    """Standardize the index to UTC."""
    if df.index.tz is None:
        df.index = df.index.tz_localize(source_tz)
    df.index = df.index.tz_convert('UTC')
    return df
```
Tick to Candlestick Aggregation
When aggregating tick data to candlesticks, note:
| Issue | Explanation |
|---|---|
| Open price | First trade price in that period |
| Close price | Last trade price in that period |
| High price | Highest trade price in that period |
| Low price | Lowest trade price in that period |
| Volume | Sum of all trades in that period |
Trap: What if there's no trade in a given minute?
Option 1: Fill OHLC with previous minute's close price
Option 2: Mark as missing, handle later
Option 3: Exclude that time point from aggregation
Different approaches lead to different indicator calculations. Must standardize handling rules.
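A minimal pandas sketch of the aggregation, assuming a tick DataFrame with a UTC DatetimeIndex and `price`/`size` columns (those column names are assumptions); the handling of empty minutes maps to the options listed above:

```python
import pandas as pd

def ticks_to_bars(ticks: pd.DataFrame, freq: str = "1min") -> pd.DataFrame:
    """Aggregate ticks (price, size columns; UTC DatetimeIndex) into OHLCV bars."""
    bars = ticks["price"].resample(freq).ohlc()           # open / high / low / close
    bars["volume"] = ticks["size"].resample(freq).sum()   # total traded size per bar

    # Minutes with no trades come out as NaN OHLC and zero volume.
    no_trades = bars["close"].isna()
    # Option 1: carry the previous close forward and flatten O/H/L onto it
    bars["close"] = bars["close"].ffill()
    for col in ("open", "high", "low"):
        bars.loc[no_trades, col] = bars.loc[no_trades, "close"]
    # Options 2/3 would instead keep the NaN rows for later handling, or drop them:
    # bars = bars.dropna()
    return bars
```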
Cross-Asset Data Alignment
Stocks, futures, forex have different trading hours:
| Asset | Trading Hours |
|---|---|
| China A-shares | 9:30-11:30, 13:00-15:00 |
| US stocks | 9:30-16:00 ET |
| Forex | ~24 hours on weekdays (closed weekends) |
| Cryptocurrency | 24/7 |
If strategy needs cross-asset signals:
- When A-shares close, US hasn't opened yet
- Use A-share close + US open? There's a time gap
- Need careful "synchronization point" design (one approach is sketched below)
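One common choice of synchronization point is an as-of join: for each bar of the driving asset, attach the most recent already-published bar of the other asset, never a future one. A minimal sketch, assuming both DataFrames have sorted UTC DatetimeIndexes:

```python
import pandas as pd

def align_asof(driver: pd.DataFrame, other: pd.DataFrame,
               max_staleness: str = "12h") -> pd.DataFrame:
    """For each driver timestamp, attach the most recent earlier row of `other`.
    Rows of `other` older than max_staleness are treated as missing (NaN)."""
    return pd.merge_asof(
        driver.sort_index(), other.sort_index(),
        left_index=True, right_index=True,
        direction="backward",                 # never look into the future
        tolerance=pd.Timedelta(max_staleness),
        suffixes=("", "_other"),
    )
```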
6.4 Data Quality Issues
Anomaly Detection and Handling
| Anomaly Type | Detection Method | Handling |
|---|---|---|
| Price jumps | Price change > 20% | Check if real (stock split?) |
| Volume anomaly | 0 or extreme values | Check exchange status |
| Missing values | NaN | Forward fill or delete |
| Duplicate data | Duplicate timestamps | Keep first or last |
```python
def detect_anomalies(df, price_col='close', volume_col='volume'):
    """Detect data anomalies."""
    issues = []

    # Price jumps
    returns = df[price_col].pct_change()
    jumps = returns.abs() > 0.20
    if jumps.any():
        issues.append(f"Price jumps detected: {jumps.sum()} times")

    # Zero volume
    zero_volume = df[volume_col] == 0
    if zero_volume.any():
        issues.append(f"Zero volume: {zero_volume.sum()} bars")

    # Missing values
    nulls = df.isnull().sum()
    if nulls.any():
        issues.append(f"Null values: {nulls.to_dict()}")

    # Duplicate timestamps
    duplicates = df.index.duplicated()
    if duplicates.any():
        issues.append(f"Duplicate timestamps: {duplicates.sum()}")

    return issues
```
Stock Dividend and Split Adjustments
Stock historical prices need "adjustment":
| Event | Actual Change | Unadjusted Data | Adjusted Data |
|---|---|---|---|
| 2-for-1 split | Shares double, price halves | Price cliff before/after | Smooth and continuous |
| Dividend | Ex-dividend deduction | Gap on ex-div date | Historical prices adjusted |
Consequences of not adjusting:
- In backtests, strategy generates false signals on ex-dates
- Return calculations show extreme values
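A minimal back-adjustment sketch, assuming you have per-date split ratios and cash dividends (the series names and layout here are assumptions). In practice most vendors supply an adjusted close, so you typically verify it rather than recompute it:

```python
import pandas as pd

def back_adjust_close(close: pd.Series,
                      splits: pd.Series,
                      dividends: pd.Series) -> pd.Series:
    """Back-adjust close prices so history is continuous across splits/dividends.
    splits: split ratio on its effective date (e.g. 2.0 for a 2-for-1), else 1.0
    dividends: cash dividend per share on the ex-date, else 0.0
    """
    splits = splits.reindex(close.index).fillna(1.0)
    dividends = dividends.reindex(close.index).fillna(0.0)
    prev_close = close.shift(1)
    # Per-event factor applied to all *earlier* prices:
    # divide by the split ratio, scale by (1 - dividend / previous close) on the ex-date.
    factor = (1.0 / splits) * (1.0 - dividends / prev_close).fillna(1.0)
    # Cumulative factor = product of factors of all events strictly after each date
    cum_factor = factor[::-1].cumprod()[::-1].shift(-1).fillna(1.0)
    return close * cum_factor
```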
Futures Roll Handling
Futures contracts expire and need "rolling":
January 2024: Hold IF2401 (January contract)
Before January expiry: Switch to IF2402 (February contract)
Problem: IF2401 closes at 4000, IF2402 closes at 4050
Not real price movement, but different contracts
Solutions:
- Calculate the spread at the roll and back-adjust historical prices (sketched below)
- Or use "continuous main contract" data (pre-processed by data vendor)
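A minimal sketch of the first approach (spread adjustment at the roll), assuming daily close Series for the expiring and the new contract and a single pre-chosen roll date:

```python
import pandas as pd

def splice_contracts(old: pd.Series, new: pd.Series, roll_date) -> pd.Series:
    """Back-adjust the old contract by the roll-date spread, then splice.
    old/new: close prices indexed by date; roll_date must exist in both indexes."""
    spread = new.loc[roll_date] - old.loc[roll_date]   # e.g. 4050 - 4000 = 50
    adjusted_old = old.loc[:roll_date] + spread        # shift old history by the spread
    # Drop the duplicated roll-date row from the old leg, keep it from the new leg
    return pd.concat([adjusted_old.iloc[:-1], new.loc[roll_date:]])

# For the IF2401 -> IF2402 roll above, spread = 50, so all IF2401 history is
# shifted up by 50 before splicing, removing the artificial price gap.
```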
6.5 Survivorship Bias
What is Survivorship Bias?
If you only backtest with "stocks that still exist today":
2010 stock pool: 1000 stocks
Still exist in 2024: 800 stocks
Delisted/bankrupt: 200 stocks
Your backtest only uses 800, ignoring those 200 failed companies
-> Backtest returns are inflated
How Serious is Survivorship Bias?
| Research | Conclusion |
|---|---|
| Academic studies | Annual returns inflated 1-3% |
| Value investing strategies | Bias more severe (cheap stocks more likely to delist) |
| Small-cap strategies | Bias is largest |
How to Avoid?
| Method | Difficulty | Effectiveness |
|---|---|---|
| Use database including delisted stocks | High | Most accurate |
| Use historical index constituents | Medium | Fairly accurate |
| Acknowledge bias in conclusions | Low | At least honest |
Paid data sources typically provide "survivorship-bias-free" datasets - another value of paying.
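As an illustration of the second method above, if you have a point-in-time constituent table (the column names `symbol`, `added`, `removed` are assumptions), the backtest universe on any rebalance date can be rebuilt without survivorship bias:

```python
import pandas as pd

def universe_as_of(members: pd.DataFrame, date: str) -> list:
    """Return symbols that were index members on `date`.
    members columns: symbol, added (date), removed (date, or NaT if still listed)."""
    d = pd.Timestamp(date)
    active = members[(members["added"] <= d) &
                     (members["removed"].isna() | (members["removed"] > d))]
    return active["symbol"].tolist()

# Rebuilding the universe per rebalance date (instead of filtering to today's
# members) keeps delisted names in the backtest.
```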
6.6 Alternative Data Introduction
What is Alternative Data?
Traditional data: Prices, volume, financial statements
Alternative data: Everything else
| Type | Data Source | Application |
|---|---|---|
| Satellite data | Parking lot car counts, oil tank levels | Predict retail earnings, oil inventory |
| Text data | News, social media, earnings calls | Sentiment analysis, event-driven |
| Credit card data | Spending statistics | Predict company revenue |
| Web traffic | Website visits | Predict e-commerce performance |
| GPS data | Phone locations | Foot traffic analysis |
Challenges of Alternative Data
| Challenge | Explanation |
|---|---|
| High noise | Signal/noise ratio far lower than price data |
| Compliance risk | Privacy issues, data source legality |
| Fast alpha decay | Once widely used, advantage disappears |
| High cost | Data itself expensive + processing costs |
6.7 Data Pipeline Design Principles
Three Principles
| Principle | Explanation |
|---|---|
| Immutability | Never modify raw data, generate new files after processing |
| Traceability | Record source, processing time, processing version for each data point |
| Redundant backup | Cross-validate with at least two data sources |
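A minimal sketch of the first two principles (immutability and traceability): every fetch is written to a new timestamped file plus a metadata sidecar, and raw files are never overwritten. The paths, file format, and metadata fields are illustrative choices.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd

def store_raw(df: pd.DataFrame, symbol: str, source: str,
              root: str = "data/raw") -> Path:
    """Write raw data to a new timestamped file with a metadata sidecar.
    Downstream cleaning reads from here and writes its output elsewhere."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(root) / symbol / f"{ts}.parquet"
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path)                                   # immutable raw snapshot
    meta = {
        "symbol": symbol,
        "source": source,                                 # traceability: where it came from
        "fetched_at_utc": ts,
        "rows": len(df),
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
    }
    path.with_suffix(".json").write_text(json.dumps(meta, indent=2))
    return path
```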
Data Pipeline Architecture
Data Collection Layer
|
|-> Raw Data Storage (immutable)
| |
| v
|-> Data Cleaning Layer
| - Anomaly handling
| - Missing value filling
| - Timezone standardization
| |
| v
|-> Feature Engineering Layer
| - Technical indicator calculation
| - Label generation
| |
| v
--> Ready Data Storage -> Strategy Use
Monitoring and Alerting
| Monitor Item | Suggested Threshold | Alert Action |
|---|---|---|
| Data delay | > 1 minute | Warning |
| Missing data rate | > 1% | Warning |
| Anomaly ratio | > 0.1% | Investigate |
| API error rate | > 5% | Pause strategy |
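A minimal sketch that maps observed metrics to the thresholds in the table above; how the metrics are measured and how alerts are delivered (email, IM, pager) is left out:

```python
# Thresholds taken from the monitoring table above.
THRESHOLDS = {
    "data_delay_seconds": (60, "warning"),
    "missing_data_rate": (0.01, "warning"),
    "anomaly_ratio": (0.001, "investigate"),
    "api_error_rate": (0.05, "pause_strategy"),
}

def evaluate_metrics(metrics: dict) -> list:
    """Compare observed metrics against thresholds and return triggered alerts."""
    alerts = []
    for name, value in metrics.items():
        limit, action = THRESHOLDS.get(name, (None, None))
        if limit is not None and value > limit:
            alerts.append((name, value, action))
    return alerts

# evaluate_metrics({"data_delay_seconds": 90, "api_error_rate": 0.02})
# -> [("data_delay_seconds", 90, "warning")]
```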
Code Implementation (Optional)
Data Quality Check Framework
```python
import pandas as pd
from dataclasses import dataclass
from typing import List


@dataclass
class DataQualityReport:
    symbol: str
    start_date: str
    end_date: str
    total_rows: int
    missing_rows: int
    null_values: dict
    anomalies: List[str]
    is_valid: bool


def check_data_quality(df: pd.DataFrame, symbol: str) -> DataQualityReport:
    """Comprehensive data quality check."""
    anomalies = []

    # Check time continuity.
    # Note: this example assumes *daily* bars and uses business days as a rough expectation.
    # For minute/tick data, you must generate expected_index based on exchange trading sessions,
    # holidays, lunch breaks (if any), etc.
    idx = pd.DatetimeIndex(df.index)
    expected_index = pd.bdate_range(start=idx.min().normalize(), end=idx.max().normalize())
    expected_rows = len(expected_index)
    actual_rows = len(df)
    missing_rows = expected_rows - actual_rows
    if missing_rows > expected_rows * 0.05:
        anomalies.append(f"High missing rate: {missing_rows/expected_rows:.1%}")

    # Check NULL values
    null_counts = df.isnull().sum().to_dict()

    # Check price jumps
    if 'close' in df.columns:
        returns = df['close'].pct_change()
        extreme_moves = (returns.abs() > 0.2).sum()
        if extreme_moves > 0:
            anomalies.append(f"Extreme price moves: {extreme_moves}")

    # Check volume
    if 'volume' in df.columns:
        zero_volume = (df['volume'] == 0).sum()
        if zero_volume > 0:
            anomalies.append(f"Zero volume periods: {zero_volume}")

    # Check timestamp duplicates
    duplicates = df.index.duplicated().sum()
    if duplicates > 0:
        anomalies.append(f"Duplicate timestamps: {duplicates}")

    return DataQualityReport(
        symbol=symbol,
        start_date=str(df.index[0]),
        end_date=str(df.index[-1]),
        total_rows=actual_rows,
        missing_rows=missing_rows,
        null_values=null_counts,
        anomalies=anomalies,
        is_valid=len(anomalies) == 0
    )
```
Lesson Deliverables
After completing this lesson, you will have:
- Deep understanding of data problems - Know that data engineering is the main battlefield of quant
- API best practices - Robust data fetching code framework
- Data quality checking ability - Can identify and handle common data issues
- Data pipeline design thinking - Understand production-grade data system architecture
Key Takeaways
- Understand the trade-offs between free and paid data sources
- Master methods for handling API rate limiting, missing data, etc.
- Recognize hidden issues like timezone chaos, survivorship bias
- Understand the value and challenges of alternative data
- Understand data pipeline design principles
Extended Reading
- Background: Exchanges and Order Book Mechanics - Source of tick data
- Background: Cryptocurrency Trading Characteristics - Challenges of 24/7 data
Next Lesson Preview
Lesson 07: Backtest System Pitfalls
Even if your data is fine, backtests can still deceive you. Look-Ahead Bias, overfitting, ignoring trading costs... These traps have caused countless seemingly perfect strategies to fail miserably in live trading. Next lesson reveals the truth about backtesting.