Background: Alternative Data (NLP and Satellite)

"When everyone is looking at the same price-volume data, Alpha is elsewhere."


What is Alternative Data?

Traditional Data: Price, volume, financial statements - easily accessible to everyone

Alternative Data: Information extracted from non-traditional sources that has predictive value

| Data Type | Source Examples | Prediction Target |
|---|---|---|
| Text Sentiment | News, social media, earnings calls | Short-term price volatility |
| Satellite Imagery | Parking lots, farmland, oil tanks | Revenue forecasting |
| Credit Card Transactions | Consumer payment data | Retail performance |
| Web Traffic | App downloads, website visits | User growth |
| Supply Chain | Shipping tracking, port data | Supply/demand forecasting |

Text Data and NLP

Sentiment Analysis Basics

Convert text into numerical signals:

News Headline: "Apple quarterly revenue hits record high, exceeds analyst expectations"
Sentiment Score: +0.8 (positive)

News Headline: "Tesla faces safety concerns, under regulatory investigation"
Sentiment Score: -0.7 (negative)

Building Sentiment Signals

Simple Method: Dictionary counting

Positive words: {"growth", "exceeds", "record", "breakthrough"...}
Negative words: {"decline", "loss", "investigation", "recall"...}

Sentiment Score = (Positive word count - Negative word count) / Total words
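
The dictionary formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production scorer: the word lists are just the four examples given above, and the tokenizer (lowercase, letters only) is an assumption.

```python
import re

# Illustrative word lists copied from the examples above; a real lexicon is far larger.
POSITIVE_WORDS = {"growth", "exceeds", "record", "breakthrough"}
NEGATIVE_WORDS = {"decline", "loss", "investigation", "recall"}


def dictionary_sentiment(text: str) -> float:
    """Return (positive count - negative count) / total words, or 0.0 for empty text."""
    words = re.findall(r"[a-z']+", text.lower())  # assumed tokenizer: lowercase letter runs
    if not words:
        return 0.0
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    return (positive - negative) / len(words)


apple_score = dictionary_sentiment(
    "Apple quarterly revenue hits record high, exceeds analyst expectations"
)
tesla_score = dictionary_sentiment(
    "Tesla faces safety concerns, under regulatory investigation"
)
```

The raw ratios are small (2 of 9 words for the first headline, 1 of 7 for the second), so scores like the +0.8 and -0.7 shown earlier would need a rescaling step or a model-based scorer.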

Advanced Method: Pre-trained language models

Using BERT/GPT-type models:
1. Input: Full news text
2. Output: Sentiment category (positive/neutral/negative) or continuous score
3. Advantage: Captures context (and, to a degree, sarcasm) better than word counting
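
One way to get the "continuous score" in step 2 from a three-class classifier is to take the positive probability minus the negative probability. The sketch below assumes the model emits raw logits in the order (negative, neutral, positive); that ordering and the example logits are hypothetical, and no specific model or library is implied.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over raw classifier scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentiment_from_logits(logits: list[float]) -> tuple[str, float]:
    """Map (negative, neutral, positive) logits to a label and a score in [-1, 1]."""
    labels = ("negative", "neutral", "positive")  # assumed class order
    probs = softmax(logits)
    label = labels[probs.index(max(probs))]
    score = probs[2] - probs[0]  # P(positive) - P(negative)
    return label, score


label, score = sentiment_from_logits([-1.2, 0.3, 2.1])  # hypothetical model output
```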

Text Data Source Comparison

| Source | Timeliness | Coverage | Noise | Cost |
|---|---|---|---|---|
| News (Reuters, Bloomberg) | Minute-level | Large caps | Low | $$$$ |
| Twitter/X | Second-level | Hot stocks | High | $ |
| Reddit (r/wallstreetbets) | Minute-level | Retail favorites | Very high | Free |
| Earnings Call Transcripts | Quarterly | Full coverage | Low | $$ |
| SEC Filings | Immediate | Full coverage | Low | Free |

Time Decay of Text Signals

Sentiment signal strength after news release:

   Strength
    |
 100| ####
  80| #### ####
  60| #### #### ####
  40| #### #### #### ####
  20| #### #### #### #### ####
    +-----------------------------> Time
        5min  30min  1hr   4hr   1day

Conclusion: Sentiment signals are mainly effective within hours of release
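
A simple way to encode this decay in a signal is to down-weight each sentiment score by its age. The chart does not specify a functional form, so the exponential shape and the 120-minute half-life below are assumptions chosen only for illustration.

```python
def decayed_sentiment(
    score: float,
    minutes_since_release: float,
    half_life_minutes: float = 120.0,  # assumed half-life, not taken from the chart
) -> float:
    """Scale a sentiment score by exponential decay with the given half-life."""
    if minutes_since_release < 0:
        raise ValueError("minutes_since_release must be non-negative")
    weight = 0.5 ** (minutes_since_release / half_life_minutes)
    return score * weight


fresh = decayed_sentiment(0.8, 0)         # just released: full strength
two_hours = decayed_sentiment(0.8, 120)   # one half-life later: half strength
next_day = decayed_sentiment(0.8, 1440)   # one day later: close to zero
```

In practice the half-life would be fitted per data source rather than fixed.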

Satellite Data Applications

Typical Application Scenarios

Retail: Parking lot vehicle counting

Monitoring: Walmart, Target parking lots
Indicator: Vehicle count changes
Prediction: Quarterly same-store sales growth
Lead Time: 1-2 weeks before earnings report

Energy: Oil tank storage monitoring

Monitoring: Global oil storage facilities
Method: Calculate storage from floating roof tank shadows
Prediction: Crude oil inventory changes
Data Frequency: Weekly updates

Agriculture: Crop health monitoring

Monitoring: US Midwest farmland
Indicator: Vegetation index (NDVI)
Prediction: Corn, soybean yields
Impact: Agricultural commodity futures pricing
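
NDVI is computed per pixel from near-infrared (NIR) and red reflectance as (NIR - Red) / (NIR + Red); healthy vegetation reflects strongly in NIR and absorbs red, which pushes the index toward 1. A minimal sketch follows; the reflectance values are made up, and treating a zero-signal pixel as 0.0 is an assumption.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        return 0.0  # assumption: treat a pixel with no signal as neutral
    return (nir - red) / total


def mean_ndvi(pixels: list[tuple[float, float]]) -> float:
    """Average NDVI over (nir, red) reflectance pairs covering one field."""
    values = [ndvi(nir, red) for nir, red in pixels]
    return sum(values) / len(values)


field = [(0.50, 0.08), (0.45, 0.10), (0.30, 0.20)]  # made-up reflectances
field_ndvi = mean_ndvi(field)
```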

Shipping: Port activity tracking

Monitoring: Major global ports
Indicator: Container counts, vessel dwell time
Prediction: Import/export activity, supply chain bottlenecks
Application: Shipping stocks, retail inventory

Satellite Data Processing Pipeline

1. Image Acquisition
   +- Satellite pass frequency: Every 1-7 days
   +- Resolution: 0.3-10 meters
   +- Cloud cover: Requires multi-day averaging

2. Image Processing
   +- Atmospheric correction
   +- Geometric registration
   +- Target detection (parking lot boundaries, tank locations)

3. Feature Extraction
   +- Vehicle counting (object detection models)
   +- Area calculation (pixel analysis)
   +- Time series construction

4. Signal Generation
   +- Compare with historical data
   +- Seasonal adjustment
   +- Standardization (Z-Score)
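
Step 4 can be sketched as follows: compare the latest reading with readings from the same season in earlier years, then express the difference as a Z-score. Using same-period history as the seasonal adjustment is one simple choice assumed here; the vehicle counts are hypothetical.

```python
from statistics import mean, stdev


def seasonal_zscore(current: float, same_period_history: list[float]) -> float:
    """Z-score of the current reading against past readings from the same season."""
    if len(same_period_history) < 2:
        raise ValueError("need at least two historical observations")
    baseline = mean(same_period_history)
    spread = stdev(same_period_history)  # sample standard deviation
    if spread == 0:
        return 0.0  # no historical variation: treat as no signal
    return (current - baseline) / spread


# Hypothetical vehicle counts for the same calendar week in prior years.
history = [1180.0, 1220.0, 1200.0, 1240.0, 1160.0]
signal = seasonal_zscore(1290.0, history)
```

A reading far above its seasonal baseline (here, well over two standard deviations) is the kind of value that would feed the downstream prediction.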

Alpha Decay in Alternative Data

Core Problem: Once data is widely used, Alpha disappears

Alternative Data Lifecycle:

Discovery Phase   | Few institutions use it, Alpha is significant
                  |
Diffusion Phase   | More people access it, Alpha declines
                  |
Maturity Phase    | Becomes mainstream, Alpha -> 0
                  |
                  +-----------------------------> Time

Typical Cycle: 2-5 years

Example: Satellite parking lot data

  • 2015: Few hedge funds using it, significant excess returns
  • 2018: Multiple data vendors offering it, competition intensifies
  • 2022: Already standard, need more refined analysis to extract Alpha

Cost-Benefit Analysis

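| Data Type | Annual Cost | Stock Coverage | Expected IC | Cost-Effectiveness |
|---|---|---|---|---|
| News Sentiment | $50K+ | 500+ | 0.03 | Medium |
| Social Media | $10K | 100+ | 0.02 | Low |
| Satellite Imagery | $100K+ | 50+ | 0.05 | Low |
| Credit Card Transactions | $500K+ | 200+ | 0.08 | Medium |
| Web Traffic | $30K | 100+ | 0.04 | Medium |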
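
IC in the table stands for information coefficient. A common convention, assumed here because the text does not define it, is the rank (Spearman) correlation between signal values and subsequent returns across stocks. A dependency-free sketch with made-up numbers:

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def information_coefficient(signal: list[float], forward_returns: list[float]) -> float:
    """Spearman rank correlation between a signal and next-period returns."""
    if len(signal) != len(forward_returns) or len(signal) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(signal), average_ranks(forward_returns))


# Made-up cross-section: one signal value and one forward return per stock.
ic = information_coefficient([0.1, 0.5, 0.3, -0.2], [0.01, 0.02, 0.04, -0.01])
```

Values such as 0.03 to 0.08 in the table are small in absolute terms, which is why they are usually averaged over many stocks and periods before being trusted.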

Economic Logic:

Assumptions:
- Data cost: $100,000/year
- Covers 50 stocks
- IC improvement: 0.05

Required Capital Scale:
- $1,000,000 per stock position
- Total scale $50,000,000
- Annualized improvement 0.05 x 12% = 0.6%
- Return improvement $300,000

Conclusion: A scale of roughly $50M is needed for the data to pay for itself with margin (the $300,000 improvement is about 3x the $100,000 cost)
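
The arithmetic above can be checked with a short sketch. The 12% factor is taken from the text as given (its derivation is not stated), and the function and parameter names are illustrative only.

```python
def data_economics(
    annual_cost: float,
    n_stocks: int,
    position_size: float,
    ic_improvement: float,
    return_per_unit_ic: float,  # the 12% factor used in the text
) -> dict[str, float]:
    """Compare the expected return improvement from a dataset with its annual cost."""
    capital = n_stocks * position_size
    uplift = ic_improvement * return_per_unit_ic   # annualized return improvement
    gain = capital * uplift                        # dollar improvement per year
    return {
        "capital": capital,
        "uplift": uplift,
        "gain": gain,
        "gain_to_cost": gain / annual_cost,
        "break_even_capital": annual_cost / uplift,  # capital where gain equals cost
    }


result = data_economics(
    annual_cost=100_000,
    n_stocks=50,
    position_size=1_000_000,
    ic_improvement=0.05,
    return_per_unit_ic=0.12,
)
```

With these inputs the gain is $300,000 on $50M of capital, and `break_even_capital` comes out near $16.7M, the point at which the improvement just equals the data cost before any other expenses.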

Build vs Buy

| Dimension | Build | Buy |
|---|---|---|
| Cost Structure | High fixed cost, low marginal cost | Pay per data volume |
| Time Investment | 6-12 months | Plug and play |
| Uniqueness | May have unique Alpha | Same as others |
| Maintenance | Requires ongoing investment | Vendor responsibility |
| Suitable Scale | Large institutions | Small/medium funds |

Small Team Recommendations:

  • Start with free data (SEC filings, Reddit, Twitter)
  • Buy paid data after validating signals work
  • Focus on differentiation in data processing, not data acquisition

Multi-Agent Perspective

The role of alternative data in multi-agent architecture:

[Figure: Alternative Data Multi-Agent Pipeline]

Common Misconceptions

Misconception 1: Alternative data always has Alpha

Not necessarily. Many alternative datasets:

  • React at the same time as price (no lead)
  • Are too noisy to extract signals from
  • Have too few samples to establish statistical significance

Misconception 2: LLMs can easily extract sentiment

Be cautious. LLM challenges:

  • Financial domain terminology understanding
  • Sarcasm and pun recognition
  • Consistency and reproducibility
  • Inference cost

Misconception 3: Satellite data is very accurate

Reality is more complex:

  • Cloud cover causes missing data
  • Vehicle detection has errors (10-20%)
  • Seasonal and special events need adjustment
  • Different parking lot layouts affect detection

Practical Recommendations

1. Start with Free Data

Recommended Starting Sources:
- SEC EDGAR (financial statements, 8-K filings)
- Twitter API (requires developer account)
- Reddit API
- Free news APIs

2. Focus on Signal Uniqueness

Ask Yourself:
- Is this signal correlated with price-volume signals?
- How many people are already using this data?
- What's unique about my processing method?

3. Beware of Data Snooping

Validation Process:
1. Discover signal in-sample
2. Test out-of-sample (must be unseen data)
3. Calculate p-value after multiple testing correction
4. Understand the economic logic behind the signal
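
Step 3 can be sketched with two standard corrections: Bonferroni, which multiplies each p-value by the number of tests, and Holm, a step-down variant that is never more conservative. The choice of these two methods and the example p-values are assumptions for illustration; the text does not name a specific correction.

```python
def bonferroni(p_values: list[float]) -> list[float]:
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[index])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[index] = running_max
    return adjusted


# Hypothetical p-values from testing three candidate signals out-of-sample.
raw = [0.01, 0.04, 0.03]
bonferroni_adjusted = bonferroni(raw)
holm_adjusted = holm(raw)
```

A signal that looks significant at 0.04 on its own no longer passes a 0.05 threshold after either correction, which is the point of counting every signal that was tried.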

Summary

| Key Point | Explanation |
|---|---|
| Core Value | Find differentiated information beyond price-volume data |
| Main Types | Text sentiment, satellite imagery, transaction data, web traffic |
| Key Challenges | High cost, fast Alpha decay, high noise |
| Suitable Scale | $50M+ to cover data costs |
| Starting Recommendation | Free data + unique processing methods |

Cite this chapter

Zhang, Wayland (2026). Background: Alternative Data (NLP and Satellite). In AI Quantitative Trading: From Zero to One. https://waylandz.com/quant-book-en/Alternative-Data-NLP-and-Satellite
@incollection{zhang2026quant_Alternative_Data_NLP_and_Satellite,
  author = {Zhang, Wayland},
  title = {Background: Alternative Data (NLP and Satellite)},
  booktitle = {AI Quantitative Trading: From Zero to One},
  year = {2026},
  url = {https://waylandz.com/quant-book-en/Alternative-Data-NLP-and-Satellite}
}