Background: Deep Learning Model Architecture Selection Guide
Choosing the right architecture is half the battle. Different models suit different scenarios; there is no "universal model."
1. Model Architecture Quick Reference
| Model Type | Parameter Scale | Training Time | Inference Latency | Use Cases | Advantages | Disadvantages |
|---|---|---|---|---|---|---|
| LSTM | 1-10M | Medium | <10ms | Short-term price prediction, HFT | Captures temporal dependencies, stable training | Performance degrades on long sequences |
| GRU | 0.5-5M | Faster | <8ms | Resource-constrained, real-time inference | Fewer parameters, faster training | Slightly less expressive than LSTM |
| Transformer | 10-100M | High | 10-50ms | Multi-asset portfolios, long-term trends | Parallel training, long-range dependencies | High data requirements, overfitting risk |
| CNN | 0.5-5M | Fast | <5ms | Technical pattern recognition, pattern matching | Local feature extraction, efficient | Weak temporal modeling |
| CNN-LSTM Hybrid | 5-20M | Medium-High | 10-30ms | Multi-timeframe analysis | Combines local and global features | High complexity, difficult to tune |
2. LSTM/GRU: The Workhorses of Temporal Modeling
2.1 Architecture Principles
LSTM (Long Short-Term Memory) controls information flow through three gating mechanisms:
- Input Gate: Decides what new information to write to memory
- Forget Gate: Decides what old information to discard
- Output Gate: Decides what memory information to output
GRU (Gated Recurrent Unit) is a simplified version of LSTM:
- Combines input and forget gates into a single "update gate"
- Approximately 25% fewer parameters, faster training
- Performs comparably to LSTM on small datasets
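A quick way to check the parameter difference is to instantiate both layer types and count weights. A minimal PyTorch sketch, assuming an input width of 30 features and 128 hidden units (both illustrative):

```python
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    """Count all weights in a module."""
    return sum(p.numel() for p in module.parameters())

# Same input width and hidden size for a fair comparison (values are illustrative).
lstm = nn.LSTM(input_size=30, hidden_size=128, batch_first=True)
gru = nn.GRU(input_size=30, hidden_size=128, batch_first=True)

# GRU uses 3 gate blocks vs. LSTM's 4, so it carries roughly 25% fewer weights.
print(f"LSTM: {n_params(lstm):,}  GRU: {n_params(gru):,}  "
      f"reduction: {1 - n_params(gru) / n_params(lstm):.0%}")
```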
2.2 Typical Architecture Configurations
Single-asset daily strategy:
├── Input layer: 20-60 timesteps x 10-30 features
├── LSTM layer 1: 128 units + Dropout(0.2)
├── LSTM layer 2: 64 units + Dropout(0.2)
├── Dense layer: 32 units + ReLU
└── Output layer: 1 unit (regression) or 3 units (classification: up/down/flat)
High-frequency trading (minute-level):
├── Input layer: 60-120 timesteps x 50-100 features
├── GRU layer: 256 units (speed priority)
├── Dense layer: 64 units
└── Output layer: Discrete actions (buy/sell/hold)
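A minimal PyTorch sketch of the single-asset daily configuration, assuming a 60-step lookback with 30 features and a single regression output (values picked from the ranges above; the class name `DailyLSTM` is hypothetical):

```python
import torch
import torch.nn as nn

class DailyLSTM(nn.Module):
    """Two stacked LSTM layers with dropout, followed by a small dense head."""

    def __init__(self, n_features: int = 30):
        super().__init__()
        self.lstm1 = nn.LSTM(n_features, 128, batch_first=True)
        self.lstm2 = nn.LSTM(128, 64, batch_first=True)
        self.dropout = nn.Dropout(0.2)
        self.head = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, timesteps, n_features)
        out, _ = self.lstm1(x)
        out, _ = self.lstm2(self.dropout(out))
        return self.head(self.dropout(out[:, -1, :]))  # predict from the last timestep

model = DailyLSTM()
prediction = model(torch.randn(8, 60, 30))  # batch of 8 windows, 60 days x 30 features
```

For the classification variant, swap the final `nn.Linear(32, 1)` for `nn.Linear(32, 3)` (up/down/flat) and train with cross-entropy.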
2.3 When to Choose LSTM/GRU?
| Scenario | Recommendation | Rationale |
|---|---|---|
| Data volume <100K samples | LSTM/GRU | Transformers easily overfit on small datasets |
| Sequence length <100 steps | LSTM/GRU | LSTM is sufficient; Transformer advantages are minimal |
| Inference latency <10ms | GRU | Fewer parameters, faster inference |
| Single-asset strategy | LSTM | Captures temporal patterns of individual assets |
2.4 Important Findings
According to the arXiv paper "Vanilla LSTMs Outperform Transformer-based Forecasting":
In financial time series prediction tasks, standard LSTMs often outperform more complex Transformer architectures in scenarios with limited data or shorter sequences.
Reason: Financial data has a low signal-to-noise ratio; complex models tend to learn noise rather than genuine patterns.
3. Transformer: The Choice for Long Sequences and Multi-Asset
3.1 Core Innovations
Self-Attention Mechanism:
- Attends to all positions in the sequence simultaneously
- Captures long-range dependencies
- Supports parallel computation for efficient training
Positional Encoding:
- Preserves temporal order information
- Compensates for the attention mechanism's inherent position-agnostic nature
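As an illustration, the standard sinusoidal encoding from "Attention Is All You Need"; as noted in 3.4, financial applications often adjust or replace it:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix of fixed sin/cos position codes (d_model assumed even)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

# Broadcast-added to a (batch, seq_len, d_model) input embedding before the encoder stack.
```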
3.2 Financial Domain Variants
| Variant | Improvements | Use Cases |
|---|---|---|
| Informer | Sparse attention, reduced computational complexity | Long sequence prediction (>1000 steps) |
| Autoformer | Autocorrelation mechanism captures periodicity | Highly seasonal data |
| StockFormer | End-to-end reinforcement learning | Direct trading decision output |
| Higher-Order Transformer | Higher-order attention, feature interactions | Stock price prediction (+5-10% accuracy) |
3.3 When to Choose Transformer?
| Scenario | Recommendation | Rationale |
|---|---|---|
| Multi-asset portfolio (>50 assets) | Transformer | Simultaneously models inter-asset relationships |
| Long sequences (>200 steps) | Transformer | Strong long-term dependency modeling |
| Data volume > 1M samples | Transformer | Fully utilizes model capacity |
| Macroeconomic forecasting | Transformer | Captures long-term trends |
3.4 Caveats
Transformer Pitfalls:
1. High overfitting risk → Requires strong regularization (Dropout >= 0.3)
2. High data requirements → Underperforms LSTM with insufficient samples
3. High computational cost → GPU training is essential
4. Positional encoding sensitivity → Requires adjustment for financial data
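A minimal sketch of pitfall 1 in practice, using PyTorch's built-in encoder with dropout raised to 0.3 (all sizes are illustrative, not recommendations from a specific paper):

```python
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=64,           # embedding width after the input projection
    nhead=4,              # attention heads
    dim_feedforward=128,  # kept small to limit capacity
    dropout=0.3,          # heavier than the 0.1 default common in NLP
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
# Pair with weight decay, early stopping, and walk-forward validation.
```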
4. CNN: The Pattern Recognition Powerhouse
4.1 Application Approaches
1D CNN: Directly processes price sequences
Input: Past 60 days of OHLCV data (60x5 matrix)
Kernels: Multiple sizes (3, 5, 7 days) extract features at different periods
Pooling: Max pooling or average pooling
Output: Feature vector → Classification/regression head
2D CNN: Processes candlestick chart images
Input: Candlestick chart rendered as image (e.g., 224x224x3)
Architecture: Similar to ResNet or VGG
Purpose: Identifies head-and-shoulders, double bottoms, triangles, and other classic patterns
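A sketch of the 1D-CNN approach with parallel kernel sizes (3, 5, 7) over a (batch, 60, 5) OHLCV window; the channel counts and the `MultiKernelCNN` name are illustrative:

```python
import torch
import torch.nn as nn

class MultiKernelCNN(nn.Module):
    """Parallel 1D convolutions over the time axis, one branch per kernel size."""

    def __init__(self, n_features: int = 5, n_filters: int = 32, n_outputs: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(n_features, n_filters, kernel_size=k, padding="same"),
                nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),  # global max pooling over time
            )
            for k in (3, 5, 7)
        )
        self.head = nn.Linear(3 * n_filters, n_outputs)  # e.g. up / down / flat

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, timesteps, features) -> Conv1d expects (batch, channels, timesteps)
        x = x.transpose(1, 2)
        feats = torch.cat([branch(x).squeeze(-1) for branch in self.branches], dim=1)
        return self.head(feats)

logits = MultiKernelCNN()(torch.randn(8, 60, 5))  # 60 days of OHLCV
```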
4.2 When to Choose CNN?
| Scenario | Recommendation | Rationale |
|---|---|---|
| Technical pattern recognition | CNN | Excels at extracting local spatial features |
| Ultra-low latency requirements | CNN | Fastest inference speed |
| Correlation matrix analysis | 2D CNN | Treats the multi-asset correlation matrix as an image |
4.3 Limitations
CNN Issues in Finance:
1. Ignores temporal order → Needs positional encoding or combination with an RNN
2. Local receptive field → Difficulty capturing long-term dependencies
3. Candlestick chart subjectivity → Different rendering methods affect results
5. Hybrid Architectures: Best of Both Worlds
5.1 CNN-LSTM
Architecture:
Input → CNN (extract local features) → LSTM (model temporal dependencies) → Output
Advantages:
- CNN quickly filters key features
- LSTM captures temporal evolution patterns
- Multi-timeframe fusion
Disadvantages:
- High tuning complexity
- Increased overfitting risk
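A minimal sketch of the CNN → LSTM pipeline described above, with illustrative sizes (a single Conv1d block for local features feeding one LSTM layer):

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Conv1d extracts local patterns; the LSTM models how those patterns evolve over time."""

    def __init__(self, n_features: int = 10, conv_channels: int = 32, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, conv_channels, kernel_size=3, padding="same"),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(conv_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, timesteps, features)
        local = self.conv(x.transpose(1, 2)).transpose(1, 2)  # back to (batch, time, channels)
        out, _ = self.lstm(local)
        return self.head(out[:, -1, :])

prediction = CNNLSTM()(torch.randn(8, 60, 10))
```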
5.2 LSTM-Transformer
Architecture:
Input → LSTM (local temporal) → Transformer (global context) → Output
Use Cases:
- Markets with both short-term momentum and long-term trends
- Strategies requiring regime switch detection
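A comparable sketch of the LSTM → Transformer stacking (illustrative sizes; the LSTM's hidden width doubles as the encoder's d_model, so no separate positional encoding is added here):

```python
import torch
import torch.nn as nn

class LSTMTransformer(nn.Module):
    """The LSTM summarizes local dynamics; a Transformer encoder mixes in global context."""

    def __init__(self, n_features: int = 10, d_model: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, d_model, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dropout=0.3, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)          # (batch, time, d_model); order is already baked in
        ctx = self.encoder(out)
        return self.head(ctx[:, -1, :])

prediction = LSTMTransformer()(torch.randn(8, 120, 10))
```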
5.3 Hybrid Architecture Recommendations
| Data Characteristics | Recommended Architecture |
|---|---|
| Strong short-term + weak long-term dependencies | LSTM-dominant |
| Weak short-term + strong long-term dependencies | Transformer-dominant |
| Both equally important | CNN-LSTM or LSTM-Transformer |
| Uncertain | Start with LSTM, gradually increase complexity |
6. Reinforcement Learning Algorithm Selection
6.1 Core Algorithm Comparison
| Algorithm | Annual Return | Sharpe Ratio | Max Drawdown | Sample Efficiency | Training Stability | Use Cases |
|---|---|---|---|---|---|---|
| DQN | 8-15% | 0.6-1.2 | 15-25% | Medium | Medium (prone to divergence) | HFT, discrete actions |
| PPO | 15-25% | 1.2-1.8 | 10-18% | Higher | High (stable convergence) | Medium/low frequency, continuous actions |
| A3C | 10-18% | 0.8-1.4 | 12-22% | Higher | Low (noticeable oscillation) | Parallel exploration, resource-constrained |
| SAC | 12-20% | 1.0-1.6 | 12-20% | Higher | Medium-High | HFT, encourages exploration |
| DDPG | 8-15% | 0.6-1.2 | 15-25% | Medium | Low | Continuous actions, precise positioning |
6.2 Selection Recommendations
Start with PPO → Best balance between stability and performance
If you need discrete actions (buy/sell/hold) → DQN
If you need continuous actions (position sizing) → PPO or SAC
If you want maximum exploration → SAC
If you have resources for parallelization → A3C
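Following the "start with PPO" advice, a hedged sketch using the stable-baselines3 library (one common choice, assumed here) and a hypothetical Gymnasium-style `TradingEnv` that you would implement yourself:

```python
from stable_baselines3 import PPO

env = TradingEnv()  # hypothetical gymnasium.Env: observations = market features, actions = positions
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=200_000)

obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)  # deterministic policy for backtest / live use
```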
7. Practical Selection Workflow
7.1 Decision Tree
Data volume > 1M samples?
├── Yes → Sequence length > 200 steps?
│   ├── Yes → Transformer
│   └── No → Hybrid (CNN-LSTM / LSTM-Transformer)
└── No → Sequence length < 100 steps?
    ├── Yes → LSTM
    └── No → GRU/LSTM
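The same tree as a small helper function, using this guide's rough thresholds (rules of thumb, not hard cutoffs):

```python
def recommend_architecture(n_samples: int, seq_len: int) -> str:
    """Encode the decision tree above."""
    if n_samples > 1_000_000:
        return "Transformer" if seq_len > 200 else "Hybrid (CNN-LSTM / LSTM-Transformer)"
    return "LSTM" if seq_len < 100 else "GRU/LSTM"

print(recommend_architecture(n_samples=50_000, seq_len=60))  # -> LSTM
```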
7.2 Quick Selection Table
| Your Situation | Recommended Architecture | Rationale |
|---|---|---|
| Beginner, want quick validation | LSTM + PPO | Mature, stable, abundant tutorials |
| Daily single-asset strategy | LSTM | Simple and effective |
| Minute-level HFT strategy | GRU + DQN | Low latency |
| Multi-asset portfolio optimization | Transformer | Captures inter-asset relationships |
| Technical pattern recognition | CNN | Excels at local patterns |
| Uncertain, want stability | LSTM → gradually increase complexity | Avoid premature optimization |
8. Common Misconceptions
Misconception 1: Transformers are always better than LSTMs
Not true. In finance, with limited data and low signal-to-noise ratio, LSTMs are often more robust.
Misconception 2: More complex models are better
The opposite is true. Financial data is noisy; complex models easily overfit. Simple model + good features > Complex model + poor features.
Misconception 3: Copy NLP/CV architecture configurations directly
Financial data has unique properties: non-stationarity, low signal-to-noise ratio, regime changes. Targeted adjustments are necessary.
Misconception 4: Select models based only on backtest metrics
Must also consider: inference latency, deployment complexity, interpretability requirements. In live trading, GRU may be more practical than Transformer.
9. Technical Selection Summary
| Complexity | Data Relationships | Recommended Architecture |
|---|---|---|
| Simple linear | Traditional factors | LightGBM/XGBoost |
| Medium complexity | Short-term temporal | LSTM/GRU |
| Highly nonlinear | Long-term dependencies | Transformer |
| Dynamic decision-making | Sequential decisions | Reinforcement Learning (PPO) |
| Multi-modal data | Text + numerical | LLM + LSTM hybrid |
General Training Strategy Recommendations
- Experience Replay: Breaks temporal correlation, stabilizes training
- Target Network: Delayed updates reduce oscillation
- Gradient Clipping: Prevents gradient explosion
- Model Ensembling: Reduces single-point-of-failure risk
- Rigorous Historical Validation: Walk-Forward testing is essential
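For the last point, a minimal walk-forward splitter sketch (window sizes are illustrative; a purged variant would additionally drop samples adjacent to each boundary):

```python
import numpy as np

def walk_forward_splits(n_samples: int, train_size: int, test_size: int, step: int):
    """Yield (train_idx, test_idx) pairs where every test window follows its training window."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += step

for train_idx, test_idx in walk_forward_splits(2_000, train_size=1_000, test_size=250, step=250):
    pass  # fit on train_idx, evaluate on test_idx, then roll the window forward
```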
10. Further Reading
- Background: Reinforcement Learning in Trading - Detailed RL introduction
- Background: Frontier ML and RL Methods (2025) - Latest technical advances
- Background: Time Series Cross-Validation (Purged CV) - Proper validation methods
- arXiv: "Vanilla LSTMs Outperform Transformer-based Forecasting"
- arXiv: "Higher Order Transformers: Enhancing Stock Movement Prediction"
Core Insight: Model architecture selection is not about chasing the latest and most complex options, but about matching your data scale, latency requirements, and strategy type. Start simple, gradually increase complexity, and validate every decision with Walk-Forward testing.