Background: Deep Learning Model Architecture Selection Guide

Choosing the right architecture is half the battle. Different models suit different scenarios - there is no "universal model."


1. Model Architecture Quick Reference

| Model Type | Parameter Scale | Training Time | Inference Latency | Use Cases | Advantages | Disadvantages |
|---|---|---|---|---|---|---|
| LSTM | 1-10M | Medium | <10ms | Short-term price prediction, HFT | Captures temporal dependencies, stable training | Performance degrades on long sequences |
| GRU | 0.5-5M | Faster | <8ms | Resource-constrained, real-time inference | Fewer parameters, faster training | Slightly less expressive than LSTM |
| Transformer | 10-100M | High | 10-50ms | Multi-asset portfolios, long-term trends | Parallel training, long-range dependencies | High data requirements, overfitting risk |
| CNN | 0.5-5M | Fast | <5ms | Technical pattern recognition, pattern matching | Local feature extraction, efficient | Weak temporal modeling |
| CNN-LSTM Hybrid | 5-20M | Medium-High | 10-30ms | Multi-timeframe analysis | Combines local and global features | High complexity, difficult to tune |

2. LSTM/GRU: The Workhorses of Temporal Modeling

2.1 Architecture Principles

LSTM (Long Short-Term Memory) controls information flow through three gating mechanisms:

Input Gate:  Decides what new information to write to memory
Forget Gate: Decides what old information to discard
Output Gate: Decides what memory information to output
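
In standard notation (sigmoid gates $\sigma$, elementwise product $\odot$), the cell update is:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$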

GRU (Gated Recurrent Unit) is a simplified version of LSTM:

  • Combines input and forget gates into a single "update gate"
  • Approximately 25% fewer parameters, faster training
  • Performs comparably to LSTM on small datasets
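
The parameter savings are easy to verify. A minimal sketch using tf.keras (the feature dimension 30 and 128 units are illustrative values, not prescriptions):

```python
from tensorflow import keras

F, U = 30, 128  # feature dimension and hidden units (illustrative)

lstm = keras.Sequential([keras.Input(shape=(None, F)), keras.layers.LSTM(U)])
gru = keras.Sequential([keras.Input(shape=(None, F)), keras.layers.GRU(U)])

print(lstm.count_params())  # 4 gate blocks: 4*(F*U + U*U + U) -> 81,408
print(gru.count_params())   # 3 gate blocks -> 61,440, roughly 25% fewer
```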

2.2 Typical Architecture Configurations

Single-asset daily strategy:
├── Input layer: 20-60 timesteps x 10-30 features
├── LSTM layer 1: 128 units + Dropout(0.2)
├── LSTM layer 2: 64 units + Dropout(0.2)
├── Dense layer: 32 units + ReLU
└── Output layer: 1 unit (regression) or 3 units (classification: up/down/flat)

High-frequency trading (minute-level):
├── Input layer: 60-120 timesteps x 50-100 features
├── GRU layer: 256 units (speed priority)
├── Dense layer: 64 units
└── Output layer: Discrete actions (buy/sell/hold)
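
As a concrete reference, here is a minimal tf.keras sketch of the single-asset daily configuration above; 60 timesteps x 20 features are illustrative picks from the stated ranges:

```python
from tensorflow import keras
from tensorflow.keras import layers

TIMESTEPS, FEATURES = 60, 20  # illustrative picks from the ranges above

model = keras.Sequential([
    keras.Input(shape=(TIMESTEPS, FEATURES)),
    layers.LSTM(128, return_sequences=True),  # LSTM layer 1
    layers.Dropout(0.2),
    layers.LSTM(64),                          # LSTM layer 2
    layers.Dropout(0.2),
    layers.Dense(32, activation="relu"),      # Dense layer
    layers.Dense(1),  # regression; use Dense(3, activation="softmax") for up/down/flat
])
model.compile(optimizer="adam", loss="mse")
```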

2.3 When to Choose LSTM/GRU?

| Scenario | Recommendation | Rationale |
|---|---|---|
| Data volume <100K samples | LSTM/GRU | Transformers easily overfit on small datasets |
| Sequence length <100 steps | LSTM/GRU | LSTM is sufficient; Transformer advantages are minimal |
| Inference latency <10ms | GRU | Fewer parameters, faster inference |
| Single-asset strategy | LSTM | Captures temporal patterns of individual assets |

2.4 Important Findings

According to the arXiv paper "Vanilla LSTMs Outperform Transformer-based Forecasting":

In financial time series prediction tasks, standard LSTMs often outperform more complex Transformer architectures in scenarios with limited data or shorter sequences.

Reason: Financial data has a low signal-to-noise ratio; complex models tend to learn noise rather than genuine patterns.


3. Transformer: The Choice for Long Sequences and Multi-Asset

3.1 Core Innovations

Self-Attention Mechanism:

  • Attends to all positions in the sequence simultaneously
  • Captures long-range dependencies
  • Supports parallel computation for efficient training

Positional Encoding:

  • Preserves temporal order information
  • Compensates for the attention mechanism's inherent position-agnostic nature
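
In the standard formulation from "Attention Is All You Need", scaled dot-product attention and the sinusoidal positional encoding are:

$$
\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
$$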

3.2 Financial Domain Variants

| Variant | Improvements | Use Cases |
|---|---|---|
| Informer | Sparse attention, reduced computational complexity | Long-sequence prediction (>1000 steps) |
| Autoformer | Autocorrelation mechanism captures periodicity | Highly seasonal data |
| StockFormer | End-to-end reinforcement learning | Direct trading decision output |
| Higher-Order Transformer | Higher-order attention, feature interactions | Stock price prediction (+5-10% accuracy) |

3.3 When to Choose Transformer?

| Scenario | Recommendation | Rationale |
|---|---|---|
| Multi-asset portfolio (>50 assets) | Transformer | Simultaneously models inter-asset relationships |
| Long sequences (>200 steps) | Transformer | Strong long-term dependency modeling |
| Data volume >1M samples | Transformer | Fully utilizes model capacity |
| Macroeconomic forecasting | Transformer | Captures long-term trends |

3.4 Caveats

Transformer Pitfalls:
1. High overfitting risk → requires strong regularization (Dropout >= 0.3)
2. High data requirements → underperforms LSTM with insufficient samples
3. High computational cost → GPU training is essential
4. Positional encoding sensitivity → requires adjustment for financial data
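
To make pitfall 1 concrete, here is a minimal tf.keras encoder block wired with the Dropout >= 0.3 suggested above; the head count, key dimension, and feed-forward width are illustrative, not tuned values:

```python
from tensorflow import keras
from tensorflow.keras import layers

def encoder_block(x, num_heads=4, key_dim=32, ff_dim=128, dropout=0.3):
    # Self-attention sub-layer: dropout inside attention, then residual + norm
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim,
                                     dropout=dropout)(x, x)
    x = layers.LayerNormalization()(x + attn)
    # Feed-forward sub-layer, also regularized, then residual + norm
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dropout(dropout)(ff)
    ff = layers.Dense(x.shape[-1])(ff)
    return layers.LayerNormalization()(x + ff)

inputs = keras.Input(shape=(120, 64))  # 120 steps x 64-dim features (illustrative)
model = keras.Model(inputs, encoder_block(inputs))
```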

4. CNN: The Pattern Recognition Powerhouse

4.1 Application Approaches

1D CNN: Directly processes price sequences

Input: Past 60 days of OHLCV data (60x5 matrix)
Kernels: Multiple sizes (3, 5, 7 days) extract features at different periods
Pooling: Max pooling or average pooling
Output: Feature vector → Classification/regression head
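
A sketch of this multi-kernel design with the functional API; the 60x5 OHLCV input follows the description above, while filter counts and the head are illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(60, 5))  # past 60 days of OHLCV

# Parallel branches with kernel sizes 3, 5, 7 capture patterns at different periods
branches = []
for k in (3, 5, 7):
    b = layers.Conv1D(32, kernel_size=k, padding="same", activation="relu")(inputs)
    b = layers.GlobalMaxPooling1D()(b)
    branches.append(b)

features = layers.Concatenate()(branches)  # fused feature vector
outputs = layers.Dense(1)(features)        # regression head (swap for softmax to classify)
model = keras.Model(inputs, outputs)
```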

2D CNN: Processes candlestick chart images

Input: Candlestick chart rendered as image (e.g., 224x224x3)
Architecture: Similar to ResNet or VGG
Purpose: Identifies head-and-shoulders, double bottoms, triangles, and other classic patterns

4.2 When to Choose CNN?

| Scenario | Recommendation | Rationale |
|---|---|---|
| Technical pattern recognition | CNN | Excels at extracting local spatial features |
| Ultra-low latency requirements | CNN | Fastest inference speed |
| Correlation matrix analysis | 2D CNN | Visualizes multi-asset relationships |

4.3 Limitations

CNN Issues in Finance:
1. Ignores temporal order → needs positional encoding or RNN combination
2. Local receptive field → difficulty capturing long-term dependencies
3. Candlestick chart subjectivity → different rendering methods affect results

5. Hybrid Architectures: Best of Both Worlds

5.1 CNN-LSTM

Architecture:
Input → CNN (extract local features) → LSTM (model temporal dependencies) → Output

Advantages:
- CNN quickly filters key features
- LSTM captures temporal evolution patterns
- Multi-timeframe fusion

Disadvantages:
- High tuning complexity
- Increased overfitting risk
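
A minimal sketch of the Input → CNN → LSTM → Output pipeline above (all layer sizes illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(60, 20)),          # 60 steps x 20 features (illustrative)
    layers.Conv1D(32, kernel_size=3, padding="same",
                  activation="relu"),     # CNN: extract local features
    layers.MaxPooling1D(pool_size=2),     # downsample before the recurrent stage
    layers.LSTM(64),                      # LSTM: model temporal dependencies
    layers.Dense(1),                      # output
])
```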

5.2 LSTM-Transformer

Architecture:
Input → LSTM (local temporal) → Transformer (global context) → Output

Use Cases:
- Markets with both short-term momentum and long-term trends
- Strategies requiring regime switch detection

5.3 Hybrid Architecture Recommendations

| Data Characteristics | Recommended Architecture |
|---|---|
| Strong short-term + weak long-term dependencies | LSTM-dominant |
| Weak short-term + strong long-term dependencies | Transformer-dominant |
| Both equally important | CNN-LSTM or LSTM-Transformer |
| Uncertain | Start with LSTM, gradually increase complexity |

6. Reinforcement Learning Algorithm Selection

6.1 Core Algorithm Comparison

| Algorithm | Annual Return | Sharpe Ratio | Max Drawdown | Sample Efficiency | Training Stability | Use Cases |
|---|---|---|---|---|---|---|
| DQN | 8-15% | 0.6-1.2 | 15-25% | High (off-policy replay) | Medium (prone to divergence) | HFT, discrete actions |
| PPO | 15-25% | 1.2-1.8 | 10-18% | Lower (on-policy) | High (stable convergence) | Medium/low frequency, continuous actions |
| A3C | 10-18% | 0.8-1.4 | 12-22% | Lower (on-policy) | Low (noticeable oscillation) | Parallel exploration, resource-constrained |
| SAC | 12-20% | 1.0-1.6 | 12-20% | High (off-policy replay) | Medium-High | HFT, encourages exploration |
| DDPG | 8-15% | 0.6-1.2 | 15-25% | High (off-policy replay) | Low | Continuous actions, precise positioning |

6.2 Selection Recommendations

Start with PPO → Best balance between stability and performance

If you need discrete actions (buy/sell/hold) → DQN
If you need continuous actions (position sizing) → PPO or SAC
If you want maximum exploration → SAC
If you have resources for parallelization → A3C
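
Following the "start with PPO" advice, a minimal sketch using the stable-baselines3 library; "YourTradingEnv-v0" is a hypothetical placeholder for a Gymnasium-compatible trading environment you would register yourself:

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Hypothetical placeholder: substitute your own registered trading environment
env = gym.make("YourTradingEnv-v0")

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)  # learn from simulated market interaction

obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)  # e.g. target position
```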

7. Practical Selection Workflow

7.1 Decision Tree

                    Data volume > 1M?
                    /            \
                  Yes             No
                   |               |
              Sequence >200?    Sequence <100?
             /        \         /        \
           Yes        No       Yes        No
            |          |        |          |
      Transformer   Hybrid   LSTM      GRU/LSTM
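
The same tree, transcribed directly as a function:

```python
def choose_architecture(n_samples: int, seq_len: int) -> str:
    """Direct transcription of the decision tree above."""
    if n_samples > 1_000_000:
        return "Transformer" if seq_len > 200 else "Hybrid"
    return "LSTM" if seq_len < 100 else "GRU/LSTM"

print(choose_architecture(2_000_000, 250))  # Transformer
print(choose_architecture(50_000, 60))      # LSTM
```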

7.2 Quick Selection Table

| Your Situation | Recommended Architecture | Rationale |
|---|---|---|
| Beginner, want quick validation | LSTM + PPO | Mature, stable, abundant tutorials |
| Daily single-asset strategy | LSTM | Simple and effective |
| Minute-level HFT strategy | GRU + DQN | Low latency |
| Multi-asset portfolio optimization | Transformer | Captures inter-asset relationships |
| Technical pattern recognition | CNN | Excels at local patterns |
| Uncertain, want stability | LSTM → gradually increase complexity | Avoid premature optimization |

8. Common Misconceptions

Misconception 1: Transformers are always better than LSTMs

Not true. In finance, with limited data and low signal-to-noise ratio, LSTMs are often more robust.

Misconception 2: More complex models are better

The opposite is true. Financial data is noisy; complex models easily overfit. Simple model + good features > Complex model + poor features.

Misconception 3: NLP/CV architecture configurations can be copied directly

Financial data has unique properties: non-stationarity, low signal-to-noise ratio, regime changes. Targeted adjustments are necessary.

Misconception 4: Models can be selected on backtest metrics alone

You must also consider inference latency, deployment complexity, and interpretability requirements. In live trading, a GRU may be more practical than a Transformer.


9. Technical Selection Summary

| Complexity | Data Relationships | Recommended Architecture |
|---|---|---|
| Simple linear | Traditional factors | LightGBM/XGBoost |
| Medium complexity | Short-term temporal | LSTM/GRU |
| Highly nonlinear | Long-term dependencies | Transformer |
| Dynamic decision-making | Sequential decisions | Reinforcement Learning (PPO) |
| Multi-modal data | Text + numerical | LLM + LSTM hybrid |

General Training Strategy Recommendations

  1. Experience Replay: Breaks temporal correlation, stabilizes training
  2. Target Network: Delayed updates reduce oscillation
  3. Gradient Clipping: Prevents gradient explosion
  4. Model Ensembling: Reduces single-point-of-failure risk
  5. Rigorous Historical Validation: Walk-Forward testing is essential
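
Points 3 and 5 reduce to a few lines in practice. A sketch using tf.keras and scikit-learn; TimeSeriesSplit gives expanding-window splits, which is one common way to implement walk-forward validation:

```python
import numpy as np
from tensorflow import keras
from sklearn.model_selection import TimeSeriesSplit

# Point 3: gradient clipping, here by capping the global gradient norm
optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# Point 5: walk-forward splits, each fold trains on the past, tests on the future
X, y = np.random.randn(1000, 60, 20), np.random.randn(1000)  # dummy data
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit the model on (X_train, y_train), evaluate on (X_test, y_test)
```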

Core Insight: Model architecture selection is not about chasing the latest and most complex options, but about matching your data scale, latency requirements, and strategy type. Start simple, gradually increase complexity, and validate every decision with Walk-Forward testing.
