One-sentence summary: Residual connections let information travel along a bypass lane directly to later layers, mitigating vanishing gradients in deep networks; Dropout randomly zeroes activations during training to prevent overfitting. These two techniques are the secret to stable Transformer training.


13.1 Revisiting the Transformer Block

Before diving into residual connections and Dropout, let us look at the complete Transformer block structure:

Input X
    ↓
Layer Norm
    ↓
Masked Multi-Head Attention
    ↓
Dropout
    ↓
Residual connection (+ X)    ← first residual
    ↓
Layer Norm
    ↓
Feed Forward Network (FFN)
    ↓
Dropout
    ↓
Residual connection (+ previous output)    ← second residual
    ↓
Output

Each block has two residual connections and two Dropout layers. This chapter explains what they do and why they exist.


13.2 Residual Connections: Information's Bypass Lane

13.2.1 The Problem with Deep Networks

Residual connection: input bypasses the sublayer and is added back to the output

As neural networks get deeper, a serious problem emerges: vanishing gradients.

Imagine information flowing from layer 1 through to layer 12:

Layer 1 → Layer 2 → Layer 3 → ... → Layer 12

Each layer processes the signal. After 12 layers:

  • The original signal may be severely distorted
  • Gradients shrink at each layer during backpropagation
  • Layers near the input receive near-zero gradient updates and essentially learn nothing

This was a well-known problem in deep learning before residual connections.

13.2.2 The Fix: A Bypass Lane

The residual connection idea is simple: let the input skip the layer and be added directly to the output.

Input X ──────────────────────┐
   ↓                          │ (bypass lane)
Sublayer                      │
   ↓                          │
Output ←──────────────────────┘ + X

The formula:

output = sublayer(X) + X

Instead of only outputting sublayer(X), we add the original input back.

13.2.3 Numeric Example

The values below come from a trained model run. Here is what the computation looks like:

Attention output (after Dropout):

[4, 16, 512] tensor
First values: -0.07005,  0.09600,  0.03522, ...

Original input X:

[4, 16, 512] tensor
First values:  0.50748, -1.96800,  5.14941, ...

After residual connection:

output = Attention_output + X
       = [-0.07005 + 0.50748,  0.09600 + (-1.96800), ...]
       = [ 0.43743,           -1.87200,              ...]

It is element-wise addition. Nothing exotic.
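As a sanity check, the addition above can be reproduced with the three example values (the first three of 512 dimensions):

```python
import torch

# The three example values from above (first three of 512 dimensions)
attn_out = torch.tensor([-0.07005, 0.09600, 0.03522])
x        = torch.tensor([ 0.50748, -1.96800, 5.14941])

# Residual connection: plain element-wise addition
out = attn_out + x   # elementwise: 0.43743, -1.87200, 5.18463
print(out)
```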

13.2.4 Why Residual Connections Work

1. Gradient flow

During backpropagation, the gradient through a residual connection is:

$$\frac{\partial \mathcal{L}}{\partial X} = \frac{\partial \mathcal{L}}{\partial \text{output}} \cdot \left(\frac{\partial\,\text{sublayer}(X)}{\partial X} + 1\right)$$

Even if $\frac{\partial\,\text{sublayer}(X)}{\partial X}$ is close to zero (the vanishing-gradient case), the $+1$ term ensures gradient still flows. The bypass lane carries the gradient directly.

2. Identity mapping as a fallback

If a layer does not yet know what to learn, it can default to outputting near-zero:

sublayer(X) ≈ 0
output ≈ 0 + X = X

This effectively makes the layer a no-op. The information passes through unchanged. This is much easier to achieve than learning a perfect identity transformation from scratch.
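Both properties can be verified with a tiny autograd sketch: even a "dead" sublayer whose output is constant zero still lets a gradient of 1 reach the input through the bypass lane.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# A "dead" sublayer: constant zero output, so its local gradient is 0
sublayer_out = torch.zeros_like(x)

out = sublayer_out + x     # residual connection: output = sublayer(X) + X
out.sum().backward()

print(x.grad)              # [1., 1., 1.] -- the +1 path alone carries gradient
```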

3. Information preservation

Original information always survives. No matter how many layers exist, the input signal is never completely overwritten — it is always being added back.

13.2.5 Where Residual Connections Sit in the Transformer

First residual connection: after Attention

X → LayerNorm → Attention → Dropout → (+X) → output1

Second residual connection: after FFN

output1 → LayerNorm → FFN → Dropout → (+output1) → output2

13.2.6 PyTorch Implementation

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # First residual connection
        attn_output = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_output)   # residual here

        # Second residual connection
        ffn_output = self.ffn(self.norm2(x))
        x = x + self.dropout(ffn_output)    # residual here

        return x

The two key lines are x = x + .... Everything else is the sublayer computation.
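The block above depends on the MultiHeadAttention and FeedForward modules built in earlier chapters. For a quick end-to-end check, here is a runnable sketch with minimal stand-ins (the stand-in modules are illustrative assumptions, not the book's exact implementations):

```python
import torch
import torch.nn as nn

# Minimal stand-ins for the modules from earlier chapters (illustrative only)
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x, mask=None):
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

# Same structure as the TransformerBlock above
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        x = x + self.dropout(self.attention(self.norm1(x), mask))  # residual 1
        x = x + self.dropout(self.ffn(self.norm2(x)))              # residual 2
        return x

block = TransformerBlock(d_model=512, num_heads=8, d_ff=2048)
out = block(torch.randn(4, 16, 512))
print(out.shape)   # residual addition requires input and output shapes to match
```

Note that the residual additions force every sublayer to preserve the [batch, seq_len, d_model] shape.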


13.3 Dropout: Random Removal Prevents Overfitting

13.3.1 The Overfitting Problem

Dropout randomly zeroes activations during training

Neural networks have a tendency to overfit: they memorize training data rather than learning generalizable patterns.

Think of a model like a student who memorizes every practice problem without understanding the underlying concepts. The student scores 100% on practice tests, then fails on any unfamiliar exam question.

Overfitting is the model's version of that.

13.3.2 Dropout: Random Deactivation

Dropout's idea is surprising in its simplicity: randomly disable some neurons during training.

During each forward pass, Dropout creates a random binary mask. Some activations are zeroed out. Others pass through normally.

Normal:  [0.5, 0.3, 0.8, 0.2, 0.6]    ← all neurons active
Dropout: [0.5, 0.3, 0.0, 0.2, 0.6]    ← 0.8 is zeroed out

Which neurons get dropped? Different ones each time, randomly.

13.3.3 The Intuition

Think of a software team:

Without Dropout: one engineer is brilliant and ends up doing everything. When that engineer leaves, the team collapses.

With Dropout: every day, some team members are randomly "on leave." Everyone must learn to cover multiple responsibilities. The team becomes more robust because no single person is a single point of failure.

In neural networks:

  • Dropout forces each neuron to function without relying on any specific partner
  • Each neuron learns to be useful independently
  • The network becomes more resilient — more distributed representations

13.3.4 The Math

During training:

mask   = random binary tensor, 1 with probability (1 - dropout_rate)
output = input * mask / (1 - dropout_rate)

Example with dropout_rate = 0.1 (dropping 10%):

input  = [0.5, 0.3, 0.8, 0.2, 0.6]
mask   = [1,   1,   0,   1,   1  ]    # 0.8 gets dropped
output = [0.5, 0.3, 0.0, 0.2, 0.6] / 0.9
       = [0.56, 0.33, 0.00, 0.22, 0.67]

During inference:

output = input    # no dropout, pass everything through

13.3.5 Why the Rescaling?

The division by (1 - dropout_rate) keeps the expected value consistent between training and inference.

If 10% of neurons are dropped during training, the remaining 90% are rescaled by 1/0.9 ≈ 1.11. During inference, all 100% of neurons are active, so no scaling is needed. The expected output magnitude matches in both modes.

Without the rescaling, outputs would be systematically smaller during training than during inference, causing a distribution shift that degrades model quality.
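The training-time formula can be sketched directly. This is a minimal illustration of the mask-and-rescale math above, not PyTorch's internal implementation:

```python
import torch

torch.manual_seed(0)
p = 0.1                                     # dropout_rate
x = torch.tensor([0.5, 0.3, 0.8, 0.2, 0.6])

# Training: random binary mask (1 with probability 1 - p), then rescale
mask = (torch.rand_like(x) > p).float()
train_out = x * mask / (1 - p)              # survivors scaled by 1/0.9 ≈ 1.11

# Inference: everything passes through unchanged, no scaling
eval_out = x

print(train_out)
print(eval_out)
```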

13.3.6 Where Dropout Sits in the Transformer

  1. After Attention: Attention → Dropout → residual connection
  2. After FFN: FFN → Dropout → residual connection

Dropout always appears before the residual addition.

13.3.7 PyTorch Implementation

import torch
import torch.nn as nn

x = torch.randn(5)            # example activations

# Create Dropout layer
dropout = nn.Dropout(p=0.1)   # drop 10% of activations

# Training mode (model.train() puts every submodule in training mode)
dropout.train()
output = dropout(x)           # randomly drops activations, rescales the rest

# Inference mode (model.eval() disables dropout)
dropout.eval()
output = dropout(x)           # passes everything through unchanged

PyTorch handles the training/inference mode switch automatically. Call model.train() before training, model.eval() before inference.


13.4 Pre-Norm vs Post-Norm

13.4.1 Two Layouts

One subtle architectural choice is where LayerNorm sits relative to the residual connection.

Post-Norm (original Transformer, 2017):

X → Attention → Add(+X) → LayerNorm → FFN → Add → LayerNorm → output

LayerNorm comes after the residual addition.

Pre-Norm (GPT-2 and later):

X → LayerNorm → Attention → Add(+X) → LayerNorm → FFN → Add → output

LayerNorm comes before each sublayer (before Attention, before FFN).

13.4.2 Why Pre-Norm Is Now Standard

Research and practice have converged on Pre-Norm for modern LLMs:

  1. More stable gradients: normalizing the input before each sublayer prevents pathological activations from building up
  2. Cleaner residual path: the residual addition does not pass through a normalization step, so the bypass lane carries the raw signal
  3. Better convergence on deep stacks: especially important for models with 20+ blocks

GPT-2, GPT-3, LLaMA, and essentially all modern decoder-only LLMs use Pre-Norm.
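The difference is easiest to see side by side. Here is a single-sublayer sketch, with an nn.Linear standing in for Attention or FFN (an illustrative assumption, not a full block):

```python
import torch
import torch.nn as nn

d_model = 8
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for Attention or FFN

x = torch.randn(2, 4, d_model)

# Post-Norm (original 2017 Transformer): normalize AFTER the residual add
post_out = norm(x + sublayer(x))

# Pre-Norm (GPT-2 and later): normalize the sublayer INPUT; the residual
# path carries the raw, unnormalized signal
pre_out = x + sublayer(norm(x))

print(post_out.shape, pre_out.shape)
```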


13.5 How Residual Connections and Dropout Work Together

13.5.1 Tracing Data Through the Block

Here is a complete data flow trace for one Transformer block:

Input X  [4, 16, 512]
         ↓
LayerNorm(X)                      # stabilize the input
         ↓
Attention(LayerNorm(X))           # compute context-aware updates
         ↓
Dropout(Attention(...))           # drop some updates (training only)
         ↓
X + Dropout(...)                  # residual: original signal + updates
         ↓
Output1  [4, 16, 512]             # shape preserved

The second sub-block (FFN) follows the same pattern.

13.5.2 Why This Combination Works

Technique             Problem Solved          Mechanism
Residual connection   Gradient vanishing      Direct bypass for gradient flow
Dropout               Overfitting             Random deactivation forces robustness
LayerNorm             Numerical instability   Normalizes activations to a stable range

Together:

  1. LayerNorm stabilizes the input before computation
  2. Attention or FFN learns the features
  3. Dropout adds regularization
  4. The residual connection ensures the original signal survives

Remove any one of these, and training becomes measurably harder — or fails entirely for deep stacks.


13.6 Dropout Rates in Practice

13.6.1 Common Configurations

Model    Dropout rate   Notes
GPT-2    0.1            Standard configuration
GPT-3    0.0 – 0.1      Varies across experiments
BERT     0.1            Standard configuration
LLaMA    0.0            No Dropout used

The trend is clear: larger models use less Dropout, and sometimes none at all.

Why? Large models with massive parameter counts trained on huge datasets are less prone to overfitting — the data diversity itself acts as regularization. Additionally, large-scale training runs are expensive enough that practitioners prefer not to risk degraded convergence from aggressive Dropout.

13.6.2 Residual Variants

The original Transformer uses plain addition. Some research has explored variations:

Scaled residual:

x = x + 0.1 * sublayer(x)   # scale down the residual contribution

Gated residual:

gate = torch.sigmoid(gate_proj(x))   # gate_proj: a learned nn.Linear(d_model, d_model)
x = x + gate * sublayer(x)           # learn how much to trust the sublayer output

These can improve stability in certain settings, but the standard Transformer sticks with simple addition. Simpler tends to generalize better at scale.
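For completeness, here is a runnable sketch of the gated variant. The gate projection and the linear sublayer are illustrative stand-ins, not part of the standard Transformer:

```python
import torch
import torch.nn as nn

d_model = 8
gate_proj = nn.Linear(d_model, d_model)   # learns per-element "trust" in the sublayer
sublayer = nn.Linear(d_model, d_model)    # stand-in for Attention or FFN

x = torch.randn(2, d_model)
gate = torch.sigmoid(gate_proj(x))        # sigmoid keeps the gate in (0, 1)
out = x + gate * sublayer(x)              # gated residual
print(out.shape)
```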


13.7 Chapter Summary

13.7.1 Key Concepts

Concept               Purpose                                            Formula / Effect
Residual connection   Prevent gradient vanishing, preserve information   output = sublayer(x) + x
Dropout               Prevent overfitting, force robustness              zero random activations during training
Pre-Norm              Stable training for deep stacks                    LayerNorm before each sublayer

13.7.2 Block Layout

Input X
    ↓
LayerNorm → Attention → Dropout → (+X) → output1
                                   ↑
                             residual here

output1
    ↓
LayerNorm → FFN → Dropout → (+output1) → output2
                                 ↑
                           residual here

13.7.3 Core Takeaway

Residual connections and Dropout are the engineering scaffolding that makes deep Transformer training practical. Residual connections give gradients a bypass route so early layers can learn; Dropout prevents the model from memorizing its training data. Neither is glamorous, but remove either one and training quality drops noticeably. The three pieces — residuals, Dropout, and LayerNorm — work together to keep deep stacks stable.


Chapter Checklist

After this chapter, you should be able to:

  • Explain why residual connections prevent gradient vanishing.
  • Describe the identity-mapping fallback that residual connections enable.
  • Explain how Dropout prevents overfitting.
  • State where residual connections and Dropout sit inside a Transformer block.
  • Distinguish Pre-Norm from Post-Norm and explain why Pre-Norm is preferred for modern LLMs.

See You in the Next Chapter

The residual connection adds the Attention output back to the original input. But that original input is a combination of two things: the token embedding and the positional encoding.

Chapter 14 asks a question that seems obvious but turns out to be subtle: why do we combine these two signals by adding them, rather than concatenating them?

Cite this page
Zhang, Wayland (2026). Chapter 13: Residual Connections and Dropout - The Secret to Training Stability. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-13-residual-dropout
@incollection{zhang2026transformer_chapter_13_residual_dropout,
  author = {Zhang, Wayland},
  title = {Chapter 13: Residual Connections and Dropout - The Secret to Training Stability},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-13-residual-dropout}
}