One-sentence summary: Residual connections let information travel along a bypass lane directly to later layers, mitigating vanishing gradients in deep networks; Dropout randomly zeroes activations during training to prevent overfitting. These two techniques are the secret to stable Transformer training.


13.1 Revisiting the Transformer Block

Before diving into residual connections and Dropout, let us look at the complete Transformer block structure:

Input X
    ↓
Layer Norm
    ↓
Masked Multi-Head Attention
    ↓
Dropout
    ↓
Residual connection (+ X)    ← first residual
    ↓
Layer Norm
    ↓
Feed Forward Network (FFN)
    ↓
Dropout
    ↓
Residual connection (+ previous output)    ← second residual
    ↓
Output

Each block has two residual connections and two Dropout layers. This chapter explains what they do and why they exist.


13.2 Residual Connections: Information's Bypass Lane

13.2.1 The Problem with Deep Networks

Residual connection: input bypasses the sublayer and is added back to the output

As neural networks get deeper, a serious problem emerges: vanishing gradients.

Imagine information flowing from layer 1 through to layer 12:

Layer 1 → Layer 2 → Layer 3 → ... → Layer 12

Each layer processes the signal. After 12 layers:

  • The original signal may be severely distorted
  • Gradients shrink at each layer during backpropagation
  • Layers near the input receive near-zero gradient updates and essentially learn nothing

This was a well-known problem in deep learning before residual connections.

13.2.2 The Fix: A Bypass Lane

The residual connection idea is simple: let the input skip the layer and be added directly to the output.

Input X ──────────────────────┐
   ↓                          │ (bypass lane)
Sublayer                      │
   ↓                          │
Output ←──────────────────────┘ + X

The formula:

output = sublayer(X) + X

Instead of only outputting sublayer(X), we add the original input back.

13.2.3 Numeric Example

The values below come from a trained model run. Here is what the computation looks like:

Attention output (after Dropout):

[4, 16, 512] tensor
First values: -0.07005,  0.09600,  0.03522, ...

Original input X:

[4, 16, 512] tensor
First values:  0.50748, -1.96800,  5.14941, ...

After residual connection:

output = Attention_output + X
       = [-0.07005 + 0.50748,  0.09600 + (-1.96800), ...]
       = [ 0.43743,           -1.87200,              ...]

It is element-wise addition. Nothing exotic.
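As a sanity check, the addition above can be reproduced with the three example values (the first three of 512 dimensions):

```python
import torch

# The three example values from above (first three of 512 dimensions)
attn_out = torch.tensor([-0.07005, 0.09600, 0.03522])
x        = torch.tensor([ 0.50748, -1.96800, 5.14941])

# Residual connection: plain element-wise addition
out = attn_out + x   # elementwise: 0.43743, -1.87200, 5.18463
print(out)
```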

13.2.4 Why Residual Connections Work

1. Gradient flow

During backpropagation, the gradient through a residual connection is:

$$\frac{\partial \mathcal{L}}{\partial X} = \frac{\partial \mathcal{L}}{\partial \text{output}} \cdot \left(\frac{\partial\,\text{sublayer}(X)}{\partial X} + 1\right)$$

Even if $\frac{\partial\,\text{sublayer}(X)}{\partial X}$ is close to zero (the vanishing-gradient case), the $+1$ term ensures gradient still flows. The bypass lane carries the gradient directly.

2. Identity mapping as a fallback

If a layer does not yet know what to learn, it can default to outputting near-zero:

sublayer(X) ≈ 0
output ≈ 0 + X = X

This effectively makes the layer a no-op. The information passes through unchanged. This is much easier to achieve than learning a perfect identity transformation from scratch.
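Both properties can be verified with a tiny autograd sketch: even a "dead" sublayer whose output is constant zero still lets a gradient of 1 reach the input through the bypass lane.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# A "dead" sublayer: constant zero output, so its local gradient is 0
sublayer_out = torch.zeros_like(x)

out = sublayer_out + x     # residual connection: output = sublayer(X) + X
out.sum().backward()

print(x.grad)              # [1., 1., 1.] -- the +1 path alone carries gradient
```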

3. Information preservation

Original information always survives. No matter how many layers exist, the input signal is never completely overwritten — it is always being added back.

13.2.5 Where Residual Connections Sit in the Transformer

First residual connection: after Attention

X → LayerNorm → Attention → Dropout → (+X) → output1

Second residual connection: after FFN

output1 → LayerNorm → FFN → Dropout → (+output1) → output2

13.2.6 PyTorch Implementation

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # First residual connection
        attn_output = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_output)   # residual here

        # Second residual connection
        ffn_output = self.ffn(self.norm2(x))
        x = x + self.dropout(ffn_output)    # residual here

        return x

The two key lines are x = x + .... Everything else is the sublayer computation.
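The block above depends on the MultiHeadAttention and FeedForward modules built in earlier chapters. For a quick end-to-end check, here is a runnable sketch with minimal stand-ins (the stand-in modules are illustrative assumptions, not the book's exact implementations):

```python
import torch
import torch.nn as nn

# Minimal stand-ins for the modules from earlier chapters (illustrative only)
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x, mask=None):
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

# Same structure as the TransformerBlock above
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        x = x + self.dropout(self.attention(self.norm1(x), mask))  # residual 1
        x = x + self.dropout(self.ffn(self.norm2(x)))              # residual 2
        return x

block = TransformerBlock(d_model=512, num_heads=8, d_ff=2048)
out = block(torch.randn(4, 16, 512))
print(out.shape)   # residual addition requires input and output shapes to match
```

Note that the residual additions force every sublayer to preserve the [batch, seq_len, d_model] shape.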


13.3 Dropout: Random Removal Prevents Overfitting

13.3.1 The Overfitting Problem

Dropout randomly zeroes activations during training

Neural networks have a tendency to overfit: they memorize training data rather than learning generalizable patterns.

Think of a model like a student who memorizes every practice problem without understanding the underlying concepts. The student scores 100% on practice tests, then fails on any unfamiliar exam question.

Overfitting is the model's version of that.

13.3.2 Dropout: Random Deactivation

Dropout's idea is surprising in its simplicity: randomly disable some neurons during training.

During each forward pass, Dropout creates a random binary mask. Some activations are zeroed out. Others pass through normally.

Normal:  [0.5, 0.3, 0.8, 0.2, 0.6]    ← all neurons active
Dropout: [0.5, 0.3, 0.0, 0.2, 0.6]    ← 0.8 is zeroed out

Which neurons get dropped? Different ones each time, randomly.

13.3.3 The Intuition

Think of a software team:

Without Dropout: one engineer is brilliant and ends up doing everything. When that engineer leaves, the team collapses.

With Dropout: every day, some team members are randomly "on leave." Everyone must learn to cover multiple responsibilities. The team becomes more robust because no single person is a single point of failure.

In neural networks:

  • Dropout forces each neuron to function without relying on any specific partner
  • Each neuron learns to be useful independently
  • The network becomes more resilient — more distributed representations

13.3.4 The Math

During training:

mask   = random binary tensor, 1 with probability (1 - dropout_rate)
output = input * mask / (1 - dropout_rate)

Example with dropout_rate = 0.1 (dropping 10%):

input  = [0.5, 0.3, 0.8, 0.2, 0.6]
mask   = [1,   1,   0,   1,   1  ]    # 0.8 gets dropped
output = [0.5, 0.3, 0.0, 0.2, 0.6] / 0.9
       = [0.56, 0.33, 0.00, 0.22, 0.67]

During inference:

output = input    # no dropout, pass everything through

13.3.5 Why the Rescaling?

The division by (1 - dropout_rate) keeps the expected value consistent between training and inference.

If 10% of neurons are dropped during training, the remaining 90% are rescaled by 1/0.9 ≈ 1.11. During inference, all 100% of neurons are active, so no scaling is needed. The expected output magnitude matches in both modes.

Without the rescaling, outputs would be systematically smaller during training than during inference, causing a distribution shift that degrades model quality.
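The training-time formula can be sketched directly. This is a minimal illustration of the mask-and-rescale math above, not PyTorch's internal implementation:

```python
import torch

torch.manual_seed(0)
p = 0.1                                     # dropout_rate
x = torch.tensor([0.5, 0.3, 0.8, 0.2, 0.6])

# Training: random binary mask (1 with probability 1 - p), then rescale
mask = (torch.rand_like(x) > p).float()
train_out = x * mask / (1 - p)              # survivors scaled by 1/0.9 ≈ 1.11

# Inference: everything passes through unchanged, no scaling
eval_out = x

print(train_out)
print(eval_out)
```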

13.3.6 Where Dropout Sits in the Transformer

  1. After Attention: Attention → Dropout → residual connection
  2. After FFN: FFN → Dropout → residual connection

Dropout always appears before the residual addition.

13.3.7 PyTorch Implementation

import torch
import torch.nn as nn

x = torch.randn(5)            # example activations

# Create Dropout layer
dropout = nn.Dropout(p=0.1)   # drop 10% of activations

# Training mode (model.train() puts every submodule in training mode)
dropout.train()
output = dropout(x)           # randomly drops activations, rescales the rest

# Inference mode (model.eval() disables dropout)
dropout.eval()
output = dropout(x)           # passes everything through unchanged

PyTorch handles the training/inference mode switch automatically. Call model.train() before training, model.eval() before inference.


13.4 Pre-Norm vs Post-Norm

13.4.1 Two Layouts

One subtle architectural choice is where LayerNorm sits relative to the residual connection.

Post-Norm (original Transformer, 2017):

X → Attention → Add(+X) → LayerNorm → FFN → Add → LayerNorm → output

LayerNorm comes after the residual addition.

Pre-Norm (GPT-2 and later):

X → LayerNorm → Attention → Add(+X) → LayerNorm → FFN → Add → output

LayerNorm comes before each sublayer (before Attention, before FFN).

13.4.2 Why Pre-Norm Is Now Standard

Research and practice have converged on Pre-Norm for modern LLMs:

  1. More stable gradients: normalizing the input before each sublayer prevents pathological activations from building up
  2. Cleaner residual path: the residual addition does not pass through a normalization step, so the bypass lane carries the raw signal
  3. Better convergence on deep stacks: especially important for models with 20+ blocks

GPT-2, GPT-3, LLaMA, and essentially all modern decoder-only LLMs use Pre-Norm.
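The difference is easiest to see side by side. Here is a single-sublayer sketch, with an nn.Linear standing in for Attention or FFN (an illustrative assumption, not a full block):

```python
import torch
import torch.nn as nn

d_model = 8
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for Attention or FFN

x = torch.randn(2, 4, d_model)

# Post-Norm (original 2017 Transformer): normalize AFTER the residual add
post_out = norm(x + sublayer(x))

# Pre-Norm (GPT-2 and later): normalize the sublayer INPUT; the residual
# path carries the raw, unnormalized signal
pre_out = x + sublayer(norm(x))

print(post_out.shape, pre_out.shape)
```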


13.5 How Residual Connections and Dropout Work Together

13.5.1 Tracing Data Through the Block

Here is a complete data flow trace for one Transformer block:

Input X  [4, 16, 512]
         ↓
LayerNorm(X)                      # stabilize the input
         ↓
Attention(LayerNorm(X))           # compute context-aware updates
         ↓
Dropout(Attention(...))           # drop some updates (training only)
         ↓
X + Dropout(...)                  # residual: original signal + updates
         ↓
Output1  [4, 16, 512]             # shape preserved

The second sub-block (FFN) follows the same pattern.

13.5.2 Why This Combination Works

Technique             Problem Solved          Mechanism
Residual connection   Gradient vanishing      Direct bypass for gradient flow
Dropout               Overfitting             Random deactivation forces robustness
LayerNorm             Numerical instability   Normalizes activations to a stable range

Together:

  1. LayerNorm stabilizes the input before computation
  2. Attention or FFN learns the features
  3. Dropout adds regularization
  4. The residual connection ensures the original signal survives

Remove any one of these, and training becomes measurably harder — or fails entirely for deep stacks.


13.6 Dropout Rates in Practice

13.6.1 Common Configurations

Model    Dropout rate   Notes
GPT-2    0.1            Standard configuration
GPT-3    0.0 – 0.1      Varies across experiments
BERT     0.1            Standard configuration
LLaMA    0.0            No Dropout used

The trend is clear: larger models use less Dropout, and sometimes none at all.

Why? Large models with massive parameter counts trained on huge datasets are less prone to overfitting — the data diversity itself acts as regularization. Additionally, large-scale training runs are expensive enough that practitioners prefer not to risk degraded convergence from aggressive Dropout.

13.6.2 Residual Variants

The original Transformer uses plain addition. Some research has explored variations:

Scaled residual:

x = x + 0.1 * sublayer(x)   # scale down the residual contribution

Gated residual:

gate = torch.sigmoid(gate_proj(x))   # gate_proj: a learned nn.Linear(d_model, d_model)
x = x + gate * sublayer(x)           # learn how much to trust the sublayer output

These can improve stability in certain settings, but the standard Transformer sticks with simple addition. Simpler tends to generalize better at scale.
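For completeness, here is a runnable sketch of the gated variant. The gate projection and the linear sublayer are illustrative stand-ins, not part of the standard Transformer:

```python
import torch
import torch.nn as nn

d_model = 8
gate_proj = nn.Linear(d_model, d_model)   # learns per-element "trust" in the sublayer
sublayer = nn.Linear(d_model, d_model)    # stand-in for Attention or FFN

x = torch.randn(2, d_model)
gate = torch.sigmoid(gate_proj(x))        # sigmoid keeps the gate in (0, 1)
out = x + gate * sublayer(x)              # gated residual
print(out.shape)
```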


13.7 Chapter Summary

13.7.1 Key Concepts

Concept               Purpose                                            Formula / Effect
Residual connection   Prevent gradient vanishing, preserve information   output = sublayer(x) + x
Dropout               Prevent overfitting, force robustness              zero random activations during training
Pre-Norm              Stable training for deep stacks                    LayerNorm before each sublayer

13.7.2 Block Layout

Input X
    ↓
LayerNorm → Attention → Dropout → (+X) → output1
                                   ↑
                             residual here

output1
    ↓
LayerNorm → FFN → Dropout → (+output1) → output2
                                 ↑
                           residual here

13.7.3 Core Takeaway

Residual connections and Dropout are the engineering scaffolding that makes deep Transformer training practical. Residual connections give gradients a bypass route so early layers can learn; Dropout prevents the model from memorizing its training data. Neither is glamorous, but remove either one and training quality drops noticeably. The three pieces — residuals, Dropout, and LayerNorm — work together to keep deep stacks stable.


Chapter Checklist

After this chapter, you should be able to:

  • Explain why residual connections prevent gradient vanishing.
  • Describe the identity-mapping fallback that residual connections enable.
  • Explain how Dropout prevents overfitting.
  • State where residual connections and Dropout sit inside a Transformer block.
  • Distinguish Pre-Norm from Post-Norm and explain why Pre-Norm is preferred for modern LLMs.

See You in the Next Chapter

The residual connection adds the Attention output back to the original input. But that original input is a combination of two things: the token embedding and the positional encoding.

Chapter 14 asks a question that seems obvious but turns out to be subtle: why do we combine these two signals by adding them, rather than concatenating them?

Cite this page
Zhang, Wayland (2026). Chapter 13: Residual Connections and Dropout - The Secret to Training Stability. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-13-residual-dropout
@incollection{zhang2026transformer_chapter_13_residual_dropout,
  author = {Zhang, Wayland},
  title = {Chapter 13: Residual Connections and Dropout - The Secret to Training Stability},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-13-residual-dropout}
}