One-sentence summary: Multi-Head Attention splits Attention into several heads, each learning a different relationship pattern, then merges all views into a single representation.


11.1 Why Multiple Heads?

11.1.1 The Limit of a Single Head

Chapter 10 covered the complete Attention computation — but that was single-head Attention: one set of Q, K, V matrices, one relationship map.

Single-head Attention can learn one attention pattern at a time. Language needs more than one.

Take this sentence:

"The agent opened a pull request because the test suite was green."

Fully understanding it requires tracking several kinds of relationships simultaneously:

  • Syntactic: "opened" takes "agent" as its subject
  • Coreference: what does "it" refer to? (not in this sentence, but typical in longer text)
  • Causal: "because" connects the PR opening to the green tests
  • Positional: "opened" and "pull request" are neighbors

A single head cannot specialize in all of these at once.

11.1.2 The Solution: Multiple Heads in Parallel

Multi-Head Attention's core idea: run several Attention computations in parallel, each in a smaller subspace, so each head can specialize.

Head 1: might focus on syntactic structure (subject-verb-object)
Head 2: might focus on coreference (pronouns and nouns)
Head 3: might focus on local proximity (neighboring tokens)
Head 4: might focus on semantic similarity (related concepts)
...

Then all heads' outputs are merged into a single representation.

11.1.3 An Analogy

Think of code review from multiple teammates:

  • One reviewer checks correctness
  • One checks naming and style
  • One checks test coverage
  • One checks security implications

Each brings a different lens. You merge all their comments into your final understanding. Multi-Head Attention does exactly this — but the "lenses" are learned, not manually assigned.


11.2 Splitting Into Heads

11.2.1 The Dimension Split

Splitting d_model into num_heads subspaces

The key operation in Multi-Head Attention is splitting the model dimension across heads.

Using K (Key) as the example, with:

  • d_model = 512
  • num_heads = 4
  • Therefore d_key = d_model / num_heads = 512 / 4 = 128

The split unfolds as:

Original K: [batch_size, ctx_length, d_model]
          = [4, 16, 512]
              
Split:      [batch_size, ctx_length, num_heads, d_key]
          = [4, 16, 4, 128]
              
Transpose:  [batch_size, num_heads, ctx_length, d_key]
          = [4, 4, 16, 128]
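These three steps can be sketched directly in PyTorch. The tensor is random, standing in for real activations, and the shapes follow the chapter's running example:

```python
import torch

batch_size, ctx_length, d_model, num_heads = 4, 16, 512, 4
d_key = d_model // num_heads  # 512 // 4 = 128

K = torch.randn(batch_size, ctx_length, d_model)      # [4, 16, 512]
K = K.view(batch_size, ctx_length, num_heads, d_key)  # [4, 16, 4, 128]
K = K.transpose(1, 2)                                 # [4, 4, 16, 128]

print(K.shape)  # torch.Size([4, 4, 16, 128])
```

Both `view` and `transpose` only reinterpret strides, so the split costs no data movement; the transpose does leave the tensor non-contiguous, which is why the merge step later needs `.contiguous()`.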

11.2.2 Why Transpose?

The transpose brings num_heads to the second axis, giving the shape [batch, num_heads, seq_len, d_key]. This means:

  • For each sequence in the batch
  • We have num_heads independent Attention computations
  • Each one processes seq_len positions
  • Each position uses a d_key-dimensional vector

With this layout, every head can compute Attention independently, without interfering with the others.

11.2.3 The Same Split Applies to Q, K, and V

Q: [4, 16, 512] → [4, 4, 16, 128]
K: [4, 16, 512] → [4, 4, 16, 128]
V: [4, 16, 512] → [4, 4, 16, 128]

We now have 4 sets of (Q, K, V), ready for 4 independent Attention computations.

11.2.4 Two Equivalent Implementations

Conceptual vs practical multi-head implementation

There are two ways to think about the split, and they are mathematically equivalent:

Conceptual view: each head has its own small Wq, Wk, Wv matrices. Head h computes Q_h = X @ Wq_h with a [d_model, d_key] matrix.

Practical view: one large Wq generates the full Q of shape [batch, seq, d_model], then we reshape and split along the last dimension into num_heads slices.

Real implementations use the practical view: a single large matrix multiplication is more GPU-efficient than many small ones, because GPUs favor large, contiguous operations over small scattered ones.
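The equivalence of the two views can be checked numerically. This sketch uses a plain weight matrix (not an nn.Linear) and compares one head's slice of the big projection against a matmul with the corresponding column slice of Wq:

```python
import torch

torch.manual_seed(0)
batch, seq, d_model, num_heads = 4, 16, 512, 4
d_key = d_model // num_heads

X = torch.randn(batch, seq, d_model)
Wq = torch.randn(d_model, d_model)  # one large projection (practical view)

# Practical view: one big matmul, then split the last dimension
Q_big = X @ Wq                                             # [4, 16, 512]
Q_split = Q_big.view(batch, seq, num_heads, d_key).transpose(1, 2)

# Conceptual view: head h owns a [d_model, d_key] column slice of Wq
Q_head2 = X @ Wq[:, 2 * d_key : 3 * d_key]                 # head index 2

# The two views produce the same numbers for that head
print(torch.allclose(Q_split[:, 2], Q_head2, atol=1e-5))   # True
```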


11.3 Computing All Heads in Parallel

11.3.1 Each Head Is Independent

All heads computing Attention in parallel

After the split, every head executes the same Attention formula independently:

For each head h = 1, 2, 3, 4:
    scores_h   = Q_h @ K_h^T    [4, 16, 128] @ [4, 128, 16] = [4, 16, 16]
    weights_h  = softmax(scores_h / sqrt(d_key))
    output_h   = weights_h @ V_h    [4, 16, 16] @ [4, 16, 128] = [4, 16, 128]

11.3.2 Dimension Tracking

Q @ K^T for one head:

Q:   [4, 4, 16, 128]
     batch  heads  seq  d_key

K^T: [4, 4, 128, 16]
     batch  heads  d_key  seq

Q @ K^T: [4, 4, 16, 16]
         batch  heads  seq  seq

Softmax(Q @ K^T) @ V:

Attention Weights: [4, 4, 16, 16]
                   batch  heads  seq  seq

V: [4, 4, 16, 128]
   batch  heads  seq  d_key

Output: [4, 4, 16, 128]
        batch  heads  seq  d_key
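The bookkeeping above can be verified with random tensors. PyTorch's batched matmul broadcasts over the leading batch and head axes, so all heads are computed in one call:

```python
import torch
import torch.nn.functional as F

batch, num_heads, seq, d_key = 4, 4, 16, 128
Q = torch.randn(batch, num_heads, seq, d_key)
K = torch.randn(batch, num_heads, seq, d_key)
V = torch.randn(batch, num_heads, seq, d_key)

scores = Q @ K.transpose(-2, -1) / d_key ** 0.5  # [4, 4, 16, 16]
weights = F.softmax(scores, dim=-1)              # each row sums to 1
output = weights @ V                             # [4, 4, 16, 128]

print(scores.shape, output.shape)
```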

11.3.3 What the Parallelism Gets You

The total computation is the same as a single-head Attention with width 512. But running four heads of width 128 means each head operates in a smaller, more focused subspace. Each head can develop a clean specialization instead of trying to capture every relationship pattern in one large matrix.


11.4 Merging the Heads Back

11.4.1 Concatenation

Concatenate head outputs and apply Wo projection

After all heads compute their output, we concatenate them back into the full model dimension:

Head outputs: [4, 4, 16, 128]
              batch  heads  seq  d_key
                   
Transpose:    [4, 16, 4, 128]
              batch  seq  heads  d_key
                   
Concatenate:  [4, 16, 512]
              batch  seq  d_model

The concatenation operation just merges the last two dimensions:

  • 4 heads × 128 dimensions = 512 dimensions
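The merge is the split run backwards. A minimal sketch, with random head outputs standing in for the real ones:

```python
import torch

batch, num_heads, seq, d_key = 4, 4, 16, 128
head_outputs = torch.randn(batch, num_heads, seq, d_key)  # [4, 4, 16, 128]

merged = head_outputs.transpose(1, 2)                     # [4, 16, 4, 128]
merged = merged.contiguous().view(batch, seq, num_heads * d_key)

print(merged.shape)  # torch.Size([4, 16, 512])
```

The `.contiguous()` call is required because `transpose` returns a non-contiguous view, which `view` cannot reshape directly.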

11.4.2 The Output Projection Wo

Concatenation is mechanical. It puts the heads' outputs next to each other but does not let them interact. That is what Wo is for:

A @ Wo
[4, 16, 512] @ [512, 512] = [4, 16, 512]

Wo is a learned projection matrix. Its job:

  1. Mix information across heads — what each head learned can now influence the others
  2. Project the concatenated representation into a unified space
  3. Let the model decide how to weight each head's contribution

11.4.3 Why Wo Matters

Without Wo, Head 1's output and Head 3's output sit in different regions of the 512-dimensional vector, and nothing connects them. Wo provides one round of cross-head communication before passing the result to the next block.


11.5 Comparing the Outputs: Before and After Wo

11.5.1 A vs A @ Wo

Comparing concatenated output vs Wo-projected output

Before Wo (A):

  • Shape: [4, 16, 512]
  • Values: the raw concatenation of all heads' output vectors

After Wo (A @ Wo):

  • Shape: [4, 16, 512]
  • Values: a mixed, projected representation

Same shape. Different content. The post-Wo output is what flows into the residual connection, then LayerNorm, then the FFN.


11.6 Full Multi-Head Attention Flow

11.6.1 End to End

Input X [batch, seq, d_model]
    ↓
Generate Q, K, V (via Wq, Wk, Wv)
    ↓
Split into heads [batch, num_heads, seq, d_key]
    ↓
Compute Attention independently per head
    ↓
Concatenate [batch, seq, d_model]
    ↓
Output projection (@ Wo)
    ↓
Output [batch, seq, d_model]

11.6.2 PyTorch Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_key = d_model // num_heads

        # Four learnable weight matrices
        self.Wq = nn.Linear(d_model, d_model)
        self.Wk = nn.Linear(d_model, d_model)
        self.Wv = nn.Linear(d_model, d_model)
        self.Wo = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape

        # 1. Generate Q, K, V
        Q = self.Wq(x)   # [batch, seq, d_model]
        K = self.Wk(x)
        V = self.Wv(x)

        # 2. Split into heads
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_key)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_key)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_key)

        # Transpose: [batch, num_heads, seq, d_key]
        Q = Q.transpose(1, 2)
        K = K.transpose(1, 2)
        V = V.transpose(1, 2)

        # 3. Attention per head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_key ** 0.5)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attention_weights = F.softmax(scores, dim=-1)
        attention_output = torch.matmul(attention_weights, V)

        # 4. Merge heads
        attention_output = attention_output.transpose(1, 2)    # [batch, seq, heads, d_key]
        attention_output = attention_output.contiguous().view(
            batch_size, seq_len, self.d_model
        )

        # 5. Output projection
        output = self.Wo(attention_output)

        return output

11.7 Key Numbers

11.7.1 Parameter Count

Multi-Head Attention has four weight matrices:

Matrix  Shape               Parameters
Wq      [d_model, d_model]  d_model²
Wk      [d_model, d_model]  d_model²
Wv      [d_model, d_model]  d_model²
Wo      [d_model, d_model]  d_model²

Total: 4 × d_model²

For GPT-2 Small (d_model = 768): 4 × 768² ≈ 2.36 million parameters per Attention layer.
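The count can be checked against nn.Linear directly. One detail the 4 × d_model² figure leaves out: each nn.Linear also carries a bias vector, adding 4 × d_model parameters per layer:

```python
import torch.nn as nn

d_model = 768  # GPT-2 Small

# Weight-matrix parameters only, as counted in the table above
weights_only = 4 * d_model ** 2
print(weights_only)  # 2359296, i.e. ~2.36 million

# The four nn.Linear layers add a bias vector each on top of that
layers = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(4))
total = sum(p.numel() for p in layers.parameters())
print(total - weights_only)  # 3072 = 4 * 768 bias terms
```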

11.7.2 Common Configurations

Model         d_model  num_heads  d_key
GPT-2 Small   768      12         64
GPT-2 Medium  1024     16         64
GPT-2 Large   1280     20         64
GPT-3         12288    96         128
LLaMA-7B      4096     32         128

Notice: d_key stays at 64 or 128 across a wide range of model sizes. Bigger models add more heads rather than making each head wider.
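A quick check of the table, computing d_key = d_model / num_heads for each configuration:

```python
# (d_model, num_heads) for each configuration in the table
configs = {
    "GPT-2 Small":  (768, 12),
    "GPT-2 Medium": (1024, 16),
    "GPT-2 Large":  (1280, 20),
    "GPT-3":        (12288, 96),
    "LLaMA-7B":     (4096, 32),
}

for name, (d_model, num_heads) in configs.items():
    print(name, d_model // num_heads)
# Every d_key comes out to 64 or 128
```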


11.8 What Do the Heads Actually Learn?

11.8.1 Observed Patterns from Research

Researchers have identified recurring patterns in trained Attention heads:

Head type    Pattern                         Example
Positional   Attends to nearby fixed offsets always look one position back
Syntactic    Tracks subject-verb-object      verb attends to its subject
Semantic     Groups related concepts         synonyms attend to each other
Coreference  Resolves pronoun references     "it" attends to the noun it replaces
Delimiter    Tracks sentence boundaries      attends to punctuation

11.8.2 A Practical Example

For "The agent merged the pull request after review":

Head 1 (positional): "merged" mainly attends to "agent" (adjacent subject)
Head 2 (syntactic):  "merged" mainly attends to "agent" (grammatical subject)
Head 3 (semantic):   "pull request" and "review" attend to each other
Head 4 (coreference): not active here (no pronouns)

11.8.3 Head Redundancy

Not all heads are equally important. Research shows:

  • Some heads can be pruned with minimal performance loss
  • Some heads learn redundant patterns
  • But keeping more heads generally improves robustness and reduces training sensitivity

The right number of heads is empirical. There is no closed-form answer.


11.9 Multi-Head vs Single-Head

11.9.1 Compute Comparison

For d_model = 512, num_heads = 8, d_key = 64:

Single head (with d_key = 512):

  • Q @ K^T: [seq, 512] @ [512, seq] → O(seq² × 512)

Eight heads (with d_key = 64 each):

  • Each head: [seq, 64] @ [64, seq] → O(seq² × 64)
  • Total: 8 × O(seq² × 64) = O(seq² × 512)

Same total compute. Different capability.
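The equality is simple arithmetic. Counting multiply-adds for Q @ K^T with the section's numbers (seq = 16 is an arbitrary choice; the equality holds for any sequence length):

```python
seq, d_model, num_heads = 16, 512, 8
d_key = d_model // num_heads  # 64

# Q @ K^T multiply-adds, one head of width 512
single_head = seq * seq * d_model           # O(seq^2 * 512)

# Eight heads of width 64 each
multi_head = num_heads * seq * seq * d_key  # 8 * O(seq^2 * 64)

print(single_head == multi_head)  # True
```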

11.9.2 Why Not More Heads?

More heads means smaller d_key:

d_key = d_model / num_heads

If d_key gets too small, each head has too few dimensions to represent a useful subspace. Empirically, d_key = 64 or d_key = 128 is the practical sweet spot.


11.10 Part 3 Wrap-Up

This chapter closes Part 3: Attention Mechanisms. Here is what we covered across the four chapters:

Chapter     Topic               Core idea
Chapter 8   Linear Transforms   Matrix multiplication as projection and similarity
Chapter 9   Attention Geometry  Dot product as a similarity measure
Chapter 10  Q, K, V             The three roles and the full computation
Chapter 11  Multi-Head          Parallel views; concatenation and Wo

The complete Multi-Head Attention formula:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O

Where:

\text{head}_i = \text{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V)

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Chapter Checklist

After this chapter, you should be able to:

  • Explain why a single Attention head has limitations.
  • Derive the relationship d_key = d_model / num_heads.
  • Trace dimension changes through split, compute, and merge.
  • Explain what Wo does after concatenation.
  • Describe the kinds of patterns different heads can learn.

See You in the Next Chapter

Chapter 12 ties up the remaining conceptual thread: what does the Attention output actually represent, and what are the two things training is adjusting simultaneously?

Part 4 follows immediately after, assembling all of the components we have built — tokenization, positional encoding, Attention, and the FFN — into the full Transformer block architecture.

Cite this page
Zhang, Wayland (2026). Chapter 11: Multi-Head Attention - Several Views at Once. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-11-multi-head-attention
@incollection{zhang2026transformer_chapter_11_multi_head_attention,
  author = {Zhang, Wayland},
  title = {Chapter 11: Multi-Head Attention - Several Views at Once},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-11-multi-head-attention}
}