One-sentence summary: QKV Attention is not predicting anything — it is continuously adjusting each token's embedding vector so that it becomes meaningful in context.
12.1 A Short Chapter With a Big Payoff
This chapter wraps up Multi-Head Attention. In the previous chapters, we derived the output tensor A step by step. You should have a clear picture of how it is computed.
What many explanations skip is what A actually means — and what the model is doing when it produces it.
Three things to cover:
- Concatenate: merging multiple heads back into a single tensor
- Linear transform Wo: the final matrix multiplication
- The essence of training: what QKV is actually adjusting
Once these are clear, Layer Normalization and residual connections become much easier to understand.
12.2 The Shape of A: A Four-Dimensional Tensor
12.2.1 Attention Visualization
The diagram shows the core Attention process:
- Q (Query): the query vector for the current token being processed
- K (Key): the key matrix for all tokens in the context
- V (Value): the value matrix for all tokens in the context
Q dot-products against each row of K to produce attention weights, then those weights blend the rows of V. That blend is the output for this token: Q queries K, finds relevant positions, retrieves and aggregates from V.
12.2.2 Shape Breakdown
The Multi-Head Attention output A is a four-dimensional tensor:
A: [4, 4, 16, 128]
│ │ │ └── per-head dimension (d_head = 512 / 4 = 128)
│ │ └────── sequence length (seq_len = 16)
│ └────────── number of heads (num_heads = 4)
└───────────── batch size (batch_size = 4)
Breaking it down:
- First 4: 4 sequences in the batch
- Second 4: the 512-dimensional model width split into 4 heads
- 16: 16 tokens per sequence
- 128: each head's subspace dimension
The actual computation happens on each [16, 128] slice — one per head per sequence. The four-dimensional shape is just the packaging.
┌─────────────────────────────────────┐
│ Sequence 1 │
│ ┌──────┬──────┬──────┬──────┐ │
│ │ Head1│ Head2│ Head3│ Head4│ │
│ │16×128│16×128│16×128│16×128│ │
│ └──────┴──────┴──────┴──────┘ │
├─────────────────────────────────────┤
│ Sequence 2 (same structure) │
├─────────────────────────────────────┤
│ Sequence 3 (same structure) │
├─────────────────────────────────────┤
│ Sequence 4 (same structure) │
└─────────────────────────────────────┘
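The shape bookkeeping above can be checked directly in PyTorch. This is a minimal sketch: the tensor values are random stand-ins, and only the shapes matter.

```python
import torch

batch_size, num_heads, seq_len, d_head = 4, 4, 16, 128

# Stand-in for the Multi-Head Attention output A.
A = torch.randn(batch_size, num_heads, seq_len, d_head)

print(A.shape)        # torch.Size([4, 4, 16, 128])

# The actual computation happens on each [16, 128] slice:
# one head's view of one sequence.
print(A[0, 0].shape)  # torch.Size([16, 128])
```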
12.2.3 Real Model Sizes
For GPT-2 (117M parameters):
d_model = 768, num_heads = 12, d_head = 64
For LLaMA-7B:
d_model = 4096, num_heads = 32, d_head = 128
The same four-dimensional structure, just much larger in practice.
12.3 Concatenate: Merging Heads Back Together
12.3.1 The Merge Operation
Concatenate (often called "concat") is the step that reassembles the heads into a single tensor. It is the inverse of the split from Chapter 11.
Before concat: [4, 4, 16, 128] → 4 heads, each 128-dimensional
After concat: [4, 16, 512] → 1 unified 512-dimensional tensor
We move the head axis next to d_head (giving [4, 16, 4, 128]) and then merge those last two dimensions [4, 128] into [512]. That is it: a transpose followed by a reshape.
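In PyTorch this merge is a transpose followed by a view. A sketch with random data, using the shape constants from our running example:

```python
import torch

A = torch.randn(4, 4, 16, 128)  # [batch, heads, seq_len, d_head]

# Step 1: move the head axis next to d_head → [4, 16, 4, 128]
# Step 2: flatten heads × d_head into one 512-dim axis → [4, 16, 512]
merged = A.transpose(1, 2).contiguous().view(4, 16, 4 * 128)

print(merged.shape)  # torch.Size([4, 16, 512])
```

The .contiguous() call is needed because view requires the tensor's memory layout to match its new shape after the transpose.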
12.3.2 Why Split and Merge at All?
The question I had when I first learned this: why cut the vector into pieces, compute Attention on each piece, then glue them back? What does the detour accomplish?
The answer is multi-perspective representation.
Each head operates in a different subspace of the full model dimension. They do not share parameters. Over training, they tend to specialize:
- Head 1 might learn syntax-sensitive patterns
- Head 2 might learn semantic similarity
- Head 3 might learn positional proximity
- Head 4 might learn topic continuity
Splitting forces this specialization. Merging allows those specialized views to inform a single representation.
A tradeoff to keep in mind: more heads means richer representational capacity, but also more parameters and more computation. The empirical sweet spot is usually
d_head = 64 or d_head = 128. There is no theoretical formula for the right number of heads — it is tuned experimentally.
12.4 Wo: The Final Linear Transform
12.4.1 What Wo Is
After concatenation, one final matrix multiplication:
Shape of Wo: [512, 512]
Operation: A @ Wo → final output
Wo (the Output weight matrix) is structurally identical to Wq, Wk, and Wv:
- Shape: [d_model, d_model] = [512, 512]
- Initialization: random
- Type: trainable parameters
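A minimal sketch of the Wo step. Here Wo is a plain random tensor for illustration; in a real model it would be a trainable parameter (typically an nn.Linear layer).

```python
import torch

d_model = 512
Wo = torch.randn(d_model, d_model)      # stand-in; trainable in a real model

A_concat = torch.randn(4, 16, d_model)  # output of the concat step
out = A_concat @ Wo                     # the final linear transform

print(out.shape)  # torch.Size([4, 16, 512]) — shape unchanged, content mixed
```

Note that Wo does not change the shape; its job is to mix information across the four head subspaces into one coherent 512-dimensional representation.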
12.4.2 Weight Sharing Rules
This confused me when I was learning, so I want to be explicit.
Within one Transformer block: all heads share one Wq, one Wk, one Wv, and there is one Wo. The heads are not separate modules — they share the projection matrices (via the reshape trick from Chapter 11), and Wo recombines their outputs.
Across Transformer blocks: each block has its own independent set of Wq, Wk, Wv, and Wo. A 12-block model has 12 separate sets of these matrices, each learning something slightly different.
Block 1: Wq₁, Wk₁, Wv₁, Wo₁ ← first set
Block 2: Wq₂, Wk₂, Wv₂, Wo₂ ← second set
...
Block 12: Wq₁₂, Wk₁₂, Wv₁₂, Wo₁₂ ← twelfth set
Each block computes its own Attention with its own weights, refining the representation one level deeper.
12.4.3 In PyTorch
When using PyTorch's nn.MultiheadAttention, the weight matrices are handled internally:
self.attn = nn.MultiheadAttention(embed_dim=512, num_heads=4)
# Internally:
# self.attn.in_proj_weight → packs Wq, Wk, Wv together
# self.attn.out_proj.weight → this is Wo
The Hugging Face transformers library wraps this further, but the same four matrices are there.
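You can verify the packing yourself by inspecting the module's weight shapes:

```python
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=4)

# Wq, Wk, Wv stacked along dim 0: [3 * 512, 512]
print(attn.in_proj_weight.shape)   # torch.Size([1536, 512])

# Wo:
print(attn.out_proj.weight.shape)  # torch.Size([512, 512])
```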
12.5 What Q × K Is Actually Computing
12.5.1 The Score Matrix
Let's revisit what Q multiplied by K (more precisely, Q @ Kᵀ) produces. Taking the first batch, first head, we get a 16×16 square matrix.
Token1 Token2 Token3 ... Token16
Token1 [ 0.20 0.10 0.05 ... 0.01 ]
Token2 [ 0.15 0.30 0.10 ... 0.02 ]
Token3 [ 0.08 0.12 0.25 ... 0.03 ]
...
Token16 [ 0.01 0.02 0.01 ... 0.40 ]
12.5.2 Geometric Intuition
How to read this matrix:
- Each row: one token's attention perspective
- Each column: one token's visibility to others
- Each cell: the attention weight from row token to column token
After Softmax, each row sums to 1 — it is a probability distribution over the sequence.
So Q × K is computing: for every token, what percentage of attention should it give to every other token?
The matrix form is what makes this efficient. We compute all pairwise relationships at once instead of looping.
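A sketch of the score computation for one head, with random Q and K. This includes the usual 1/√d_head scaling from scaled dot-product attention (covered in the earlier derivation chapters):

```python
import torch
import torch.nn.functional as F

seq_len, d_head = 16, 128
Q = torch.randn(seq_len, d_head)   # one head, one sequence
K = torch.randn(seq_len, d_head)

# All pairwise scores at once — no loop over token pairs.
scores = (Q @ K.T) / d_head ** 0.5      # [16, 16]
weights = F.softmax(scores, dim=-1)     # each row becomes a distribution

print(weights.shape)                    # torch.Size([16, 16])
print(weights.sum(dim=-1))              # each row sums to 1 (up to float error)
```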
12.5.3 A Concrete Example
Imagine the model processing "The agent merged the pull request after review."
When attending from "merged":
- "merged" → "agent": maybe 30% (subject of the action)
- "merged" → "pull request": maybe 25% (object of the action)
- "merged" → "review": maybe 20% (context for the action)
- "merged" → remaining tokens: the remaining 25%
These percentages come from Q × K. They tell the model where to look.
12.6 What V Does: Applying the Attention to Content
12.6.1 V as the Carrier
The score matrix from Q × K is a "map" — it says where to look but carries no content itself. V is the content.
In our setup:
- 16 tokens in the sequence
- Each token has a 128-dimensional V vector (in this head's subspace)
12.6.2 What the Multiplication Does
Multiplying the score matrix by V:
(Q × K) × V → shape still 16×128, unchanged
This is the core operation: use the attention percentages to update each token's vector.
Each token's output vector is a weighted sum of all V vectors, where the weights are the attention scores. Tokens that got high attention contribute more of their V content to the output.
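The weighted sum can be sketched with random weights and V; the manual check at the end confirms that each output row really is a weighted sum of V's rows:

```python
import torch
import torch.nn.functional as F

seq_len, d_head = 16, 128
weights = F.softmax(torch.randn(seq_len, seq_len), dim=-1)  # attention weights
V = torch.randn(seq_len, d_head)                            # content vectors

out = weights @ V        # [16, 16] @ [16, 128] → [16, 128]
print(out.shape)         # torch.Size([16, 128])

# Row 0 of the output equals sum_j weights[0, j] * V[j]:
manual = (weights[0].unsqueeze(1) * V).sum(dim=0)
print(torch.allclose(out[0], manual, atol=1e-5))  # True
```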
The original embeddings started as random initializations with no semantic meaning. After this operation — and after thousands of training steps — these values become meaningful: they encode what each token represents in the context of the surrounding sequence.
12.7 Training: The Two Things Being Adjusted
12.7.1 One Step at a Time
During training, each forward pass makes small adjustments. Then the next forward pass makes more adjustments. This continues for tens of thousands of steps (or more, for large models).
12.7.2 A Concrete Example of Token Embedding Updates
Imagine the training corpus includes many sequences containing the word "agent."
First training step: the token embedding for "agent" is random — no meaningful values.
After the first forward pass and backward pass: the embedding shifts slightly toward values that help predict what comes next after "agent" in context.
Second step: the next time "agent" appears, we use the updated embedding from step 1. Another small adjustment follows.
Step N: the embedding for "agent" now encodes rich information — not just "this is the word agent" but "an entity that takes actions, appears in agentic contexts, is often followed by verbs like 'opened' or 'merged'."
Initial "agent" embedding: [random numbers]
After step 1: [slightly adjusted]
After step 2: [more meaningful]
...
After step N: [semantically rich]
This is why we call it an Embedding — the word gets embedded into a meaningful vector space.
12.7.3 Two Kinds of Parameters Being Updated
QKV Attention simultaneously refines two different sets of parameters:
Part 1: Token Embeddings
- The lookup table that maps token IDs to vectors
- Updated so each token's vector captures its meaning in context
- Shared across all layers (one embedding per token ID)
Part 2: Weight Matrices
- Wq, Wk, Wv, Wo — the linear transforms inside Attention
- Updated so the Attention mechanism finds useful relationships
- Separate per block (12 sets for a 12-block model)
These two update each other. Better token embeddings lead to better Attention scores. Better weight matrices lead to better token embedding updates. They converge together.
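A toy check that both parameter sets receive gradients from a single backward pass. The vocabulary size and the loss here are placeholders, not values from our running example:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(100, 512)   # token embedding table (Part 1)
attn = nn.MultiheadAttention(embed_dim=512, num_heads=4, batch_first=True)

tokens = torch.randint(0, 100, (4, 16))  # [batch, seq_len] of token IDs
x = emb(tokens)                          # look up embeddings
out, _ = attn(x, x, x)                   # self-attention
out.sum().backward()                     # stand-in for a real loss

# Both kinds of parameters got gradients in the same backward pass.
print(emb.weight.grad is not None)           # True — embeddings update
print(attn.in_proj_weight.grad is not None)  # True — Wq/Wk/Wv update
```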
12.8 The Full Picture: What Multi-Head Attention Does
12.8.1 Placing It in the Architecture
The Multi-Head Attention module inside each Transformer block does:
- Updates token embeddings: every pass through this module adjusts the embedding vectors based on what surrounds them in the sequence
- Updates weight matrices: Wq, Wk, Wv, and Wo are all trained parameters that improve through backpropagation
12.8.2 Parameter Count
For a 12-block model with d_model = 512:
- Each block: 4 weight matrices × 512² = 4 × 262,144 = 1,048,576 parameters
- 12 blocks: 12 × 1,048,576 ≈ 12.6 million parameters (Attention only)
- Plus: Token Embedding table, FFN layers, Layer Norms, and output projection
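The Attention-only arithmetic as a quick check (embeddings, FFN layers, and norms are excluded, as noted above):

```python
d_model, num_blocks = 512, 12

per_block = 4 * d_model ** 2        # Wq, Wk, Wv, Wo, each [512, 512]
total = num_blocks * per_block

print(per_block)  # 1048576
print(total)      # 12582912 ≈ 12.6 million
```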
After enough training steps, these parameters settle into values that make the model capable of coherent text generation.
12.8.3 Model Scale Reference
| Model | Layers | d_model | Total Parameters |
|---|---|---|---|
| GPT-2 Small | 12 | 768 | 117M |
| GPT-2 Medium | 24 | 1024 | 345M |
| GPT-2 Large | 36 | 1280 | 774M |
| LLaMA-7B | 32 | 4096 | 7B |
| LLaMA-70B | 80 | 8192 | 70B |
Chapter Checklist
After this chapter, you should be able to:
- Describe the four-dimensional shape [batch, heads, seq_len, d_head] and what each dimension means.
- Explain what Concatenate does and why the split-then-merge is useful.
- Explain what Wo does and how weights are shared within and across blocks.
- Describe what Q × K computes (attention percentages).
- Explain what (Q × K) × V does (updating token vectors using attention weights).
- Describe the two things training simultaneously adjusts: token embeddings and weight matrices.
See You in the Next Chapter
That is enough for this chapter. If you can explain what every output dimension of the Attention tensor represents — which head, which token, which subspace — without looking at the diagram, you are ready for Chapter 13.
Chapter 13 covers residual connections and Dropout — the engineering tricks that let deep Transformers train stably. Now that you know what Attention produces, you will immediately see why the residual connection pattern makes sense.