One-sentence summary: QKV Attention is not predicting anything — it is continuously adjusting each token's embedding vector so that it becomes meaningful in context.
12.1 A Short Chapter With a Big Payoff
This chapter wraps up Multi-Head Attention. In the previous chapters, we derived the output tensor A step by step. You should have a clear picture of how it is computed.
What many explanations skip is what A actually means — and what the model is doing when it produces it.
Three things to cover:
- Concatenate: merging multiple heads back into a single tensor
- Linear transform Wo: the final matrix multiplication
- The essence of training: what QKV is actually adjusting
Once these are clear, Layer Normalization and residual connections become much easier to understand.
12.2 The Shape of A: A Four-Dimensional Tensor
12.2.1 Attention Visualization
The diagram shows the core Attention process:
- Q (Query): the query vector for the current token being processed
- K (Key): the key matrix for all tokens in the context
- V (Value): the value matrix for all tokens in the context
Q dot-products against each row of K to produce attention weights, then those weights blend the rows of V. That blend is the output for this token: Q queries K, finds relevant positions, retrieves and aggregates from V.
12.2.2 Shape Breakdown
The Multi-Head Attention output A is a four-dimensional tensor:
A: [4, 4, 16, 128]
│ │ │ └── per-head dimension (d_head = 512 / 4 = 128)
│ │ └────── sequence length (seq_len = 16)
│ └────────── number of heads (num_heads = 4)
└───────────── batch size (batch_size = 4)
Breaking it down:
- First 4: 4 sequences in the batch
- Second 4: the 512-dimensional model width split into 4 heads
- 16: 16 tokens per sequence
- 128: each head's subspace dimension
The actual computation happens on each [16, 128] slice — one per head per sequence. The four-dimensional shape is just the packaging.
┌─────────────────────────────────────┐
│ Sequence 1 │
│ ┌──────┬──────┬──────┬──────┐ │
│ │ Head1│ Head2│ Head3│ Head4│ │
│ │16×128│16×128│16×128│16×128│ │
│ └──────┴──────┴──────┴──────┘ │
├─────────────────────────────────────┤
│ Sequence 2 (same structure) │
├─────────────────────────────────────┤
│ Sequence 3 (same structure) │
├─────────────────────────────────────┤
│ Sequence 4 (same structure) │
└─────────────────────────────────────┘
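The shape bookkeeping above can be checked directly in PyTorch. This is a minimal sketch: the tensor values are random stand-ins, and only the shapes matter.

```python
import torch

batch_size, num_heads, seq_len, d_head = 4, 4, 16, 128

# Stand-in for the Multi-Head Attention output A.
A = torch.randn(batch_size, num_heads, seq_len, d_head)

print(A.shape)        # torch.Size([4, 4, 16, 128])

# The actual computation happens on each [16, 128] slice:
# one head's view of one sequence.
print(A[0, 0].shape)  # torch.Size([16, 128])
```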
12.2.3 Real Model Sizes
For GPT-2 (117M parameters):
d_model = 768, num_heads = 12, d_head = 64
For LLaMA-7B:
d_model = 4096, num_heads = 32, d_head = 128
The same four-dimensional structure, just much larger in practice.
12.3 Concatenate: Merging Heads Back Together
12.3.1 The Merge Operation
Concatenate (often called "concat") is the step that reassembles the heads into a single tensor. It is the inverse of the split from Chapter 11.
Before concat: [4, 4, 16, 128] → 4 heads, each 128-dimensional
After concat: [4, 16, 512] → 1 unified 512-dimensional tensor
We move the head axis next to d_head (giving [4, 16, 4, 128]) and then merge those last two dimensions [4, 128] into [512]. That is it: a transpose followed by a reshape.
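In PyTorch this merge is a transpose followed by a view. A sketch with random data, using the shape constants from our running example:

```python
import torch

A = torch.randn(4, 4, 16, 128)  # [batch, heads, seq_len, d_head]

# Step 1: move the head axis next to d_head → [4, 16, 4, 128]
# Step 2: flatten heads × d_head into one 512-dim axis → [4, 16, 512]
merged = A.transpose(1, 2).contiguous().view(4, 16, 4 * 128)

print(merged.shape)  # torch.Size([4, 16, 512])
```

The .contiguous() call is needed because view requires the tensor's memory layout to match its new shape after the transpose.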
12.3.2 Why Split and Merge at All?
The question I had when I first learned this: why cut the vector into pieces, compute Attention on each piece, then glue them back? What does the detour accomplish?
The answer is multi-perspective representation.
Each head operates in a different subspace of the full model dimension. They do not share parameters. Over training, they tend to specialize:
- Head 1 might learn syntax-sensitive patterns
- Head 2 might learn semantic similarity
- Head 3 might learn positional proximity
- Head 4 might learn topic continuity
Splitting forces this specialization. Merging allows those specialized views to inform a single representation.
A tradeoff to keep in mind: more heads means richer representational capacity, but also more parameters and more computation. The empirical sweet spot is usually
d_head = 64 or d_head = 128. There is no theoretical formula for the right number of heads — it is tuned experimentally.
12.4 Wo: The Final Linear Transform
12.4.1 What Wo Is
After concatenation, one final matrix multiplication:
Shape of Wo: [512, 512]
Operation: A @ Wo → final output
Wo (the Output weight matrix) is structurally identical to Wq, Wk, and Wv:
- Shape: [d_model, d_model] = [512, 512]
- Initialization: random
- Type: trainable parameters
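A minimal sketch of the Wo step. Here Wo is a plain random tensor for illustration; in a real model it would be a trainable parameter (typically an nn.Linear layer).

```python
import torch

d_model = 512
Wo = torch.randn(d_model, d_model)      # stand-in; trainable in a real model

A_concat = torch.randn(4, 16, d_model)  # output of the concat step
out = A_concat @ Wo                     # the final linear transform

print(out.shape)  # torch.Size([4, 16, 512]) — shape unchanged, content mixed
```

Note that Wo does not change the shape; its job is to mix information across the four head subspaces into one coherent 512-dimensional representation.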
12.4.2 Weight Sharing Rules
This confused me when I was learning, so I want to be explicit.
Within one Transformer block: all heads share one Wq, one Wk, one Wv, and there is one Wo. The heads are not separate modules — they share the projection matrices (via the reshape trick from Chapter 11), and Wo recombines their outputs.
Across Transformer blocks: each block has its own independent set of Wq, Wk, Wv, and Wo. A 12-block model has 12 separate sets of these matrices, each learning something slightly different.
Block 1: Wq₁, Wk₁, Wv₁, Wo₁ ← first set
Block 2: Wq₂, Wk₂, Wv₂, Wo₂ ← second set
...
Block 12: Wq₁₂, Wk₁₂, Wv₁₂, Wo₁₂ ← twelfth set
Each block computes its own Attention with its own weights, refining the representation one level deeper.
12.4.3 In PyTorch
When using PyTorch's nn.MultiheadAttention, the weight matrices are handled internally:
self.attn = nn.MultiheadAttention(embed_dim=512, num_heads=4)
# Internally:
# self.attn.in_proj_weight → packs Wq, Wk, Wv together
# self.attn.out_proj.weight → this is Wo
The Hugging Face transformers library wraps this further, but the same four matrices are there.
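You can verify the packing yourself by inspecting the module's weight shapes:

```python
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=4)

# Wq, Wk, Wv stacked along dim 0: [3 * 512, 512]
print(attn.in_proj_weight.shape)   # torch.Size([1536, 512])

# Wo:
print(attn.out_proj.weight.shape)  # torch.Size([512, 512])
```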
12.5 What Q × K Is Actually Computing
12.5.1 The Score Matrix
Let's revisit what Q multiplied by K (more precisely, Q @ Kᵀ) produces. Taking the first batch, first head, we get a 16×16 square matrix.
Token1 Token2 Token3 ... Token16
Token1 [ 0.20 0.10 0.05 ... 0.01 ]
Token2 [ 0.15 0.30 0.10 ... 0.02 ]
Token3 [ 0.08 0.12 0.25 ... 0.03 ]
...
Token16 [ 0.01 0.02 0.01 ... 0.40 ]
12.5.2 Geometric Intuition
How to read this matrix:
- Each row: one token's attention perspective
- Each column: one token's visibility to others
- Each cell: the attention weight from row token to column token
After Softmax, each row sums to 1 — it is a probability distribution over the sequence.
So Q × K is computing: for every token, what percentage of attention should it give to every other token?
The matrix form is what makes this efficient. We compute all pairwise relationships at once instead of looping.
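A sketch of the score computation for one head, with random Q and K. This includes the usual 1/√d_head scaling from scaled dot-product attention (covered in the earlier derivation chapters):

```python
import torch
import torch.nn.functional as F

seq_len, d_head = 16, 128
Q = torch.randn(seq_len, d_head)   # one head, one sequence
K = torch.randn(seq_len, d_head)

# All pairwise scores at once — no loop over token pairs.
scores = (Q @ K.T) / d_head ** 0.5      # [16, 16]
weights = F.softmax(scores, dim=-1)     # each row becomes a distribution

print(weights.shape)                    # torch.Size([16, 16])
print(weights.sum(dim=-1))              # each row sums to 1 (up to float error)
```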
12.5.3 A Concrete Example
Imagine the model processing "The agent merged the pull request after review."
When attending from "merged":
- "merged" → "agent": maybe 30% (subject of the action)
- "merged" → "pull request": maybe 25% (object of the action)
- "merged" → "review": maybe 20% (context for the action)
- "merged" → remaining tokens: the remaining 25%
These percentages come from Q × K. They tell the model where to look.
12.6 What V Does: Applying the Attention to Content
12.6.1 V as the Carrier
The score matrix from Q × K is a "map" — it says where to look but carries no content itself. V is the content.
In our setup:
- 16 tokens in the sequence
- Each token has a 128-dimensional V vector (in this head's subspace)
12.6.2 What the Multiplication Does
Multiplying the score matrix by V:
(Q × K) × V → shape still 16×128, unchanged
This is the core operation: use the attention percentages to update each token's vector.
Each token's output vector is a weighted sum of all V vectors, where the weights are the attention scores. Tokens that got high attention contribute more of their V content to the output.
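The weighted sum can be sketched with random weights and V; the manual check at the end confirms that each output row really is a weighted sum of V's rows:

```python
import torch
import torch.nn.functional as F

seq_len, d_head = 16, 128
weights = F.softmax(torch.randn(seq_len, seq_len), dim=-1)  # attention weights
V = torch.randn(seq_len, d_head)                            # content vectors

out = weights @ V        # [16, 16] @ [16, 128] → [16, 128]
print(out.shape)         # torch.Size([16, 128])

# Row 0 of the output equals sum_j weights[0, j] * V[j]:
manual = (weights[0].unsqueeze(1) * V).sum(dim=0)
print(torch.allclose(out[0], manual, atol=1e-5))  # True
```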
The original embeddings started as random initializations with no semantic meaning. After this operation — and after thousands of training steps — these values become meaningful: they encode what each token represents in the context of the surrounding sequence.
12.7 Training: The Two Things Being Adjusted
12.7.1 One Step at a Time
During training, each forward pass makes small adjustments. Then the next forward pass makes more adjustments. This continues for tens of thousands of steps (or more, for large models).
12.7.2 A Concrete Example of Token Embedding Updates
Imagine the training corpus includes many sequences containing the word "agent."
First training step: the token embedding for "agent" is random — no meaningful values.
After the first forward pass and backward pass: the embedding shifts slightly toward values that help predict what comes next after "agent" in context.
Second step: the next time "agent" appears, we use the updated embedding from step 1. Another small adjustment follows.
Step N: the embedding for "agent" now encodes rich information — not just "this is the word agent" but "an entity that takes actions, appears in agentic contexts, is often followed by verbs like 'opened' or 'merged'."
Initial "agent" embedding: [random numbers]
After step 1: [slightly adjusted]
After step 2: [more meaningful]
...
After step N: [semantically rich]
This is why we call it an Embedding — the word gets embedded into a meaningful vector space.
12.7.3 Two Kinds of Parameters Being Updated
QKV Attention simultaneously refines two different sets of parameters:
Part 1: Token Embeddings
- The lookup table that maps token IDs to vectors
- Updated so each token's vector captures its meaning in context
- Shared across all layers (one embedding per token ID)
Part 2: Weight Matrices
- Wq, Wk, Wv, Wo — the linear transforms inside Attention
- Updated so the Attention mechanism finds useful relationships
- Separate per block (12 sets for a 12-block model)
These two update each other. Better token embeddings lead to better Attention scores. Better weight matrices lead to better token embedding updates. They converge together.
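A toy check that both parameter sets receive gradients from a single backward pass. The vocabulary size and the loss here are placeholders, not values from our running example:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(100, 512)   # token embedding table (Part 1)
attn = nn.MultiheadAttention(embed_dim=512, num_heads=4, batch_first=True)

tokens = torch.randint(0, 100, (4, 16))  # [batch, seq_len] of token IDs
x = emb(tokens)                          # look up embeddings
out, _ = attn(x, x, x)                   # self-attention
out.sum().backward()                     # stand-in for a real loss

# Both kinds of parameters got gradients in the same backward pass.
print(emb.weight.grad is not None)           # True — embeddings update
print(attn.in_proj_weight.grad is not None)  # True — Wq/Wk/Wv update
```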
12.8 The Full Picture: What Multi-Head Attention Does
12.8.1 Placing It in the Architecture
The Multi-Head Attention module inside each Transformer block does:
- Updates token embeddings: every pass through this module adjusts the embedding vectors based on what surrounds them in the sequence
- Updates weight matrices: Wq, Wk, Wv, and Wo are all trained parameters that improve through backpropagation
12.8.2 Parameter Count
For a 12-block model with d_model = 512:
- Each block: 4 weight matrices × 512² = 4 × 262,144 = 1,048,576 parameters
- 12 blocks: 12 × 1,048,576 ≈ 12.6 million parameters (Attention only)
- Plus: Token Embedding table, FFN layers, Layer Norms, and output projection
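The Attention-only arithmetic as a quick check (embeddings, FFN layers, and norms are excluded, as noted above):

```python
d_model, num_blocks = 512, 12

per_block = 4 * d_model ** 2        # Wq, Wk, Wv, Wo, each [512, 512]
total = num_blocks * per_block

print(per_block)  # 1048576
print(total)      # 12582912 ≈ 12.6 million
```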
After enough training steps, these parameters settle into values that make the model capable of coherent text generation.
12.8.3 Model Scale Reference
| Model | Layers | d_model | Total Parameters |
|---|---|---|---|
| GPT-2 Small | 12 | 768 | 117M |
| GPT-2 Medium | 24 | 1024 | 345M |
| GPT-2 Large | 36 | 1280 | 774M |
| LLaMA-7B | 32 | 4096 | 7B |
| LLaMA-70B | 80 | 8192 | 70B |
Chapter Checklist
After this chapter, you should be able to:
- Describe the four-dimensional shape [batch, heads, seq_len, d_head] and what each dimension means.
- Explain what Concatenate does and why the split-then-merge is useful.
- Explain what Wo does and how weights are shared within and across blocks.
- Describe what Q × K computes (attention percentages).
- Explain what (Q × K) × V does (updating token vectors using attention weights).
- Describe the two things training simultaneously adjusts: token embeddings and weight matrices.
See You in the Next Chapter
That is enough for this chapter. If you can explain what every output dimension of the Attention tensor represents — which head, which token, which subspace — without looking at the diagram, you are ready for Chapter 13.
Chapter 13 covers residual connections and Dropout — the engineering tricks that let deep Transformers train stably. Now that you know what Attention produces, you will immediately see why the residual connection pattern makes sense.