One-sentence summary: Attention is a learned routing system where each token computes its similarity to every other token using dot products, then blends together the most relevant representations.
9.1 Review: What We Know So Far
Before opening up Attention, let's map what we have built:
| Chapter | Concept | Core role |
|---|---|---|
| 4 | Tokenization + Embedding | text → token ID → vector |
| 5 | Positional Encoding | adds location information to each vector |
| 6 | LayerNorm + Softmax | stabilizes activations; converts scores to probabilities |
| 7 | Feed Forward Network | transforms per-token representations; stores learned knowledge |
| 8 | Linear Transforms | matrix multiply = dot product = similarity / projection |
We now have all the prerequisites. This chapter finally opens up the Attention mechanism — the part that makes the Transformer genuinely different from everything that came before it.
9.2 Attention in the Architecture
9.2.1 The Transformer Block Structure
Each Transformer block contains two major sub-layers:
Input
↓
LayerNorm
↓
Masked Multi-Head Attention <- this chapter
↓
Residual connection
↓
LayerNorm
↓
Feed Forward Network (FFN)
↓
Residual connection
↓
Output
Attention is the first sub-layer in each block. It is where the model decides which tokens should influence which other tokens.
9.2.2 The Internal Flow of Scaled Dot-Product Attention
Inside the Attention sub-layer (details in Chapter 10, overview here):
Input X
↓
Project X into Q, K, V using learned weight matrices Wq, Wk, Wv
↓
Compute Q @ Kᵀ (pairwise similarity scores)
↓
Scale by 1/√d_k
↓
Mask (decoder only: prevent attending to future positions)
↓
Softmax (convert scores into attention weights)
↓
Weighted sum of V using attention weights
↓
Concatenate across heads
↓
Output projection Wo
↓
Output
This chapter focuses on the geometric intuition: why dot product, what do the heatmaps show, and what does "attention weight" actually mean. Chapter 10 covers Q, K, and V in detail.
9.3 Why Attention Exists
9.3.1 The Core Problem in Language Understanding
Consider this sentence:
"The agent opened a pull request and the reviewer left a comment on it."
When the model processes "it" at the end, it needs to know that "it" refers to "pull request," not to "reviewer" or "agent." To resolve that reference, the model needs to connect "it" with something several tokens back.
Every word's meaning depends on its relationship to other words. Resolving references, understanding subject-verb agreement, tracking long-range dependencies — all of these require information flow across positions in the sequence.
9.3.2 Why RNNs Struggled with This
Before the Transformer, the standard approach for sequence processing was the Recurrent Neural Network (RNN):
token₁ → token₂ → token₃ → token₄ → token₅ → ...
↘ ↘ ↘
hidden state flows forward
Problems with RNNs:
- Sequential computation: the model must process token 1 before token 2, token 2 before token 3. No parallelism. Training on long sequences is slow.
- Long-range dependency decay: information from token 1 must survive many hidden-state transitions to influence token 100. In practice, it often doesn't. The model forgets long-ago context.
- Gradient problems: backpropagating through long sequences leads to vanishing or exploding gradients, making training difficult.
9.3.3 Attention's Solution
Attention gives every token direct access to every other token:
token₁ token₂ token₃ token₄ token₅
token₁ ↔ ↔ ↔ ↔ ↔
token₂ ↔ ↔ ↔ ↔ ↔
token₃ ↔ ↔ ↔ ↔ ↔
token₄ ↔ ↔ ↔ ↔ ↔
token₅ ↔ ↔ ↔ ↔ ↔
No intermediate state. No distance decay. Token 1 and token 100 are equally reachable from token 50 in a single Attention operation.
The analogy: RNNs process sequences like a phone chain — information passes one person to the next, degrading along the way. Attention is a direct broadcast: every token can hear every other token simultaneously.
9.4 Dot Product as a Similarity Tool
9.4.1 Recap from Chapter 8
In Chapter 8 we established:
The dot product of two vectors is large when they point in similar directions, small when they are unrelated, and negative when they point in opposite directions.
This is exactly the tool Attention needs: a fast, differentiable way to ask "how relevant is token j to token i?"
9.4.2 Finding Related Tokens with Dot Products
Suppose each token has a vector representation (simplified to 3D for illustration):
agent = [0.2, 0.8, 0.3]
opened = [0.3, 0.7, 0.4]
pull = [0.1, 0.9, 0.2]
request = [0.8, 0.2, 0.7]
To find which tokens are most related to "request", compute dot products:
request · agent = 0.8×0.2 + 0.2×0.8 + 0.7×0.3 = 0.16 + 0.16 + 0.21 = 0.53
request · opened = 0.8×0.3 + 0.2×0.7 + 0.7×0.4 = 0.24 + 0.14 + 0.28 = 0.66
request · pull = 0.8×0.1 + 0.2×0.9 + 0.7×0.2 = 0.08 + 0.18 + 0.14 = 0.40
The scores tell us "request" is most similar to "opened" (0.66), then "agent" (0.53), then "pull" (0.40). In this toy example, that might not be the perfect semantic ranking — but the real model learns Q, K, V projections that make these scores meaningful for the task. The point is the mechanism.
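The toy calculation above can be checked with a few lines of NumPy. The 3D vectors are the illustrative values from this section, not real embeddings:

```python
import numpy as np

# Toy 3D "embeddings" from this section (illustrative values, not learned weights)
agent   = np.array([0.2, 0.8, 0.3])
opened  = np.array([0.3, 0.7, 0.4])
pull    = np.array([0.1, 0.9, 0.2])
request = np.array([0.8, 0.2, 0.7])

# Dot product = similarity score between "request" and each other token
for name, vec in [("agent", agent), ("opened", opened), ("pull", pull)]:
    print(f"request · {name} = {request @ vec:.2f}")
# request · agent = 0.53
# request · opened = 0.66
# request · pull = 0.40
```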
9.4.3 Matrix Multiply Computes All Similarities at Once
Computing similarities one pair at a time is slow. Matrix multiplication does them all in one GPU operation:
token matrix [n, d] @ token matrix transposed [d, n] = similarity matrix [n, n]
The (i, j) entry of the result is the dot product between token i and token j. For a sequence of length 512, this produces a 512×512 similarity matrix in one call. GPU kernels are optimized for exactly this operation.
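In code, the all-pairs computation is a single matrix multiply. A minimal NumPy sketch, with random vectors standing in for real token representations:

```python
import numpy as np

n, d = 16, 8                      # sequence length, vector dimension (small for illustration)
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))   # one row per token

# [n, d] @ [d, n] -> [n, n]: entry (i, j) is the dot product of token i and token j
sim = X @ X.T
print(sim.shape)                  # (16, 16)

# Spot-check one entry against the explicit pairwise dot product
assert np.isclose(sim[2, 5], X[2] @ X[5])
```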
9.5 Attention Heatmaps: Visualizing the Similarity Matrix
9.5.1 What Q @ Kᵀ Produces
After computing Q @ Kᵀ and applying Softmax, we get an attention weight matrix. For a sequence of 16 tokens, this is a 16×16 matrix. Each row is a probability distribution over the 16 positions.
This matrix can be visualized as a heatmap:
- X-axis (columns): the Key positions — which token is being attended to.
- Y-axis (rows): the Query positions — which token is doing the attending.
- Color: bright (yellow) = high attention weight; dark = low weight.
9.5.2 What to Look For in a Heatmap
Typical patterns:
- Bright diagonal: each token attends strongly to itself. Expected and sensible.
- Bright off-diagonal cells: a token at row i attends strongly to position j. This suggests the model learned that j's information is relevant for understanding i.
- Uneven column brightness: some tokens receive high attention from many positions — they are important anchors for the whole sequence.
9.5.3 What Heatmaps Do Not Tell You
Attention heatmaps are useful for intuition and debugging. But they are not a complete explanation of what the model "understands."
A single heatmap shows one Attention head in one layer for one input. A typical model has 32 layers and 32 heads per layer — 1,024 attention patterns per forward pass. Different heads specialize in different patterns. Looking at one heatmap and saying "the model pays attention to X" is like looking at one column of a 1,024-column spreadsheet and summarizing the whole thing.
Use heatmaps for intuition. Do not over-interpret them.
9.6 From Similarity Scores to Attention Weights
9.6.1 The Problem with Raw Dot Products
Raw dot product scores have no fixed range:
raw scores: [3.5, -2.1, 8.7, 0.3, ...]
These scores can be positive or negative, and their absolute scale depends on the vector magnitudes. We cannot interpret them as "how much attention to pay" without normalizing.
9.6.2 Softmax to the Rescue
Softmax (Chapter 6) converts arbitrary scores into a valid probability distribution:
Softmax([3.5, -2.1, 8.7, 0.3]) ≈ [0.01, 0.00, 0.99, 0.00]
Now:
- Every weight is between 0 and 1.
- All weights sum to 1.
- The highest raw score gets the largest weight.
- The interpretation is clear: "99% of the attention budget goes to position 2."
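A minimal softmax sketch reproducing the numbers above. Subtracting the maximum before exponentiating is the standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; mathematically the result is unchanged
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

weights = softmax(np.array([3.5, -2.1, 8.7, 0.3]))
print(np.round(weights, 2))       # [0.01 0.   0.99 0.  ]
assert np.isclose(weights.sum(), 1.0)
```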
9.6.3 The Scaling Step: Why Divide by √d_key
In the full formula, there is a scaling step before Softmax:
scaled scores = (Q @ Kᵀ) / √d_k
Why divide by √d_k?
When vectors have high dimension (say d_k = 512), dot products can become very large. Each of the 512 terms contributes a product of two numbers; the sum can easily reach hundreds.
Large inputs to Softmax cause the distribution to become extremely peaked:
Softmax([100, 50, 40]) ≈ [1.0, 0.0, 0.0] <- almost deterministic
Softmax([5.0, 2.5, 2.0]) ≈ [0.88, 0.07, 0.04] <- healthier
When Softmax is almost-deterministic, gradients nearly vanish. Training slows or stops.
Dividing by √d_k (e.g., √512 ≈ 22.6) brings the scores back to a reasonable scale before Softmax. It is a cheap but important normalization.
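The effect of the scaling step can be seen directly. A small sketch with hypothetical unscaled scores and d_k = 512, as in the text:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

d_k = 512
raw = np.array([100.0, 50.0, 40.0])    # hypothetical large unscaled scores

print(np.round(softmax(raw), 3))                 # [1. 0. 0.]  almost deterministic
print(np.round(softmax(raw / np.sqrt(d_k)), 3))  # a healthier, less peaked spread
```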
9.7 The Complete Attention Formula
9.7.1 The Formula
Attention(Q, K, V) = Softmax(Q @ Kᵀ / √d_k) @ V
9.7.2 Step-by-Step Breakdown
Step 1: Q @ Kᵀ
- Shape: [seq_len, d_k] @ [d_k, seq_len] = [seq_len, seq_len]
- Each entry (i, j) is the dot product between Query token i and Key token j.
- Interpretation: "How much does token i want to attend to token j?"
Step 2: / √d_k
- Scalar division.
- Keeps the scores in a range where Softmax has healthy gradients.
Step 3: Softmax
- Applied row-by-row.
- Each row becomes a probability distribution over the sequence.
- Entry (i, j) becomes the attention weight: "What fraction of token i's attention budget goes to token j?"
Step 4: × V
- Shape: [seq_len, seq_len] @ [seq_len, d_v] = [seq_len, d_v]
- Each output token is a weighted sum of all Value vectors.
- Tokens with high attention weight contribute more to the output.
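The four steps compose into one short function. A minimal NumPy sketch of scaled dot-product attention (single head, no masking; random Q, K, V stand in for the learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # Steps 1-2: similarities, scaled
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)                        # Step 3: row-wise softmax
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights                     # Step 4: blend the Value vectors

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 8, 8
Q, K, V = (rng.standard_normal((seq_len, n)) for n in (d_k, d_k, d_v))

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)                          # (5, 8)
assert np.allclose(w.sum(axis=-1), 1.0)   # every row is a probability distribution
```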
9.7.3 An Analogy: Search with Weighted Results
Think of Attention as a search system:
- Query (Q): the search query — "what am I looking for?"
- Key (K): the index entry for each document — "what does this token advertise?"
- Similarity (Q @ Kᵀ): matching scores — how well does each token match the query?
- Softmax: normalize scores into a distribution over results.
- Value (V): the actual content of each document — "what information do I contribute if selected?"
- Output (attention_weights @ V): a blend of all documents, weighted by relevance.
Unlike a search engine that returns discrete ranked results, Attention returns a soft weighted blend. Every token contributes something; the weights just determine how much.
9.8 Why Dot Product Specifically?
9.8.1 Computational Efficiency
Dot product is expressible as matrix multiplication. Matrix multiplication on modern GPUs is extraordinarily well-optimized — libraries like cuBLAS and Tensor Cores are built for exactly this. A single Q @ Kᵀ call computes all seq_len² pairwise similarities in parallel.
# One line computes all pairwise similarities
attention_scores = Q @ K.transpose(-2, -1)
Alternative similarity functions (Euclidean distance, cosine similarity with explicit normalization) are more expensive and not as naturally expressed as a single matrix multiply.
9.8.2 Geometric Clarity
From Chapter 8: dot product measures vector alignment. Query "asks a question" in a certain direction. Key "answers" in some direction. Their dot product says whether the answer direction matches the question direction.
This is not just a convenient metaphor. The model actually learns Wq and Wk matrices such that Q vectors that should match specific K vectors end up pointing in similar directions. The geometry is real.
9.8.3 Learned Flexibility
Although the dot product operation is fixed, Q, K, and V are learned projections of the input:
Q = X @ Wq (shape: [seq_len, d_k])
K = X @ Wk (shape: [seq_len, d_k])
V = X @ Wv (shape: [seq_len, d_v])
The model learns Wq, Wk, Wv during training. This means the model can learn:
- Which aspects of a token's representation should be used when asking a question (Q).
- Which aspects should be advertised to other tokens (K).
- What information to contribute when selected (V).
The dot product itself is a fixed operation, but the projected spaces it operates in are fully learned. This combination of a simple fixed operation with rich learned projections is what makes Attention so powerful.
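A sketch of the three projections in NumPy. Random weight matrices stand in for the learned Wq, Wk, Wv, and the dimensions are illustrative:

```python
import numpy as np

seq_len, d_model, d_k, d_v = 6, 16, 8, 8
rng = np.random.default_rng(0)

X  = rng.standard_normal((seq_len, d_model))  # token representations
Wq = rng.standard_normal((d_model, d_k))      # learned during training in a real model
Wk = rng.standard_normal((d_model, d_k))
Wv = rng.standard_normal((d_model, d_v))

# Same input X, three different learned views of it
Q, K, V = X @ Wq, X @ Wk, X @ Wv
print(Q.shape, K.shape, V.shape)              # (6, 8) (6, 8) (6, 8)
```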
9.9 Self-Attention vs. Cross-Attention
9.9.1 Self-Attention
In a decoder-only model (GPT, LLaMA, Claude), Q, K, and V all come from the same input sequence:
input: "The agent opened a pull request."
Q = input @ Wq
K = input @ Wk
V = input @ Wv
Every token attends to every other token in the same sequence. This is Self-Attention: the sequence asks questions about itself.
9.9.2 Causal Masking in Decoder Self-Attention
In a language model, the model should not be able to see future tokens when predicting the current one. If the model is generating token 5, it must not attend to tokens 6, 7, 8, ...
This is enforced by a causal mask: before Softmax, set the attention scores for all future positions to -∞. After Softmax, those positions get weight ≈ 0 and effectively do not exist.
Masked attention matrix for a 5-token sequence (lower triangle):
token 1 attends to: [1]
token 2 attends to: [1, 2]
token 3 attends to: [1, 2, 3]
token 4 attends to: [1, 2, 3, 4]
token 5 attends to: [1, 2, 3, 4, 5]
Positions above the diagonal are masked out.
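A sketch of how the mask is applied in code: set the scores above the diagonal to -inf before Softmax, so those positions receive exactly zero weight. Uniform zero scores stand in for real attention scores here:

```python
import numpy as np

seq_len = 5
scores = np.zeros((seq_len, seq_len))  # placeholder raw attention scores

# Causal mask: position i may attend only to positions <= i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Row-wise softmax: exp(-inf) = 0, so future positions vanish
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
# Row i is uniform over positions 0..i; positions above the diagonal get weight 0
assert np.allclose(weights[0], [1, 0, 0, 0, 0])
```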
9.9.3 Cross-Attention
In encoder-decoder models (original Transformer, translation models), Q comes from the decoder sequence and K, V come from the encoder's output:
Encoder input: "The agent opened a pull request."
Decoder input: "L'agent a ouvert une"
Q = decoder_hidden @ Wq
K = encoder_output @ Wk
V = encoder_output @ Wv
The decoder asks questions about the encoder's representation. This is Cross-Attention.
9.9.4 This Book's Focus
This book focuses on decoder-only models — GPT, LLaMA, and similar architectures — because that is the shape most engineers encounter first when working with modern LLMs. The Self-Attention with causal masking is what powers ChatGPT, Claude, and Gemini. Chapter 10 digs deeper into the QKV details.
9.10 Chapter Summary
9.10.1 Key Concepts
| Concept | Explanation |
|---|---|
| Attention | each token directly attends to all other tokens via learned similarity |
| Dot product | measures vector alignment; the core similarity operation |
| Q @ Kᵀ | computes all pairwise token similarities in one matrix multiply |
| Scaling | divide by √d_k to prevent Softmax collapse when scores are large |
| Softmax | converts raw similarity scores into attention weights (probability distribution) |
| Value-weighted sum | final output blends V vectors proportional to attention weights |
| Self-Attention | Q, K, V from the same sequence |
| Cross-Attention | Q from one sequence, K/V from another |
| Causal mask | prevents decoder from attending to future positions |
9.10.2 The Attention Formula
Attention(Q, K, V) = Softmax(Q @ Kᵀ / √d_k) @ V
9.10.3 Core Takeaway
Attention is dot-product similarity applied to learned projections. Q asks a question, K advertises an answer, their dot product scores the match, Softmax converts scores into weights, and those weights blend the V representations. The model learns which projections make meaningful Q-K matches. That is the whole mechanism.
Chapter Checklist
After this chapter, you should be able to:
- Explain why Attention gives every token direct access to every other token, and why that matters for long-range dependencies.
- Explain why dot product is used to measure token-to-token similarity.
- Trace through the four steps of the Attention formula: Q @ Kᵀ, scale, Softmax, and weighted sum of V.
- Explain why we divide by √d_k before Softmax.
- Distinguish Self-Attention (same sequence) from Cross-Attention (two sequences).
- Explain what causal masking does in a decoder model.
See You in the Next Chapter
That covers the geometry of Attention. If you can explain the full pipeline — dot product scores, scaling, Softmax weights, blended Values — without looking at the formula, you are ready for what comes next.
Chapter 10 answers the question this chapter deliberately left open: what exactly are Q, K, and V? Where do they come from? What are the weight matrices Wq, Wk, and Wv learning? And why does splitting into multiple heads help?