One-sentence summary: Matrix multiplication has a geometric meaning — it projects vectors — and once you see that, the dot product at the heart of Attention is no longer mysterious.
8.1 Why Learn This Before Attention?
Attention is full of matrix multiplication. If you only think about it as rows times columns, the QKV mechanics will feel like symbol-pushing. If you see it as geometry, the architecture clicks.
Here is a map of where matrix multiplication appears in the Transformer:
- Embedding lookup (token ID → d_model-dimensional vector)
- Q, K, V projection matrices in Attention
- FFN expand and contract layers
- Final vocabulary projection (LM Head)
Matrix multiplication is everywhere. Understanding it geometrically is the single highest-leverage thing you can do before Chapter 9.
8.2 Scalars, Vectors, and Matrices
Before going further, let's fix the vocabulary.
8.2.1 Scalar
A scalar is a single number.
5
Temperature, learning rate, attention score at one position — these are all scalars.
8.2.2 Vector
A vector is an ordered list of numbers.
[3, 2, 9, 84]
Vectors can represent almost anything: a 3D position [x, y, z], an RGB color [255, 128, 0], or a token's semantic representation in a 4096-dimensional space. The key property is that the order matters.
8.2.3 Matrix
A matrix is a 2D table of numbers.
3 × 4 matrix:
┌─────────────────┐
│ □ □ □ □ │
│ □ □ □ □ │
│ □ □ □ □ │
└─────────────────┘
You can think of a matrix as a stack of row vectors, or equivalently as a collection of column vectors.
8.2.4 In the Transformer
| Object | Example |
|---|---|
| Scalar | learning rate, temperature, one attention score |
| Vector | one token's embedding (shape: [d_model]) |
| Matrix | all token embeddings at once (shape: [seq_len, d_model]) or a weight matrix [d_model, d_model] |
The token representation flowing through the Transformer is fundamentally a matrix of shape [seq_len, d_model] — one row per token, one column per feature dimension.
8.3 Matrix Multiplication: The Computation
8.3.1 Dimension Rule
The dimension rule for matrix multiplication:
[A, B] × [B, C] = [A, C]
The inner dimensions must match (both B). The output shape is the two outer dimensions.
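The rule is easy to check with NumPy (the shapes here are illustrative, not from the text):

```python
import numpy as np

# [A, B] x [B, C] = [A, C]: the inner dimensions must match.
A = np.random.rand(4, 3)   # shape [4, 3]
B = np.random.rand(3, 5)   # shape [3, 5]
C = A @ B                  # inner 3s match; result shape is [4, 5]
print(C.shape)             # (4, 5)

# Mismatched inner dimensions fail immediately.
try:
    np.random.rand(4, 3) @ np.random.rand(4, 5)   # inner dims 3 and 4 differ
except ValueError:
    print("inner dimensions must match")
```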
8.3.2 Worked Example
Let's compute a [4, 3] × [3, 4] multiplication. The result is [4, 4].
For the first element of the result (row 0, column 0):
row 0 of left matrix: [0.2, 0.4, 0.5]
col 0 of right matrix: [2, 1, 7]
dot product: 0.2×2 + 0.4×1 + 0.5×7
= 0.4 + 0.4 + 3.5
= 4.3
The fundamental operation is the dot product: multiply corresponding elements and sum.
In Python/NumPy/PyTorch:
C = A @ B # @ is the matrix multiplication operator
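To check the worked example, fill in row 0 of the left matrix and column 0 of the right matrix with the values from the text (the remaining entries are arbitrary placeholders):

```python
import numpy as np

A = np.zeros((4, 3))
B = np.zeros((3, 4))
A[0] = [0.2, 0.4, 0.5]    # row 0 of the left matrix
B[:, 0] = [2, 1, 7]       # column 0 of the right matrix

C = A @ B                 # result shape [4, 4]
print(round(C[0, 0], 1))  # 0.2*2 + 0.4*1 + 0.5*7 = 4.3
```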
8.3.3 Why "Dot Product"?
The name comes from the mathematical notation A · B. For two vectors of the same length:
A · B = a₁b₁ + a₂b₂ + a₃b₃ + ... + aₙbₙ
A matrix multiply is just many dot products organized into a grid.
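You can verify that claim directly: every element C[i, j] of a matrix product is the dot product of row i of A with column j of B.

```python
import numpy as np

A = np.arange(12, dtype=float).reshape(4, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
C = A @ B

# Rebuild C one dot product at a time and compare.
for i in range(4):
    for j in range(4):
        assert np.isclose(C[i, j], np.dot(A[i, :], B[:, j]))
```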
8.4 Two Ways to Think About the Same Operation
The same operation has two useful frames.
8.4.1 Frame One: Dot Product (Matrix × Matrix)
[4, 3] × [3, 4] = [4, 4]
Two matrices multiply. Each element of the output is a dot product between a row of the left matrix and a column of the right matrix.
This frame is useful when both operands contain data — for example, computing all pairwise similarities between token vectors.
8.4.2 Frame Two: Linear Transformation (Matrix × Vector)
[4, 3] × [3, 1] = [4, 1]
A weight matrix transforms a single vector: input dimension changes from 3 to 4.
This frame is useful when one operand is data and the other is a learned weight matrix. The weight matrix defines a learned transformation of the vector space.
8.4.3 Linear Transformation Intuition
"Linear transformation" sounds technical. The geometric idea is simple:
A weight matrix moves a vector from one space to another — possibly changing its dimension, rotating it, stretching it, or projecting it down.
In the Transformer:
- The embedding table maps token IDs (integers) into d_model-dimensional space.
- The Q, K, V weight matrices move d_model vectors into a different d_model (or d_key) space, emphasizing different aspects.
- The FFN expand layer moves vectors from d_model into 4 × d_model space.
Linear transformations are everywhere because vectors in different "views" of the same data are what the model learns to compare.
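A minimal sketch of frame two: a weight matrix mapping a 3-dimensional vector into 4 dimensions. The numbers are made up for illustration; in a real model W would be learned.

```python
import numpy as np

W = np.random.rand(4, 3)         # weight matrix: maps 3-dim -> 4-dim
x = np.array([1.0, 2.0, 3.0])    # input vector, shape [3]

y = W @ x                        # transformed vector, shape [4]
print(y.shape)                   # (4,)
```

This is exactly the FFN expand-layer pattern, just with tiny dimensions: there, W has shape [4 × d_model, d_model].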
8.5 Geometric Meaning: Vector Space
Now for the part that makes Attention click.
8.5.1 Word Vectors in 3D
Suppose we have a simplified vocabulary with three tokens, each represented as a 3D vector:
cat = [7, 7, 6]
fish = [6, 4, 5]
love = [-4, -2, 1]
Plot these as arrows from the origin in 3D space:
- cat and fish point in roughly the same direction: they are both concrete nouns.
- love points in a very different direction: it is an abstract verb.
This direction similarity is meaningful. The model learns to place semantically related tokens in similar directions.
8.5.2 Matrix Multiplication Computes Similarity
Look at what happens when we multiply the full token matrix by a single vector:
token matrix [n, d] @ query vector [d, 1] = similarity scores [n, 1]
Each element of the output is the dot product between one token's vector and the query. The dot product is large when the two vectors point in similar directions, and small (or negative) when they point in opposite directions.
A concrete example with three tokens scored against the query token "PR":
| Token pair | Dot product score | Interpretation |
|---|---|---|
| agent · PR | 100 | high similarity — both central to a code review workflow |
| merged · PR | 100 | high similarity — merge is the direct action on a PR |
| playlist · PR | 22 | low similarity — unrelated domain |
The model learns these directions from training. The numbers are not hand-coded; they emerge from exposure to text where agent, merged, and PR appear together frequently, while playlist does not.
This is why matrix multiplication shows up inside Attention. One operation computes all pairwise similarities between tokens.
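Using the 3D word vectors from Section 8.5.1, one matrix multiply scores every token against a query at once (here the query is simply cat's own vector):

```python
import numpy as np

tokens = np.array([
    [ 7.0,  7.0, 6.0],   # cat
    [ 6.0,  4.0, 5.0],   # fish
    [-4.0, -2.0, 1.0],   # love
])
query = np.array([7.0, 7.0, 6.0])   # query: cat's vector

scores = tokens @ query             # one dot product per token
print(scores)                       # cat: 134.0, fish: 100.0, love: -36.0
```

fish scores high against cat (similar direction), love scores negative (roughly opposite direction).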
8.5.3 The d_model Dimension
d_model is the number of dimensions in each token's representation:
| Model | d_model |
|---|---|
| GPT-2 Small | 768 |
| GPT-2 Large | 1,280 |
| GPT-3 | 12,288 |
| LLaMA-7B | 4,096 |
More dimensions means a richer representation — more "directions" available to encode distinctions. It also means larger weight matrices and more computation.
8.6 Dot Product as Cosine Similarity
8.6.1 The Angle Between Vectors
The dot product relates to the angle between vectors through a formula:
A · B = |A| × |B| × cos(θ)
Rearranging:
cos(θ) = (A · B) / (|A| × |B|)
Where:
- |A| is the length (magnitude) of vector A.
- |B| is the length of vector B.
- θ is the angle between them.
8.6.2 Geometric Intuitions
| Situation | cos(θ) | dot product | Interpretation |
|---|---|---|---|
| Same direction | ≈ 1 | large positive | very similar |
| 90° apart | 0 | ≈ 0 | unrelated |
| Opposite directions | -1 | negative | opposing |
This gives us a clean geometric reading of the dot product: it measures how much two vectors agree in direction.
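The three rows of the table can be checked with unit vectors:

```python
import numpy as np

a = np.array([1.0, 0.0])

print(np.dot(a, [ 1.0, 0.0]))   # same direction:      1.0
print(np.dot(a, [ 0.0, 1.0]))   # 90 degrees apart:    0.0
print(np.dot(a, [-1.0, 0.0]))   # opposite direction: -1.0
```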
8.6.3 A Concrete Example
A = "this" = [3, 5]
B = "a" = [1, 4]
Compute:
A · B = 3×1 + 5×4 = 3 + 20 = 23
|A| = √(9 + 25) = √34 ≈ 5.83
|B| = √(1 + 16) = √17 ≈ 4.12
cos(θ) = 23 / (5.83 × 4.12) ≈ 23 / 24.0 ≈ 0.96
These two vectors have a cosine similarity of 0.96 — nearly parallel, highly similar.
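The same computation in code, using the vectors for "this" and "a":

```python
import numpy as np

A = np.array([3.0, 5.0])   # "this"
B = np.array([1.0, 4.0])   # "a"

cos_theta = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(round(cos_theta, 2))   # 0.96
```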
8.6.4 This Is the Core of Attention
In Attention:
- A Query vector asks: "What am I looking for?"
- A Key vector says: "Here is what I contain."
- Their dot product measures whether the Query's question matches the Key's advertisement.
High dot product → high similarity → high attention weight after Softmax.
Attention is dot-product similarity applied to learned vector representations. Everything else is engineering around this idea.
8.7 Projection: A Second Geometric View
8.7.1 What Projection Means
The dot product has a second geometric interpretation: projection.
A · B = |A| × (length of B's shadow projected onto A's direction)
Or equivalently:
A · B = |B| × (length of A's shadow projected onto B's direction)
Projection asks: how much of one vector's "content" lies in the direction of another?
8.7.2 The Projection Picture
In a 2D sketch:
- Draw vector A (red arrow).
- Draw vector B (blue arrow).
- Drop a perpendicular from the tip of B onto the line defined by A.
- The length from the origin to that foot is the projection of B onto A.
The dot product equals |A| times that projection length.
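A short sketch of the projection reading: dividing the dot product by |A| gives the length of B's shadow along A's direction. A is placed on the x-axis so the answer is easy to see by eye.

```python
import numpy as np

A = np.array([4.0, 0.0])   # points along the x-axis, |A| = 4
B = np.array([3.0, 5.0])

proj_len = np.dot(A, B) / np.linalg.norm(A)   # length of B's shadow on A
print(proj_len)   # 3.0: B's x-component, since A lies on the x-axis
```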
8.7.3 Why This Matters for Language
In high-dimensional token space:
- "king" and "monarchy" have a large projection onto each other — they strongly share a "royalty" component.
- "king" and "algorithm" have a small projection — they share little in common.
The model doesn't have explicit dimensions labeled "royalty" or "abstractness." It learns directions in space that capture these distinctions from training data. Matrix multiplication — dot products — is how it measures alignment with those learned directions.
8.8 Connecting Back to Attention
8.8.1 The Attention Formula Preview
The core of Attention (details in Chapter 9):
Attention(Q, K, V) = Softmax(QK^T / √d_k) V
The term QK^T is a matrix multiply — it computes dot products between every Query vector and every Key vector simultaneously. The result is a matrix of similarity scores.
8.8.2 Reading Q, K, V Geometrically
Q = X WQ
K = X WK
V = X WV
Read this as: take the input X and view it through three learned geometric lenses. Each projection matrix WQ, WK, WV rotates and stretches the same data into a different coordinate system:
- WQ projects into a "what am I looking for" space.
- WK projects into a "what do I advertise" space.
- WV projects into a "what information do I contribute" space.
The dot product between Q and K vectors then measures alignment between these two projected spaces. High alignment → high attention weight → the model blends more of that token's Value into the output.
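A minimal NumPy sketch of this reading. Random matrices stand in for the learned weights; the names WQ, WK, WV follow the text, and the shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

X  = rng.normal(size=(seq_len, d_model))   # token representations
WQ = rng.normal(size=(d_model, d_model))   # three learned "lenses"
WK = rng.normal(size=(d_model, d_model))
WV = rng.normal(size=(d_model, d_model))

Q, K, V = X @ WQ, X @ WK, X @ WV           # three views of the same data

scores  = Q @ K.T / np.sqrt(d_model)       # all pairwise Query-Key dot products
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
output  = weights @ V                      # blend Values by attention weight
print(output.shape)                        # (4, 8)
```

Each row of `weights` sums to 1: for every token, the scores against all tokens have been turned into a blending recipe over the Value vectors.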
8.8.3 Summary of the Geometric Reading
| Math | Geometric meaning | Role in Attention |
|---|---|---|
| A · B | similarity / projection | measures Query-Key match |
| matrix multiply AB | batch dot products | computes all pairwise scores at once |
| Softmax | normalize to probabilities | converts scores into weights |
8.9 Chapter Summary
8.9.1 Key Concepts
| Concept | Meaning |
|---|---|
| Scalar | a single number |
| Vector | an ordered list of numbers; represents a point or direction |
| Matrix | a 2D table; represents a transformation or a batch of vectors |
| Dot product | element-wise multiply and sum; measures vector alignment |
| Linear transformation | using a weight matrix to rotate/stretch/project a vector |
| Cosine similarity | dot product normalized by vector lengths; pure angle measure |
| Projection | how much of one vector lies in the direction of another |
8.9.2 Key Formulas
Dot product:
A · B = a₁b₁ + a₂b₂ + ... + aₙbₙ
Cosine similarity:
cos(θ) = (A · B) / (|A| × |B|)
Projection of B onto A:
A · B = |A| × (B projected onto A)
Matrix multiply dimension rule:
[A, B] × [B, C] = [A, C]
8.9.3 Core Takeaway
Matrix multiplication is not abstract symbol-pushing. Geometrically, it measures how much two vectors point in the same direction. Attention uses this directly: the dot product between a Query and a Key vector says whether they "match." High match → high attention weight. That is the whole idea.
Chapter Checklist
After this chapter, you should be able to:
- State the dimension rule for matrix multiplication and compute a small example by hand.
- Explain the dot product as a measurement of vector alignment.
- Explain projection in plain English: how much of one vector lies in another's direction.
- Explain why matrix multiply is the right tool for computing all pairwise similarities between token vectors.
- Connect the dot product to the Query-Key matching inside Attention.
See You in the Next Chapter
That is the geometry. If you can draw two arrows, say "their dot product is large," and explain why that leads to a high attention weight, you are ready for Chapter 9.
Chapter 9 closes the loop: we put the geometric intuition together with the actual Attention formula, look at what attention heatmaps reveal, and answer the question of why dot product specifically — rather than some other similarity measure — became the standard choice.