One-sentence summary: before diving into details, first build a map; know where data enters, what it passes through, and where the prediction comes out.
3.1 Why Start With A Map?
Chapter 2 reduced a large model to:
parameters + inference code
But what does that inference code actually do?
This chapter does not explain every component deeply. It gives you the overview. Think of it as looking at a map before walking the city. Once you know the major landmarks, every later chapter has a place to attach.
After this chapter, you should have a mental model for:
- what the input is
- what transformations happen in the middle
- what the output is
- why the same block repeats many times
3.2 Start With The Simplified Flow
From bottom to top, the data moves through seven stages.
3.2.1 Raw Text
The input begins as text:
The agent opened a pull request.
Computers do not understand text directly. They need numbers.
3.2.2 Token IDs
The first step converts text into token IDs:
"The agent opened a pull request."
-> [791, 8479, 9107, 264, 6958, 1715, 13]
This process is called Tokenization. Chapter 4 will focus on it.
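To make the idea concrete, here is a toy sketch with a hypothetical word-level vocabulary. The dictionary and IDs below are invented for illustration; real tokenizers use learned subword vocabularies, which Chapter 4 covers.

```python
# Hypothetical word-level vocabulary, invented for illustration.
# Real tokenizers (Chapter 4) split text into learned subword pieces.
vocab = {"The": 0, "agent": 1, "opened": 2, "a": 3,
         "pull": 4, "request": 5, ".": 6}

def tokenize(text):
    """Map each whitespace-separated piece to its token ID."""
    pieces = text.replace(".", " .").split()
    return [vocab[p] for p in pieces]

print(tokenize("The agent opened a pull request."))
# [0, 1, 2, 3, 4, 5, 6]
```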
3.2.3 Token Vectors and Position
Token IDs are still just integers. The model looks them up in an embedding table and turns each token into a vector.
Then it adds position information, because order matters:
"The agent tagged the reviewer" != "The reviewer tagged the agent"
3.2.4 Attention
Attention lets tokens look at other tokens and decide what matters.
For example, when the model processes request, it may need to pay attention to pull, opened, and agent. Attention is the mechanism that computes those relationships.
3.2.5 Normalize and Process
The numbers flowing through a neural network can become too large or unstable. LayerNorm keeps values in a reasonable range.
The Feed Forward Network then processes each position further. If Attention is about relationships between tokens, the FFN is about transforming each token's internal representation.
3.2.6 Probabilities
At the end, the model produces a score for every token in the vocabulary. Softmax turns those scores into probabilities.
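Softmax itself is only a few lines. A minimal NumPy sketch, where subtracting the maximum is a standard trick for numerical stability:

```python
import numpy as np

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # stability: avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs.sum())  # 1.0, up to floating-point rounding
```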
3.2.7 Next Token
The model picks or samples a next token from those probabilities. Then the autoregressive loop from Chapter 2 repeats.
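A minimal sketch of that choice, assuming the probabilities are already computed. Real decoders layer temperature, top-k, and top-p filtering on top of this:

```python
import numpy as np

def pick_next_token(probs, greedy=True, rng=None):
    """Greedy decoding takes the most likely token;
    sampling draws one according to the distribution."""
    if greedy:
        return int(np.argmax(probs))
    rng = rng if rng is not None else np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

probs = np.array([0.1, 0.7, 0.2])
print(pick_next_token(probs))  # 1, the index of the largest probability
```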
The simplified flow is:
text -> token IDs -> vectors + position -> repeated blocks -> probabilities -> next token
3.3 The Standard Architecture
Now let us move from the simplified map to the standard architecture.
3.3.1 Inputs
The model receives a sequence of token IDs.
3.3.2 Token Embeddings
Each token ID is mapped to a vector. Similar tokens can eventually live near each other in vector space. For example, the vectors for pull request and code review should be closer than the vectors for pull request and playlist.
3.3.3 Positional Information
Transformer blocks do not naturally know sequence order. Position information tells the model which token came first, second, third, and so on.
3.3.4 Masked Multi-Head Attention
This is the core component.
- Masked means the model cannot look into the future while predicting the next token.
- Multi-head means the model uses several Attention views in parallel.
- Attention means tokens compute how strongly they should use information from other tokens.
3.3.5 LayerNorm and Residual Connections
LayerNorm stabilizes numbers. Residual connections let information skip around a block instead of being forced through every transformation.
We will study both later.
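As a preview, both ideas fit in a few lines of NumPy. This sketch omits LayerNorm's learned scale and shift parameters, and uses a trivial stand-in sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector to zero mean and roughly unit variance.
    Real LayerNorm adds learned scale and shift, omitted here."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_then_norm(x, sublayer):
    """Residual connection: add the sublayer's output to its input,
    so information can skip around the transformation."""
    return layer_norm(x + sublayer(x))

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = residual_then_norm(x, lambda h: h * 0.1)   # toy sublayer
print(y.mean())  # ~0.0 after normalization
```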
3.3.6 Feed Forward Network
The FFN is a small neural network applied at each token position. It expands and transforms the representation, then projects it back to the model dimension.
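That expand-then-project shape can be sketched directly. ReLU is used here for brevity (many real models use GELU), and d_ff is typically around four times d_model:

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: expand, apply a nonlinearity, project back."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # (seq, d_ff), ReLU
    return hidden @ w2 + b2                 # (seq, d_model)

d_model, d_ff = 4, 16                       # toy sizes; d_ff ~ 4 * d_model
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(3, d_model))           # three token positions
print(feed_forward(x, w1, b1, w2, b2).shape)  # (3, 4)
```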
3.3.7 Repeat N Times
One Transformer block is useful. Many stacked Transformer blocks are powerful.
Small models might use 12 layers. Larger models may use dozens. Each layer refines the representation.
3.3.8 Linear and Softmax
The last hidden vector is mapped to vocabulary-sized scores. If the vocabulary has 100,256 tokens, the output has 100,256 scores.
Softmax converts those scores into probabilities.
3.4 A More Detailed Map
This diagram has more information, but do not try to memorize it yet. The goal is to recognize the main zones.
3.4.1 Attention Internals
Attention begins with the input X.
The model multiplies X by three learned matrices:
- W_Q produces Query
- W_K produces Key
- W_V produces Value
Then:
- Q and K are multiplied to measure similarity.
- The scores are scaled.
- A mask prevents future-token leakage.
- Softmax turns scores into attention weights.
- The weights are applied to V.
- Multiple heads are concatenated.
- W_O projects the combined result.
This is the heart of the book. Chapters 8-12 will unpack it slowly.
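The steps above can be sketched end to end. Real multi-head attention splits d_model across several such heads and concatenates them; this toy version keeps a single head so each step stays visible:

```python
import numpy as np

def masked_attention(X, Wq, Wk, Wv, Wo):
    """Single-head masked attention, following the steps listed above."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project X three ways
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # similarity, scaled
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -np.inf, scores)  # hide future tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = w / w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return (weights @ V) @ Wo                 # apply to V, then project

d = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, d))                   # three toy token vectors
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
print(masked_attention(X, Wq, Wk, Wv, Wo).shape)  # (3, 4)
```

A useful property to notice: because of the mask, changing a later token never changes the output at an earlier position.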
3.4.2 The Decoder Block
The decoder block wraps Attention with normalization, residual paths, and the Feed Forward Network.
A simplified block is:
input
-> masked multi-head attention
-> add + layer norm
-> feed forward
-> add + layer norm
-> output
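In code, the simplified block is just two residual-plus-norm wrappers. This sketch uses the post-norm layout shown above (add, then LayerNorm; many modern models use pre-norm, but the components are identical) and identity stand-ins for the real sublayers:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def decoder_block(x, attn, ffn):
    """The simplified block above: attention, then FFN,
    each wrapped in add + layer norm."""
    x = layer_norm(x + attn(x))   # masked multi-head attention sublayer
    x = layer_norm(x + ffn(x))    # feed forward sublayer
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = decoder_block(x, attn=lambda h: h, ffn=lambda h: h)  # identity stand-ins
print(out.shape)  # (3, 4): shape is preserved, so blocks can stack
```

Because input and output shapes match, the block can be repeated N times, as in Section 3.3.7.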
3.4.3 The LM Head
The Language Model Head maps hidden vectors back into vocabulary space:
hidden vector -> logits over vocabulary -> probabilities
This is how the model turns internal state into a next-token prediction.
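A sketch of that mapping, using a toy 10-token vocabulary in place of a real ~100k one:

```python
import numpy as np

def lm_head(hidden, W_vocab):
    """Project a hidden vector to vocabulary-sized logits, then softmax."""
    logits = hidden @ W_vocab                # shape: (vocab_size,)
    exp = np.exp(logits - logits.max())      # stable softmax
    return exp / exp.sum()

d_model, vocab_size = 4, 10                  # toy sizes
rng = np.random.default_rng(0)
hidden = rng.normal(size=d_model)            # last position's hidden vector
W_vocab = rng.normal(size=(d_model, vocab_size))
probs = lm_head(hidden, W_vocab)
print(probs.shape, round(float(probs.sum()), 6))  # (10,) 1.0
```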
3.5 How The Three Maps Relate
| Map | Detail | Best use |
|---|---|---|
| Simplified flow | low | explain the system to a non-specialist |
| Standard architecture | medium | read papers and understand model diagrams |
| Detailed map | high | connect implementation to the architecture |
All three describe the same system. They differ only in resolution.
A useful analogy:
- simplified map = country map
- standard map = city map
- detailed map = street map
3.6 Component Preview
The rest of the book walks through the map piece by piece.
3.6.1 Core Components
| Chapter | Component | One-line explanation |
|---|---|---|
| Chapter 4 | Tokenization | text -> token IDs |
| Chapter 5 | Positional Encoding | add order information |
| Chapter 6 | LayerNorm and Softmax | stabilize numbers and turn scores into probabilities |
| Chapter 7 | Neural network layers | process representations |
3.6.2 Attention
| Chapter | Component | One-line explanation |
|---|---|---|
| Chapter 8 | Linear transforms | understand matrix multiplication geometrically |
| Chapter 9 | Attention geometry | why dot product measures similarity |
| Chapter 10 | Q, K, V | what query, key, and value mean |
| Chapter 11 | Multi-head attention | why multiple views help |
| Chapter 12 | Attention output | what Attention is actually updating |
3.6.3 Full Architecture
| Chapter | Component | One-line explanation |
|---|---|---|
| Chapter 13 | Residuals and Dropout | stabilize deep training |
| Chapter 14 | Embeddings plus position | understand input representation deeply |
| Chapter 15 | Full forward pass | connect every component |
| Chapter 16 | Training vs inference | understand the two operating modes |
3.7 Chapter Summary
3.7.1 The Core Flow
input text
|
Tokenization
|
Embedding
|
Position information
|
Transformer block x N
|
Linear projection
|
Softmax
|
next token
3.7.2 Terms to Remember
| Term | Role |
|---|---|
| Tokenization | converts text to token IDs |
| Embedding | converts token IDs to vectors |
| Positional Encoding | adds order information |
| Multi-Head Attention | learns relationships between tokens |
| LayerNorm | stabilizes numeric ranges |
| Feed Forward | processes each token representation |
| Residual Connection | preserves information across layers |
| Softmax | converts scores to probabilities |
3.7.3 Core Takeaway
A Transformer is structurally simple: input processing, repeated blocks, output prediction. The block has two main jobs: Attention learns relationships; FFN processes information.
Chapter Checklist
After this chapter, you should be able to:
- Draw the simplified Transformer flow.
- Name the main components in a decoder-only Transformer.
- Explain how data moves from input text to next-token probabilities.
- Place future chapters on the overall map.
See You in the Next Chapter
That is enough map-reading. If you can redraw the pipeline from text to probabilities on a whiteboard, you are ready to zoom into the first component.
Now we start Part 2: core components.
Chapter 4 explains Tokenization: how text becomes numbers, why English and Chinese tokenize differently, and why models count tokens instead of words.