One-sentence summary: before diving into details, first build a map; know where data enters, what it passes through, and where the prediction comes out.
3.1 Why Start With A Map?
Chapter 2 reduced a large model to:
parameters + inference code
But what does that inference code actually do?
This chapter does not explain every component deeply. It gives you the overview. Think of it as looking at a map before walking the city. Once you know the major landmarks, every later chapter has a place to attach.
After this chapter, you should have a mental model for:
- what the input is
- what transformations happen in the middle
- what the output is
- why the same block repeats many times
3.2 Start With The Simplified Flow
From bottom to top, the data moves through seven stages.
3.2.1 Raw Text
The input begins as text:
The agent opened a pull request.
Computers do not understand text directly. They need numbers.
3.2.2 Token IDs
The first step converts text into token IDs:
"The agent opened a pull request."
-> [791, 8479, 9107, 264, 6958, 1715, 13]
This process is called Tokenization. Chapter 4 will focus on it.
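To make the idea concrete, here is a toy sketch with a hypothetical word-level vocabulary. The dictionary and IDs below are invented for illustration; real tokenizers use learned subword vocabularies, which Chapter 4 covers.

```python
# Hypothetical word-level vocabulary, invented for illustration.
# Real tokenizers (Chapter 4) split text into learned subword pieces.
vocab = {"The": 0, "agent": 1, "opened": 2, "a": 3,
         "pull": 4, "request": 5, ".": 6}

def tokenize(text):
    """Map each whitespace-separated piece to its token ID."""
    pieces = text.replace(".", " .").split()
    return [vocab[p] for p in pieces]

print(tokenize("The agent opened a pull request."))
# [0, 1, 2, 3, 4, 5, 6]
```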
3.2.3 Token Vectors and Position
Token IDs are still just integers. The model looks them up in an embedding table and turns each token into a vector.
Then it adds position information, because order matters:
"The agent tagged the reviewer" != "The reviewer tagged the agent"
3.2.4 Attention
Attention lets tokens look at other tokens and decide what matters.
For example, when the model processes request, it may need to pay attention to pull, opened, and agent. Attention is the mechanism that computes those relationships.
3.2.5 Normalize and Process
The numbers flowing through a neural network can become too large or unstable. LayerNorm keeps values in a reasonable range.
The Feed Forward Network then processes each position further. If Attention is about relationships between tokens, the FFN is about transforming each token's internal representation.
3.2.6 Probabilities
At the end, the model produces a score for every token in the vocabulary. Softmax turns those scores into probabilities.
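Softmax itself is only a few lines. A minimal NumPy sketch, where subtracting the maximum is a standard trick for numerical stability:

```python
import numpy as np

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # stability: avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs.sum())  # 1.0, up to floating-point rounding
```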
3.2.7 Next Token
The model picks or samples a next token from those probabilities. Then the autoregressive loop from Chapter 2 repeats.
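A minimal sketch of that choice, assuming the probabilities are already computed. Real decoders layer temperature, top-k, and top-p filtering on top of this:

```python
import numpy as np

def pick_next_token(probs, greedy=True, rng=None):
    """Greedy decoding takes the most likely token;
    sampling draws one according to the distribution."""
    if greedy:
        return int(np.argmax(probs))
    rng = rng if rng is not None else np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

probs = np.array([0.1, 0.7, 0.2])
print(pick_next_token(probs))  # 1, the index of the largest probability
```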
The simplified flow is:
text -> token IDs -> vectors + position -> repeated blocks -> probabilities -> next token
3.3 The Standard Architecture
Now let us move from the simplified map to the standard architecture.
3.3.1 Inputs
The model receives a sequence of token IDs.
3.3.2 Token Embeddings
Each token ID is mapped to a vector. Similar tokens can eventually live near each other in vector space. For example, the vectors for pull request and code review should be closer than the vectors for pull request and playlist.
3.3.3 Positional Information
Transformer blocks do not naturally know sequence order. Position information tells the model which token came first, second, third, and so on.
3.3.4 Masked Multi-Head Attention
This is the core component.
- Masked means the model cannot look into the future while predicting the next token.
- Multi-head means the model uses several Attention views in parallel.
- Attention means tokens compute how strongly they should use information from other tokens.
3.3.5 LayerNorm and Residual Connections
LayerNorm stabilizes numbers. Residual connections let information skip around a block instead of being forced through every transformation.
We will study both later.
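As a preview, both ideas fit in a few lines of NumPy. This sketch omits LayerNorm's learned scale and shift parameters, and uses a trivial stand-in sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector to zero mean and roughly unit variance.
    Real LayerNorm adds learned scale and shift, omitted here."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_then_norm(x, sublayer):
    """Residual connection: add the sublayer's output to its input,
    so information can skip around the transformation."""
    return layer_norm(x + sublayer(x))

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = residual_then_norm(x, lambda h: h * 0.1)   # toy sublayer
print(y.mean())  # ~0.0 after normalization
```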
3.3.6 Feed Forward Network
The FFN is a small neural network applied at each token position. It expands and transforms the representation, then projects it back to the model dimension.
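That expand-then-project shape can be sketched directly. ReLU is used here for brevity (many real models use GELU), and d_ff is typically around four times d_model:

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: expand, apply a nonlinearity, project back."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # (seq, d_ff), ReLU
    return hidden @ w2 + b2                 # (seq, d_model)

d_model, d_ff = 4, 16                       # toy sizes; d_ff ~ 4 * d_model
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(3, d_model))           # three token positions
print(feed_forward(x, w1, b1, w2, b2).shape)  # (3, 4)
```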
3.3.7 Repeat N Times
One Transformer block is useful. Many stacked Transformer blocks are powerful.
Small models might use 12 layers. Larger models may use dozens. Each layer refines the representation.
3.3.8 Linear and Softmax
The last hidden vector is mapped to vocabulary-sized scores. If the vocabulary has 100,256 tokens, the output has 100,256 scores.
Softmax converts those scores into probabilities.
3.4 A More Detailed Map
This diagram has more information, but do not try to memorize it yet. The goal is to recognize the main zones.
3.4.1 Attention Internals
Attention begins with the input X.
The model multiplies X by three learned matrices:
- W_Q produces Query
- W_K produces Key
- W_V produces Value
Then:
- Q and K are multiplied to measure similarity.
- The scores are scaled.
- A mask prevents future-token leakage.
- Softmax turns scores into attention weights.
- The weights are applied to V.
- Multiple heads are concatenated.
- W_O projects the combined result.
This is the heart of the book. Chapters 8-12 will unpack it slowly.
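The steps above can be sketched end to end. Real multi-head attention splits d_model across several such heads and concatenates them; this toy version keeps a single head so each step stays visible:

```python
import numpy as np

def masked_attention(X, Wq, Wk, Wv, Wo):
    """Single-head masked attention, following the steps listed above."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project X three ways
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # similarity, scaled
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -np.inf, scores)  # hide future tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = w / w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return (weights @ V) @ Wo                 # apply to V, then project

d = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, d))                   # three toy token vectors
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
print(masked_attention(X, Wq, Wk, Wv, Wo).shape)  # (3, 4)
```

A useful property to notice: because of the mask, changing a later token never changes the output at an earlier position.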
3.4.2 The Decoder Block
The decoder block wraps Attention with normalization, residual paths, and the Feed Forward Network.
A simplified block is:
input
-> masked multi-head attention
-> add + layer norm
-> feed forward
-> add + layer norm
-> output
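In code, the simplified block is just two residual-plus-norm wrappers. This sketch uses the post-norm layout shown above (add, then LayerNorm; many modern models use pre-norm, but the components are identical) and identity stand-ins for the real sublayers:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def decoder_block(x, attn, ffn):
    """The simplified block above: attention, then FFN,
    each wrapped in add + layer norm."""
    x = layer_norm(x + attn(x))   # masked multi-head attention sublayer
    x = layer_norm(x + ffn(x))    # feed forward sublayer
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = decoder_block(x, attn=lambda h: h, ffn=lambda h: h)  # identity stand-ins
print(out.shape)  # (3, 4): shape is preserved, so blocks can stack
```

Because input and output shapes match, the block can be repeated N times, as in Section 3.3.7.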
3.4.3 The LM Head
The Language Model Head maps hidden vectors back into vocabulary space:
hidden vector -> logits over vocabulary -> probabilities
This is how the model turns internal state into a next-token prediction.
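A sketch of that mapping, using a toy 10-token vocabulary in place of a real ~100k one:

```python
import numpy as np

def lm_head(hidden, W_vocab):
    """Project a hidden vector to vocabulary-sized logits, then softmax."""
    logits = hidden @ W_vocab                # shape: (vocab_size,)
    exp = np.exp(logits - logits.max())      # stable softmax
    return exp / exp.sum()

d_model, vocab_size = 4, 10                  # toy sizes
rng = np.random.default_rng(0)
hidden = rng.normal(size=d_model)            # last position's hidden vector
W_vocab = rng.normal(size=(d_model, vocab_size))
probs = lm_head(hidden, W_vocab)
print(probs.shape, round(float(probs.sum()), 6))  # (10,) 1.0
```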
3.5 How The Three Maps Relate
| Map | Detail | Best use |
|---|---|---|
| Simplified flow | low | explain the system to a non-specialist |
| Standard architecture | medium | read papers and understand model diagrams |
| Detailed map | high | connect implementation to the architecture |
All three describe the same system. They differ only in resolution.
A useful analogy:
- simplified map = country map
- standard map = city map
- detailed map = street map
3.6 Component Preview
The rest of the book walks through the map piece by piece.
3.6.1 Core Components
| Chapter | Component | One-line explanation |
|---|---|---|
| Chapter 4 | Tokenization | text -> token IDs |
| Chapter 5 | Positional Encoding | add order information |
| Chapter 6 | LayerNorm and Softmax | stabilize numbers and turn scores into probabilities |
| Chapter 7 | Neural network layers | process representations |
3.6.2 Attention
| Chapter | Component | One-line explanation |
|---|---|---|
| Chapter 8 | Linear transforms | understand matrix multiplication geometrically |
| Chapter 9 | Attention geometry | why dot product measures similarity |
| Chapter 10 | Q, K, V | what query, key, and value mean |
| Chapter 11 | Multi-head attention | why multiple views help |
| Chapter 12 | Attention output | what Attention is actually updating |
3.6.3 Full Architecture
| Chapter | Component | One-line explanation |
|---|---|---|
| Chapter 13 | Residuals and Dropout | stabilize deep training |
| Chapter 14 | Embeddings plus position | understand input representation deeply |
| Chapter 15 | Full forward pass | connect every component |
| Chapter 16 | Training vs inference | understand the two operating modes |
3.7 Chapter Summary
3.7.1 The Core Flow
input text
|
Tokenization
|
Embedding
|
Position information
|
Transformer block x N
|
Linear projection
|
Softmax
|
next token
3.7.2 Terms to Remember
| Term | Role |
|---|---|
| Tokenization | converts text to token IDs |
| Embedding | converts token IDs to vectors |
| Positional Encoding | adds order information |
| Multi-Head Attention | learns relationships between tokens |
| LayerNorm | stabilizes numeric ranges |
| Feed Forward | processes each token representation |
| Residual Connection | preserves information across layers |
| Softmax | converts scores to probabilities |
3.7.3 Core Takeaway
A Transformer is structurally simple: input processing, repeated blocks, output prediction. The block has two main jobs: Attention learns relationships; FFN processes information.
Chapter Checklist
After this chapter, you should be able to:
- Draw the simplified Transformer flow.
- Name the main components in a decoder-only Transformer.
- Explain how data moves from input text to next-token probabilities.
- Place future chapters on the overall map.
See You in the Next Chapter
That is enough map-reading. If you can redraw the pipeline from text to probabilities on a whiteboard, you are ready to zoom into the first component.
Now we start Part 2: core components.
Chapter 4 explains Tokenization: how text becomes numbers, why English and Chinese tokenize differently, and why models count tokens instead of words.