One-sentence summary: before diving into details, first build a map. Know where data enters, what it passes through, and where the prediction comes out.


3.1 Why Start With A Map?

Chapter 2 reduced a large model to:

parameters + inference code

But what does that inference code actually do?

This chapter does not explain every component deeply. It gives you the overview. Think of it as looking at a map before walking the city. Once you know the major landmarks, every later chapter has a place to attach.

After this chapter, you should have a mental model for:

  • what the input is
  • what transformations happen in the middle
  • what the output is
  • why the same block repeats many times

3.2 Start With The Simplified Flow

[Figure: Simplified Transformer flow]

From bottom to top, the data moves through seven stages.

3.2.1 Raw Text

The input begins as text:

The agent opened a pull request.

Computers do not understand text directly. They need numbers.

3.2.2 Token IDs

The first step converts text into token IDs:

"The agent opened a pull request."
-> [791, 8479, 9107, 264, 6958, 1715, 13]

This process is called Tokenization. Chapter 4 will focus on it.
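As a sketch of the interface only: the toy tokenizer below uses a hypothetical word-level vocabulary, while real tokenizers use subword schemes such as BPE. The IDs shown above come from a real tokenizer; the IDs below are made up for illustration.

```python
# Toy tokenizer sketch: a hypothetical word-level vocabulary.
# Real tokenizers split text into subword pieces, but the
# interface is the same: text in, integer IDs out.
toy_vocab = {"The": 0, "agent": 1, "opened": 2, "a": 3,
             "pull": 4, "request": 5, ".": 6}

def encode(text: str) -> list[int]:
    """Map each whitespace-separated word to its ID."""
    words = text.replace(".", " .").split()
    return [toy_vocab[w] for w in words]

ids = encode("The agent opened a pull request.")
print(ids)  # [0, 1, 2, 3, 4, 5, 6]
```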

3.2.3 Token Vectors and Position

Token IDs are still just integers. The model looks them up in an embedding table and turns each token into a vector.

Then it adds position information, because order matters:

"The agent tagged the reviewer" != "The reviewer tagged the agent"

3.2.4 Attention

Attention lets tokens look at other tokens and decide what matters.

For example, when the model processes request, it may need to pay attention to pull, opened, and agent. Attention is the mechanism that computes those relationships.

3.2.5 Normalize and Process

The numbers flowing through a neural network can become too large or unstable. LayerNorm keeps values in a reasonable range.

The Feed Forward Network then processes each position further. If Attention is about relationships between tokens, the FFN is about transforming each token's internal representation.
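A minimal LayerNorm sketch (omitting the learned scale and shift parameters a real LayerNorm also has):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 100.0, -50.0, 3.0]])  # wildly scaled values
y = layer_norm(x)
print(y.mean(), y.std())  # roughly 0 and 1
```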

3.2.6 Probabilities

At the end, the model produces a score for every token in the vocabulary. Softmax turns those scores into probabilities.
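Softmax fits in a few lines of numpy; subtracting the maximum first is a standard numerical-stability trick:

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    z = scores - np.max(scores)  # stability: avoids overflow in exp()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical scores for 3 tokens
probs = softmax(logits)
print(probs.sum())  # 1.0 (up to floating point)
```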

3.2.7 Next Token

The model picks or samples a next token from those probabilities. Then the autoregressive loop from Chapter 2 repeats.

The simplified flow is:

text -> token IDs -> vectors + position -> repeated blocks -> probabilities -> next token
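The loop can be sketched with a stand-in model. Here `next_token_probs` is a hypothetical placeholder that returns random probabilities instead of running the real pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 7

def next_token_probs(token_ids):
    """Stand-in for the whole pipeline above: a real model would run
    the tokens through embeddings, blocks, and softmax. Here: random."""
    logits = rng.normal(size=vocab_size)
    e = np.exp(logits - logits.max())
    return e / e.sum()

tokens = [0, 1, 2]               # some starting token IDs
for _ in range(4):               # the autoregressive loop from Chapter 2
    probs = next_token_probs(tokens)
    nxt = int(np.argmax(probs))  # greedy: pick the most likely token
    tokens.append(nxt)

print(tokens)  # 3 prompt tokens followed by 4 generated tokens
```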

3.3 The Standard Architecture

Now let us move from the simplified map to the standard architecture.

[Figure: Standard decoder-only Transformer architecture]

3.3.1 Inputs

The model receives a sequence of token IDs.

3.3.2 Token Embeddings

Each token ID is mapped to a vector. Similar tokens can eventually live near each other in vector space. For example, the vectors for pull request and code review should be closer than the vectors for pull request and playlist.

3.3.3 Positional Information

Attention by itself treats its input as an unordered set: Transformer blocks do not naturally know sequence order. Position information tells the model which token came first, second, third, and so on.

3.3.4 Masked Multi-Head Attention

This is the core component.

  • Masked means the model cannot look into the future while predicting the next token.
  • Multi-head means the model uses several Attention views in parallel.
  • Attention means tokens compute how strongly they should use information from other tokens.
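The mask can be sketched directly: positions above the diagonal are set to negative infinity, so that after Softmax the future tokens receive exactly zero weight:

```python
import numpy as np

seq_len = 4
# Causal mask: position i may attend to positions 0..i, never the future.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.zeros((seq_len, seq_len))  # stand-in attention scores
scores[mask] = -np.inf                 # -inf becomes 0 after softmax

print(scores)  # first row: only position 0 is visible
```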

3.3.5 LayerNorm and Residual Connections

LayerNorm stabilizes numbers. Residual connections let information skip around a block instead of being forced through every transformation.

We will study both later.

3.3.6 Feed Forward Network

The FFN is a small neural network applied at each token position. It expands and transforms the representation, then projects it back to the model dimension.
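A sketch with made-up sizes, using ReLU where real models typically use GELU or similar (and omitting the learned biases):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # FFNs commonly expand by about 4x

W1 = rng.normal(size=(d_model, d_ff))  # expand
W2 = rng.normal(size=(d_ff, d_model))  # project back

def ffn(x):
    """Expand each token vector, apply a nonlinearity, project back."""
    h = np.maximum(x @ W1, 0.0)  # (seq, d_ff), ReLU
    return h @ W2                # back to (seq, d_model)

x = rng.normal(size=(5, d_model))  # 5 token positions
y = ffn(x)
print(y.shape)  # (5, 8): same shape in, same shape out
```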

3.3.7 Repeat N Times

One Transformer block is useful. Many stacked Transformer blocks are powerful.

Small models might use 12 layers. Larger models may use dozens. Each layer refines the representation.

3.3.8 Linear and Softmax

The last hidden vector is mapped to vocabulary-sized scores. If the vocabulary has 100,256 tokens, the output has 100,256 scores.

Softmax converts those scores into probabilities.


3.4 A More Detailed Map

[Figure: Detailed Transformer architecture map]

This diagram has more information, but do not try to memorize it yet. The goal is to recognize the main zones.

3.4.1 Attention Internals

Attention begins with the input X.

The model multiplies X by three learned matrices:

  • WQ produces Query
  • WK produces Key
  • WV produces Value

Then:

  1. Q is multiplied by K (transposed) to measure similarity.
  2. The scores are scaled.
  3. A mask prevents future-token leakage.
  4. Softmax turns scores into attention weights.
  5. The weights are applied to V.
  6. Multiple heads are concatenated.
  7. WO projects the combined result.
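Steps 1 through 5 can be sketched in numpy for a single head (so the concatenation in step 6 and the WO projection in step 7 are omitted); every size and matrix here is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8  # single head, for clarity

X = rng.normal(size=(seq_len, d_model))
WQ, WK, WV = (rng.normal(size=(d_model, d_head)) for _ in range(3))

Q, K, V = X @ WQ, X @ WK, X @ WV       # project X to Query, Key, Value
scores = Q @ K.T / np.sqrt(d_head)     # similarity, then scaling (steps 1-2)
scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf  # mask (3)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> weights (4)
out = weights @ V                               # apply weights to V (5)

print(weights[0])  # first token can only attend to itself: [1, 0, 0, 0]
```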

This is the heart of the book. Chapters 8-12 will unpack it slowly.

3.4.2 The Decoder Block

The decoder block wraps Attention with normalization, residual paths, and the Feed Forward Network.

A simplified block is:

input
-> masked multi-head attention
-> add + layer norm
-> feed forward
-> add + layer norm
-> output
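The wiring can be sketched with identity functions standing in for the real sublayers, just to show where the residual additions and LayerNorms sit. (This is the post-norm arrangement shown above; many modern models normalize before each sublayer instead.)

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    m = x.mean(-1, keepdims=True)
    v = x.var(-1, keepdims=True)
    return (x - m) / np.sqrt(v + eps)

def decoder_block(x, attention, feed_forward):
    """Wire sublayers with residual (add) and LayerNorm."""
    x = layer_norm(x + attention(x))     # masked attention + add & norm
    x = layer_norm(x + feed_forward(x))  # feed forward + add & norm
    return x

# Identity stand-ins for the sublayers, purely to show the wiring.
x = np.random.default_rng(0).normal(size=(4, 8))
y = decoder_block(x, attention=lambda h: h, feed_forward=lambda h: h)
print(y.shape)  # (4, 8)
```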

3.4.3 The LM Head

The Language Model Head maps hidden vectors back into vocabulary space:

hidden vector -> logits over vocabulary -> probabilities

This is how the model turns internal state into a next-token prediction.
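A sketch with a random matrix standing in for the learned LM head, using the 100,256-token vocabulary size from the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 100_256  # vocabulary size from the example above

W_lm = rng.normal(size=(d_model, vocab_size))  # stand-in LM head weights

hidden = rng.normal(size=(d_model,))  # last position's hidden vector
logits = hidden @ W_lm                # one score per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax -> probabilities

print(logits.shape)  # (100256,)
```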


3.5 How The Three Maps Relate

Map                     Detail   Best use
Simplified flow         low      explain the system to a non-specialist
Standard architecture   medium   read papers and understand model diagrams
Detailed map            high     connect implementation to the architecture

All three describe the same system. They differ only in resolution.

A useful analogy:

  • simplified map = country map
  • standard map = city map
  • detailed map = street map

3.6 Component Preview

The rest of the book walks through the map piece by piece.

3.6.1 Core Components

Chapter     Component               One-line explanation
Chapter 4   Tokenization            text -> token IDs
Chapter 5   Positional Encoding     add order information
Chapter 6   LayerNorm and Softmax   stabilize numbers and turn scores into probabilities
Chapter 7   Neural network layers   process representations

3.6.2 Attention

Chapter      Component              One-line explanation
Chapter 8    Linear transforms      understand matrix multiplication geometrically
Chapter 9    Attention geometry     why the dot product measures similarity
Chapter 10   Q, K, V                what query, key, and value mean
Chapter 11   Multi-head attention   why multiple views help
Chapter 12   Attention output       what Attention is actually updating

3.6.3 Full Architecture

Chapter      Component                  One-line explanation
Chapter 13   Residuals and Dropout      stabilize deep training
Chapter 14   Embeddings plus position   understand input representation deeply
Chapter 15   Full forward pass          connect every component
Chapter 16   Training vs inference      understand the two operating modes

3.7 Chapter Summary

3.7.1 The Core Flow

input text
    |
Tokenization
    |
Embedding
    |
Position information
    |
Transformer block x N
    |
Linear projection
    |
Softmax
    |
next token

3.7.2 Terms to Remember

Term                   Role
Tokenization           converts text to token IDs
Embedding              converts token IDs to vectors
Positional Encoding    adds order information
Multi-Head Attention   learns relationships between tokens
LayerNorm              stabilizes numeric ranges
Feed Forward           processes each token representation
Residual Connection    preserves information across layers
Softmax                converts scores to probabilities

3.7.3 Core Takeaway

A Transformer is structurally simple: input processing, repeated blocks, output prediction. The block has two main jobs: Attention learns relationships; FFN processes information.


Chapter Checklist

After this chapter, you should be able to:

  • Draw the simplified Transformer flow.
  • Name the main components in a decoder-only Transformer.
  • Explain how data moves from input text to next-token probabilities.
  • Place future chapters on the overall map.

See You in the Next Chapter

That is enough map-reading. If you can redraw the pipeline from text to probabilities on a whiteboard, you are ready to zoom into the first component.

Now we start Part 2: core components.

Chapter 4 explains Tokenization: how text becomes numbers, why English and Chinese tokenize differently, and why models count tokens instead of words.

Cite this page
Zhang, Wayland (2026). Chapter 3: The Transformer Map. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-03-transformer-map
@incollection{zhang2026transformer_chapter_03_transformer_map,
  author = {Zhang, Wayland},
  title = {Chapter 3: The Transformer Map},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-03-transformer-map}
}