One-sentence summary: The Transformer forward pass is a pipeline — text → tokens → embeddings + position → N blocks (Attention + FFN) → linear projection to vocabulary → Softmax probabilities → predicted next token. Understand this pipeline and you understand how GPT "thinks."


15.1 The Big Picture: Decoder-Only Architecture

15.1.1 GPT, LLaMA, Claude Are All Decoder-Only

Decoder-Only architecture overview

Modern language models — GPT, LLaMA, Claude — all use a Decoder-Only architecture. Unlike the original Transformer's Encoder-Decoder design, Decoder-Only keeps only the decoder stack and focuses exclusively on autoregressive generation.

15.1.2 GPT-2 vs GPT-1 Architecture Comparison

GPT-1 vs GPT-2 architecture comparison

These two models share the Decoder-Only skeleton. The main difference is LayerNorm placement:

                     GPT-1 (Post-Norm)     GPT-2 (Pre-Norm)
LayerNorm position   After Attention/FFN   Before Attention/FFN
Training stability   Less stable           More stable
Modern models        -                     LLaMA, GPT-3, and nearly everything since

Pre-Norm is now the default. We'll use GPT-2 as our reference and trace the data flow from bottom to top.

15.1.3 Complete Pipeline Overview

Input text: "The agent opened a pull request for"
         |
    Step 1: Tokenization
         |
    Step 2: Word Embeddings
         |
    Step 3: Positional Encoding
         |
    Steps 4-6: N × Transformer Block (GPT-2 Pre-Norm style)
              (LayerNorm → Attention → Residual → LayerNorm → FFN → Residual)
         |
    Step 7: Final Layer Norm
         |
    Step 8: Linear → Softmax → Output Probability
         |
    Predicted next token: "review" (highest probability)

15.2 Steps 1-3: Input Processing

15.2.1 Step 1: Tokenization

Text becomes numbers. Using tiktoken with cl100k_base:

Input: "The agent opened a pull request for"

Token IDs: [791, 8479, 9107, 264, 6958, 1715, 369]
Length: 7 tokens

Each word or subword piece maps to a unique integer ID. The tokenizer does not see characters — it sees learned vocabulary entries.
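The lookup idea can be sketched with a toy vocabulary. The words and IDs below are invented for illustration; a real BPE tokenizer such as tiktoken's cl100k_base learns roughly 100k subword entries from data and also splits unseen words into smaller pieces.

```python
# Toy illustration of tokenization as vocabulary lookup.
# This vocabulary and these IDs are made up -- a trained BPE tokenizer
# would assign different IDs and handle subword splitting.
toy_vocab = {"The": 0, "agent": 1, "opened": 2, "a": 3,
             "pull": 4, "request": 5, "for": 6}

def tokenize(text: str) -> list[int]:
    """Map each whitespace-separated piece to its integer ID."""
    return [toy_vocab[piece] for piece in text.split()]

ids = tokenize("The agent opened a pull request for")
print(ids)  # 7 integer token IDs
```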

15.2.2 Step 2: Word Embeddings

Token IDs enter the model via a lookup table:

Token IDs [7]
    | Embedding lookup (vocab_size × d_model matrix)
Token Embeddings [7, 512]

Each token becomes a 512-dimensional vector carrying semantic information about that token. This is the first place where meaning lives as geometry.
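The lookup is plain row indexing into a learned matrix. A minimal NumPy sketch, with random weights standing in for trained ones and vocab_size scaled down to 10,000 to keep the example light:

```python
import numpy as np

vocab_size, d_model = 10_000, 512   # scaled-down vocab for this sketch
rng = np.random.default_rng(0)

# The embedding table is a (vocab_size, d_model) matrix of learned weights;
# random values stand in for trained ones here.
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([791, 8479, 9107, 264, 6958, 1715, 369])
token_embeddings = embedding_table[token_ids]   # row lookup, no matmul needed
print(token_embeddings.shape)  # (7, 512)
```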

15.2.3 Step 3: Positional Encoding

Positional encoding visualization

Attention has no built-in sense of order. Without position information, the model cannot distinguish "the agent tagged the reviewer" from "the reviewer tagged the agent." Position encoding fixes that:

Token Embeddings [7, 512]
    | + Positional Encoding [7, 512]
Input Vectors [7, 512]

Two common strategies:

  • Original Transformer: fixed sinusoidal functions (no training required)
  • GPT series: learned positional embeddings (trained end-to-end)

Either way, every position gets a unique encoding, and nearby positions get similar encodings.

Output: each token is now a 512-dimensional vector that simultaneously encodes what it is and where it sits.
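The original Transformer's fixed sinusoidal variant fits in a few lines of NumPy; position 0 always encodes as alternating 0s and 1s (sin 0 and cos 0):

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal positional encoding from the 2017 Transformer paper."""
    pos = np.arange(seq_len)[:, None]        # (seq, 1) positions
    two_i = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2) even dims
    angles = pos / (10000 ** (two_i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe

pe = sinusoidal_pe(7, 512)
# In the model this is simply added elementwise to the token embeddings.
print(pe.shape)  # (7, 512)
```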


15.3 Step 4: Inside a Transformer Block

15.3.1 Block Structure

Transformer Block internal structure

Each Transformer Block has two sub-layers:

Input X [7, 512]
    |
+-------------------------------+
|  LayerNorm                    |
|      |                        |
|  Multi-Head Attention         |  <- understands relationships between tokens
|      |                        |
|  Dropout -> + X (residual)    |
+-------------------------------+
    |
+-------------------------------+
|  LayerNorm                    |
|      |                        |
|  Feed Forward Network         |  <- feature transformation
|      |                        |
|  Dropout -> + X (residual)    |
+-------------------------------+
    |
Output [7, 512]

The critical property: input is [7, 512], output is still [7, 512]. The dimension does not change through any block. Only the final projection breaks that invariant.

15.3.2 Multi-Head Attention in Detail

Attention block step by step

Attention is the core of the Transformer. Let me break it down step by step.

Step 4.1: Generate Q, K, V

Input X [7, 512]
    | × Wq, Wk, Wv (three weight matrices)
Q, K, V each [7, 512]
    | split across heads
Each head: Q, K, V each [7, 64]  (assuming 8 heads)
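The projection-and-split step looks like this in NumPy. A random X and Wq stand in for real activations and weights; Wk and Wv work identically:

```python
import numpy as np

d_model, n_heads, seq = 512, 8, 7
d_k = d_model // n_heads                 # 64 dimensions per head
rng = np.random.default_rng(0)

X = rng.normal(size=(seq, d_model))      # block input
Wq = rng.normal(size=(d_model, d_model)) # Wk and Wv have the same shape

Q = X @ Wq                               # (7, 512)
# Split the 512 dims into 8 heads of 64, then bring the head axis forward.
Q_heads = Q.reshape(seq, n_heads, d_k).transpose(1, 0, 2)
print(Q_heads.shape)  # (8, 7, 64): 8 heads, each seeing 64 of the 512 dims
```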

Step 4.2: Compute Attention Scores

Self-Attention diagram
$$\text{Attention Score} = \frac{QK^T}{\sqrt{d_k}}$$

The dot product of Q and K measures similarity — how much should token i attend to token j.

Step 4.3: Visualize the Attention Matrix

QK attention scores before masking

The raw Q × K^T result is a 7×7 matrix. Each cell is the similarity score between one pair of token positions. Darker means higher similarity.

Step 4.4: Apply Causal Mask

QK attention scores after causal mask

The lower-triangular mask sets the upper triangle to -inf. After Softmax, -inf becomes 0. This is the Causal Mask — each position can only see tokens that come before it (or itself). This is what makes the model safe to train with teacher forcing and honest at inference time.

Step 4.5: Softmax and Weighted Sum

Attention Weights = Softmax(Masked Scores)
Output = Attention Weights × V

Each position's output is a weighted average of the V vectors it is allowed to see, with weights given by the softmaxed (masked) scores. Which patterns deserve attention is learned during training.
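Steps 4.2 through 4.5 fit in one short NumPy function. This sketch is single-head, with random Q, K, V standing in for real projections:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (single head)."""
    seq, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq, seq) similarities
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)         # hide future positions
    # Row-wise softmax: -inf entries become exactly 0 after exponentiation.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(7, 64)) for _ in range(3))
out, w = causal_attention(Q, K, V)
print(out.shape)  # (7, 64)
```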

15.3.3 Feed Forward Network

The FFN is a simple two-layer network that operates independently on each token position:

Input [7, 512]
    | Linear: 512 → 2048  (4× expansion)
    | ReLU activation
    | Linear: 2048 → 512  (back to model width)
Output [7, 512]

The FFN stores the majority of the model's "factual knowledge." It accounts for nearly half of all parameters — more on that in the parameter count section below.
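A NumPy sketch of the position-wise FFN, with random weights and biases omitted for brevity:

```python
import numpy as np

d_model, d_ff, seq = 512, 2048, 7
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * 0.02   # expand: 512 -> 2048
W2 = rng.normal(size=(d_ff, d_model)) * 0.02   # contract: 2048 -> 512

def ffn(x):
    """Position-wise feed-forward: each token row is transformed independently."""
    return np.maximum(x @ W1, 0.0) @ W2        # ReLU between the two linears

x = rng.normal(size=(seq, d_model))
print(ffn(x).shape)  # (7, 512) -- shape preserved, as in every sub-layer
```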


15.4 Steps 5-6: Residual Connections and LayerNorm

15.4.1 Why Residual Connections?

Every sub-layer wraps its computation in a residual connection:

output = x + sublayer(x)  # not: output = sublayer(x)

Benefits:

  • Gradients can flow directly backward through the identity path, avoiding vanishing
  • The network can stack many layers without training instability
  • If a sub-layer learns nothing useful, the original signal passes through unchanged

This is the bypass lane from Chapter 13. Without it, 48-layer GPT-2 would not converge.

15.4.2 LayerNorm Placement

GPT-2 uses Pre-Norm:

# Pre-Norm (GPT-2, LLaMA, most modern models)
output = x + attention(layernorm(x))

# Post-Norm (original Transformer, 2017)
output = layernorm(x + attention(x))

Pre-Norm normalizes the input before the sublayer, not the combined output. This stabilizes training, especially in the early steps when parameter scales are unpredictable.
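LayerNorm itself is tiny. A sketch without the learned scale (gamma) and shift (beta) parameters that a real implementation adds:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit variance.
    The learned gamma/beta parameters are omitted in this sketch."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Badly scaled activations (mean 3, std 10) come out standardized per token.
x = np.random.default_rng(0).normal(loc=3.0, scale=10.0, size=(7, 512))
y = layernorm(x)
print(float(y.mean()), float(y.std()))  # close to 0.0 and 1.0
```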


15.5 Step 7: Stacking Multiple Blocks

15.5.1 N Repetitions

GPT-2 stacks 12 to 48 blocks depending on model size:

Block 1 [7, 512] → Block 2 [7, 512] → ... → Block 12 [7, 512]

Each block:

  • preserves the shape [seq, d_model]
  • refines the representation with another round of Attention + FFN
  • builds increasingly abstract features as depth increases

Early blocks tend to handle syntax and local patterns. Later blocks handle longer-range dependencies and more abstract semantics. This is not a design decision — it emerged from training.

15.5.2 Where All the Parameters Live

Full architecture with parameter locations annotated
Component         Parameter formula    Example (d_model=512, vocab=100,256)
Word Embedding    vocab × d_model      ~51M
Attention (×12)   4 × d_model² × 12    ~12.6M
FFN (×12)         8 × d_model² × 12    ~25.2M
Output Linear     d_model × vocab      ~51M

15.6 Step 8: Output Mapping

15.6.1 Final LayerNorm

After all blocks, one more LayerNorm before projection:

Block 12 output [7, 512]
    | LayerNorm
Normalized output [7, 512]

15.6.2 Linear Layer: Projecting to Vocabulary

Output linear projection to vocabulary

The key step: map the 512-dimensional hidden vector to a 100,256-dimensional logit vector.

Input [batch, seq, d_model] = [4, 7, 512]
    | @ Wp [d_model, vocab_size]
Output [batch, seq, vocab_size] = [4, 7, 100256]

15.6.3 What the Wp Matrix Means

Wp matrix multiplication

Think of Wp as: every token in the vocabulary has a d_model-dimensional signature vector. The output logit for token i is the dot product between the current hidden state and token i's signature — a similarity score.

Wp explanation

High dot product → high logit → model thinks this token is a likely next token.
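A sketch of the projection, with random weights and vocab_size scaled down to 10,000 to keep it light:

```python
import numpy as np

d_model, vocab_size = 512, 10_000   # scaled-down vocab for this sketch
rng = np.random.default_rng(0)

Wp = rng.normal(size=(d_model, vocab_size))  # one "signature" column per token
h = rng.normal(size=(d_model,))              # hidden state of the last position

logits = h @ Wp                    # (vocab_size,): one dot product per token
best = int(np.argmax(logits))      # token whose signature best matches h
print(logits.shape, best)
```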

15.6.4 Softmax to Probabilities

Final probability distribution
logits [7, 100256]
    | Softmax (over last dimension)
probs [7, 100256]

Now every position has a probability distribution over the vocabulary:

  • all probabilities sum to 1
  • the highest probability token is the model's prediction
  • the full distribution is what sampling strategies use
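A numerically stable softmax over the last dimension, with random logits standing in for model output:

```python
import numpy as np

def softmax(logits):
    """Softmax over the last dimension; subtracting the max avoids overflow."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.random.default_rng(0).normal(size=(7, 100_256))
probs = softmax(logits)
print(probs.sum(axis=-1))          # every position sums to 1
predicted = int(probs[-1].argmax())  # greedy next-token pick from last position
```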

15.7 Full Shape Tracking

15.7.1 From Input to Output

token_ids:               [batch=4, seq=7]

Steps 1-2: Embedding:   [4, 7, 512]
Step 3: + Position:     [4, 7, 512]

Steps 4-6: Blocks 1-12: [4, 7, 512]  <- shape never changes!

Step 7: Final LayerNorm:[4, 7, 512]

Step 8: Linear:         [4, 7, 100256]
        Softmax:        [4, 7, 100256]  <- now probabilities

Take last position:     [4, 100256]
argmax:                 [4]  <- predicted token ID per sequence

The dimension is stable at d_model throughout every block. It only explodes to vocab_size at the very end.

15.7.2 Key Dimension Parameters

Parameter    Meaning            GPT-2 Small    GPT-2 Large
d_model      model width        768            1280
n_layers     block count        12             36
n_heads      attention heads    12             20
d_ff         FFN hidden dim     3072           5120
vocab_size   vocabulary size    50,257         50,257

15.8 Parameter Count

15.8.1 Per-Component Breakdown

Using GPT-2 Small as the example (d_model=768, n_layers=12, vocab_size=50,257):

Component            Formula                        Parameters
Token Embedding      vocab × d_model                ~38.6M
Position Embedding   max_len × d_model              ~0.8M
Attention (×12)      4 × d_model² × 12              ~28.3M
FFN (×12)            2 × d_model × d_ff × 12        ~56.6M
LayerNorm (×25)      2 × d_model × 25               ~0.04M
Output Projection    shared with Token Embedding    0*

*Output projection usually shares weights with the token embedding table (weight tying). This is not an optimization — it is a modeling choice that says "the same geometry that encodes token meaning should also be used to score token likelihood."

Total: approximately 124 million parameters.
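The table above can be re-derived in a few lines of plain Python (biases ignored, weight tying assumed, so the output projection adds nothing):

```python
# GPT-2 Small configuration
d_model, n_layers, d_ff = 768, 12, 3072
vocab_size, max_len = 50_257, 1024

tok_emb = vocab_size * d_model            # ~38.6M, shared with output projection
pos_emb = max_len * d_model               # ~0.8M
attn    = 4 * d_model**2 * n_layers       # Wq, Wk, Wv, Wo per block
ffn     = 2 * d_model * d_ff * n_layers   # two linear layers per block
ln      = 2 * d_model * (2 * n_layers + 1)  # gamma+beta: 2 per block + 1 final

total = tok_emb + pos_emb + attn + ffn + ln
print(f"{total / 1e6:.1f}M")  # 124.4M
```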

15.8.2 Parameter Distribution

Embedding:  ~31%  ||||||||
Attention:  ~23%  ||||||
FFN:        ~46%  ||||||||||||
LayerNorm:  <1%

FFN holds nearly half the parameters. This is why people say FFN layers store the model's knowledge — there is simply more room there.


15.9 Backpropagation During Training

15.9.1 Loss Function

Backpropagation update flow

During training, we know the target (the actual next token), so we can compute cross-entropy loss:

Loss = CrossEntropy(predicted_probs, target_token)
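For a single position, cross-entropy reduces to the negative log-probability the model assigned to the true next token. A toy example with an invented 4-token distribution:

```python
import math

def cross_entropy(probs: list[float], target: int) -> float:
    """Negative log-probability of the true next token."""
    return -math.log(probs[target])

# Hypothetical 4-token vocabulary; the model puts 0.7 on token 1.
probs = [0.1, 0.7, 0.15, 0.05]
loss_confident = cross_entropy(probs, target=1)  # correct token: low loss
loss_wrong = cross_entropy(probs, target=3)      # unlikely token: high loss
print(loss_confident, loss_wrong)
```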

15.9.2 Gradient Flow

The loss propagates backward through every component:

Loss
 |
Output Projection (Wp) <- update
 |
LayerNorm <- update
 |
Block 12 (Attention, FFN) <- update
 |
...
 |
Block 1 <- update
 |
Embeddings <- update

Residual connections are critical here. They provide gradient highways that bypass each sub-layer, preventing the vanishing gradients that would otherwise stall training at depth.


15.10 Chapter Summary

15.10.1 Eight-Step Forward Pass

Step   Operation          What happens
1      Tokenization       text → token IDs
2      Embedding          IDs → vectors [seq, d_model]
3      + Position         add positional signal
4      Attention          capture token relationships
5      Residual + Norm    stabilize training
6      FFN                feature transformation
7      × N Blocks         repeat steps 4-6
8      Linear + Softmax   output probability distribution

15.10.2 Parameter Distribution

Component   Share of parameters   Role
Embedding   ~30%                  semantic token representations
Attention   ~25%                  capturing inter-token relationships
FFN         ~45%                  knowledge storage, feature transformation

15.10.3 Core Insight

The Transformer forward pass is an elegant pipeline. Tokens become vectors, pass through N layers of Attention (context understanding) plus FFN (feature extraction), then project to a vocabulary-sized probability distribution. The shape stays fixed at d_model through every block — only the final projection breaks the invariant.


Chapter Checklist

After this chapter you should be able to:

  • Describe all eight steps of the Transformer forward pass.
  • Track tensor shapes from token IDs through to logits.
  • Explain what the causal mask does and why it is necessary.
  • Explain why FFN accounts for nearly half of all parameters.
  • Estimate per-component parameter counts given d_model, n_layers, and vocab_size.

Code Implementation

The complete forward pass described here is implemented step by step in Part 5 (Chapters 18-20):

  • Chapter 18: model.py — model definition
  • Chapter 19: train.py — training loop
  • Chapter 20: inference.py — inference logic

See You in the Next Chapter

That is the complete forward pass. If you can trace a tensor from input text to output probabilities without looking at the diagram, you are ready for Chapter 16.

Chapter 16 compares training and inference — the same forward pass, but operating in two very different modes. Understanding that distinction is where a lot of production confusion lives.

Cite this page
Zhang, Wayland (2026). Chapter 15: Full Transformer Forward Pass - From Input to Output. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-15-full-forward-pass
@incollection{zhang2026transformer_chapter_15_full_forward_pass,
  author = {Zhang, Wayland},
  title = {Chapter 15: Full Transformer Forward Pass - From Input to Output},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-15-full-forward-pass}
}