One-sentence summary: Training sees the full sequence and updates parameters in parallel; inference sees only the prompt and must generate token by token — this autoregressive constraint is what makes GPT produce coherent text and what makes inference the latency bottleneck.
16.1 Training vs Inference: The Core Difference
16.1.1 Side-by-Side Comparison
| Aspect | Training | Inference |
|---|---|---|
| Goal | learn parameters | generate text |
| Input | full text sequence | initial prompt |
| Targets | known (next token at each position) | unknown (must be predicted) |
| Processing | parallel (one pass, all positions) | serial (one token per pass) |
| Parameter updates | yes | no |
16.1.2 Why the Difference Exists
During training:
- We have the complete text, say: "The agent opened a pull request for review"
- We know the correct next token at every position
- We can compute loss over all positions simultaneously
During inference:
- We only have the prompt: "The agent opened a pull request"
- We do not know what comes next
- We must predict one token, observe it, then predict the next
That asymmetry — knowing the answer versus not knowing it — is the entire explanation.
16.2 Training in Detail
16.2.1 Teacher Forcing
Training uses a technique called Teacher Forcing:
Input: The agent opened a pull request
Target: agent opened a pull request for
The input is the original sequence. The target is the original sequence shifted right by one position.
Every position is simultaneously predicting its successor:
- position 0 ("The") → predict "agent"
- position 1 ("agent") → predict "opened"
- position 2 ("opened") → predict "a"
- ...
- position 5 ("request") → predict "for"
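The shifted input/target pair above takes only a couple of lines to build. A minimal sketch, using made-up token IDs standing in for the example sentence:

```python
import torch

# hypothetical token IDs for "The agent opened a pull request for"
token_ids = torch.tensor([101, 202, 303, 404, 505, 606, 707])

# teacher forcing: input is the sequence, target is the same
# sequence shifted right by one position
input_ids = token_ids[:-1]   # "The agent opened a pull request"
target_ids = token_ids[1:]   # "agent opened a pull request for"

# position i of the input predicts position i of the target
for i in range(len(input_ids)):
    print(f"position {i}: {input_ids[i].item()} -> {target_ids[i].item()}")
```

Every position's (input, target) pair exists before training starts, which is exactly what allows the parallel computation described next.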
16.2.2 Parallel Computation
Because all inputs and targets are known, we compute the whole sequence in one forward pass:
```python
# training step sketch
import torch.nn.functional as F

def train_step(model, optimizer, input_ids, target_ids):
    # one forward pass for the entire sequence
    logits = model(input_ids)  # [batch, seq_len, vocab_size]
    # compute loss at every position simultaneously
    vocab_size = logits.size(-1)
    loss = F.cross_entropy(
        logits.view(-1, vocab_size),
        target_ids.view(-1),
    )
    # backprop
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
One forward pass. Every position. This is what makes training computationally efficient.
16.2.3 The Role of the Causal Mask
Even though training sees the entire sequence at once, each position is only allowed to attend to positions that come before it. The future must remain invisible.
This is the Causal Mask from Chapter 15:
Position 0 sees: [The, -, -, -, -, -]
Position 1 sees: [The, agent, -, -, -, -]
Position 2 sees: [The, agent, opened, -, -, -]
...
The mask enforces the same information constraint that exists at inference time. Without it, training would be cheating: position 3 would see its own target and produce trivially low loss.
16.3 Inference in Detail
16.3.1 Autoregressive Generation
At inference time, the model generates one token per forward pass:
Prompt: "The agent opened a pull request"
|
model forward pass
|
Output distribution: [for=18%, to=12%, ...]
|
sample "for"
|
New input: "The agent opened a pull request for"
|
model forward pass
|
Output distribution: [review=34%, approval=15%, ...]
|
sample "review"
This is autoregressive generation: each step's output becomes the next step's input.
16.3.2 Step-by-Step Example
Generating from the prompt "The agent opened a pull request":
Step 1:
Input: "The agent opened a pull request"
Predict: "for"
Step 2:
Input: "The agent opened a pull request for"
Predict: "review"
Step 3:
Input: "The agent opened a pull request for review"
Predict: "."
Step 4:
Input: "The agent opened a pull request for review."
Predict: <end-of-sequence> or next sentence begins
This continues until the model generates an end-of-sequence token or hits max_new_tokens.
16.3.3 Context Length and Truncation
Every model has a maximum context length, for example context_length = 2048. If the running sequence exceeds that limit, the model truncates:
- keep only the most recent context_length tokens
- discard earlier tokens
This is the "context window" limit you encounter in every LLM API. The model has not "forgotten" earlier tokens in any psychological sense — they simply were never in the input to this forward pass.
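Truncation is a single slice over the token dimension. A sketch using an illustrative context_length of 8 rather than a realistic 2048:

```python
import torch

context_length = 8  # illustrative; real models use 2048 or more

# a running sequence that has grown past the limit (12 tokens)
generated = torch.arange(12).unsqueeze(0)  # shape [1, 12]

# keep only the most recent context_length tokens; earlier tokens are
# simply absent from the next forward pass
input_ids = generated[:, -context_length:]
print(input_ids.shape)       # torch.Size([1, 8])
print(input_ids[0].tolist()) # [4, 5, 6, 7, 8, 9, 10, 11]
```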
16.4 Inference Code
16.4.1 Basic Inference Loop
```python
# inference sketch
import torch

def generate(model, prompt_ids, max_new_tokens=50,
             context_length=2048, eos_token_id=None):
    """
    Autoregressive text generation.

    Args:
        model: the GPT model
        prompt_ids: initial prompt token IDs [1, seq_len]
        max_new_tokens: maximum number of new tokens to generate
        context_length: maximum context length the model supports
        eos_token_id: optional end-of-sequence token ID
    """
    model.eval()  # switch to inference mode (disables Dropout)
    generated = prompt_ids.clone()
    for _ in range(max_new_tokens):
        # truncate if over max context length
        input_ids = generated[:, -context_length:]
        # forward pass
        with torch.no_grad():  # no gradient computation
            logits = model(input_ids)
        # take last position logits only
        next_token_logits = logits[:, -1, :]  # [1, vocab_size]
        # greedy: pick the highest probability token
        next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
        # append to sequence
        generated = torch.cat([generated, next_token], dim=1)
        # stop at end-of-sequence
        if eos_token_id is not None and next_token.item() == eos_token_id:
            break
    return generated
```
16.4.2 Key Points
- model.eval(): puts the model in evaluation mode, disabling Dropout so output is stable
- torch.no_grad(): skips gradient tracking and saves memory
- logits[:, -1, :]: only the final position's logits are needed to predict the next token
- Loop: each new token extends the sequence by one position
16.5 Padding and Batch Inference
16.5.1 What Is Padding?
Production systems run multiple requests in parallel. Different prompts have different lengths, so sequences are padded to the same length:
Request 1: "The agent opened a pull request" (6 tokens)
Request 2: "Summarize" (1 token)
After padding to length 6:
Request 1: "The agent opened a pull request"
Request 2: "<pad> <pad> <pad> <pad> <pad> Summarize"
16.5.2 Handling Padding
The attention computation must ignore pad positions:
```python
# attention_mask: 1 = real token, 0 = padding
attention_mask = (input_ids != pad_token_id).long()
# pad positions receive zero attention weight
```
The attention computation uses this mask to force zero weight on padding positions, and the output at pad positions is discarded.
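One common way to wire the mask into attention is to set scores at pad key positions to -inf before softmax, the same trick the causal mask uses. A minimal sketch with a hypothetical pad_token_id of 0 and made-up token IDs for the two requests above:

```python
import torch
import torch.nn.functional as F

pad_token_id = 0  # hypothetical pad ID

# batch of two left-padded sequences, IDs are illustrative
input_ids = torch.tensor([
    [11, 12, 13, 14, 15, 16],  # "The agent opened a pull request"
    [0,  0,  0,  0,  0,  21],  # "<pad> <pad> <pad> <pad> <pad> Summarize"
])

# 1 = real token, 0 = padding
attention_mask = (input_ids != pad_token_id).long()

# toy attention scores: [batch, query, key]
scores = torch.randn(2, 6, 6)

# set scores at pad key positions to -inf so softmax gives them zero weight
scores = scores.masked_fill(attention_mask[:, None, :] == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)

# request 2: every query puts all its weight on the single real token
print(weights[1, :, -1])  # all ones
```

In a real model this padding mask is combined with the causal mask; the sketch isolates the padding part only.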
16.6 Training vs Inference: Computation Comparison
16.6.1 Data Flow
During training:
Full sequence [seq_len]
| one forward pass
All-position predictions [seq_len, vocab]
| compare with targets
Loss
| backprop
Update parameters
During inference:
Prompt [n]
| forward pass
Predict position n+1
| sample token
New sequence [n+1]
| forward pass
Predict position n+2
| sample token
... (loop until done)
16.6.2 Efficiency Comparison
| Aspect | Training | Inference |
|---|---|---|
| Forward passes per sequence | 1 | N (N = generation length) |
| Parallelism | high (all positions at once) | low (serial by definition) |
| Bottleneck | memory (storing gradients) | latency (repeated forward passes) |
This is why inference needs optimization techniques like KV Cache — which Chapter 22 covers in detail.
16.6.3 Dropout Behavior
| Aspect | Training | Inference |
|---|---|---|
| Dropout | active (random drop) | disabled |
| Why | prevents overfitting | ensures stable, deterministic output |
```python
model.train()  # Dropout active
model.eval()   # Dropout disabled — do not forget this before sampling
```
I have seen bugs where the model produced different outputs on every call because someone left it in training mode. model.eval() is a one-liner that costs nothing.
16.7 Decoding Strategies
16.7.1 How to Choose the Next Token
Given a probability distribution over the vocabulary, how do we select the next token?
Greedy Decoding:
```python
next_token = torch.argmax(probs, dim=-1)  # always pick the highest
```
- Deterministic, fast
- Tends toward repetitive, safe text
Sampling:
```python
next_token = torch.multinomial(probs, num_samples=1)  # sample from distribution
```
- More creative and varied
- Can occasionally produce incoherent sequences
Top-K Sampling:
```python
# sample only from the K highest-probability tokens
top_k_probs, top_k_indices = torch.topk(probs, k=50)
next_token = top_k_indices[torch.multinomial(top_k_probs, 1)]
```
Top-P (Nucleus) Sampling:
```python
# sample from the smallest set of tokens whose cumulative probability >= P
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumsum = torch.cumsum(sorted_probs, dim=-1)
mask = cumsum <= 0.9           # P = 0.9
mask[..., 0] = True            # always keep at least the top token
sorted_probs[~mask] = 0.0      # zero out everything outside the nucleus
idx = torch.multinomial(sorted_probs, 1)  # sample within the mask
next_token = sorted_indices[idx]
```
16.7.2 Temperature
Temperature controls distribution sharpness:
```python
probs = F.softmax(logits / temperature, dim=-1)
```
- T < 1: more concentrated, more deterministic
- T = 1: standard distribution
- T > 1: flatter distribution, more randomness
Most production LLM APIs expose temperature as a parameter. The default is usually 1.0 or close to it.
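A quick demonstration of how temperature reshapes the same distribution; the logits below are made up for illustration:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

for temperature in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")

# T < 1 concentrates mass on the top token; T > 1 spreads it out
```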
16.8 Why Autoregressive?
16.8.1 Language Is Ordered
"The agent approved the pull request" and "The pull request approved the agent" use the same words but mean entirely different things. Language depends on order.
Autoregressive generation preserves that:
- each token is conditioned on all previous tokens
- the generated text is coherent with what came before
- the model can adapt mid-generation as the context grows
16.8.2 Non-Autoregressive Models
Some models attempt to generate all tokens in parallel. They generally produce lower-quality output than autoregressive models, especially on open-ended generation tasks.
The reason is strong token-to-token dependencies. Generating "the" tells you something specific about what "agent" means in this context. Parallel generation misses that dependency.
GPT-4, Claude, LLaMA, Gemini — all autoregressive. The pattern has proven robust enough that the field has not abandoned it despite the latency cost.
16.9 Chapter Summary
16.9.1 Core Comparison
| Aspect | Training | Inference |
|---|---|---|
| Targets known? | yes | no |
| Processing | parallel | serial |
| Forward passes | 1 per sequence | N per sequence |
| Dropout | on | off |
| Parameter updates | yes | no |
16.9.2 Autoregressive Generation
Prompt -> predict token 1 -> append -> predict token 2 -> append -> ...
Each step depends on all previous tokens. That dependency is what makes output coherent, and what makes inference slow.
16.9.3 Core Insight
Training can parallelize because the answers are known; inference must be serial because each new token depends on every token before it. That serial constraint is the main inference latency bottleneck, and it is exactly why KV Cache matters.
Chapter Checklist
After this chapter you should be able to:
- Explain the core training vs inference difference in one sentence.
- Describe autoregressive generation step by step.
- Explain why inference is slower per sequence than training.
- Name at least three decoding strategies and when to use each.
See You in the Next Chapter
That is training vs inference. If you can explain why training is parallel but inference is serial without looking at the diagram, you are ready for Chapter 17.
Autoregressive generation means running the forward pass N times per output — efficiency matters. But before we get to optimization techniques like KV Cache, there is one more training concept to solidify: learning rate. Chapter 17 explains what it does, why it is the most important hyperparameter in practice, and how to configure it without guessing.