One-sentence summary: Training sees the full sequence and updates parameters in parallel; inference sees only the prompt and must generate token by token — this autoregressive constraint is what makes GPT produce coherent text and what makes inference the latency bottleneck.
16.1 Training vs Inference: The Core Difference
16.1.1 Side-by-Side Comparison
| Aspect | Training | Inference |
|---|---|---|
| Goal | learn parameters | generate text |
| Input | full text sequence | initial prompt |
| Targets | known (next token at each position) | unknown (must be predicted) |
| Processing | parallel (one pass, all positions) | serial (one token per pass) |
| Parameter updates | yes | no |
16.1.2 Why the Difference Exists
During training:
- We have the complete text, say: "The agent opened a pull request for review"
- We know the correct next token at every position
- We can compute loss over all positions simultaneously
During inference:
- We only have the prompt: "The agent opened a pull request"
- We do not know what comes next
- We must predict one token, observe it, then predict the next
That asymmetry — knowing the answer versus not knowing it — is the entire explanation.
16.2 Training in Detail
16.2.1 Teacher Forcing
Training uses a technique called Teacher Forcing:
Input: The agent opened a pull request
Target: agent opened a pull request for
The input is the original sequence. The target is the original sequence shifted right by one position.
Every position is simultaneously predicting its successor:
- position 0 ("The") → predict "agent"
- position 1 ("agent") → predict "opened"
- position 2 ("opened") → predict "a"
- ...
- position 5 ("request") → predict "for"
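The shifted input/target pair above takes only a couple of lines to build. A minimal sketch, using made-up token IDs standing in for the example sentence:

```python
import torch

# hypothetical token IDs for "The agent opened a pull request for"
token_ids = torch.tensor([101, 202, 303, 404, 505, 606, 707])

# teacher forcing: input is the sequence, target is the same
# sequence shifted right by one position
input_ids = token_ids[:-1]   # "The agent opened a pull request"
target_ids = token_ids[1:]   # "agent opened a pull request for"

# position i of the input predicts position i of the target
for i in range(len(input_ids)):
    print(f"position {i}: {input_ids[i].item()} -> {target_ids[i].item()}")
```

Every position's (input, target) pair exists before training starts, which is exactly what allows the parallel computation described next.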
16.2.2 Parallel Computation
Because all inputs and targets are known, we compute the whole sequence in one forward pass:
```python
# training step sketch
import torch.nn.functional as F

def train_step(model, optimizer, input_ids, target_ids):
    # one forward pass for the entire sequence
    logits = model(input_ids)  # [batch, seq_len, vocab_size]
    # compute loss at every position simultaneously
    vocab_size = logits.size(-1)
    loss = F.cross_entropy(
        logits.view(-1, vocab_size),
        target_ids.view(-1),
    )
    # backprop
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
One forward pass. Every position. This is what makes training computationally efficient.
16.2.3 The Role of the Causal Mask
Even though training sees the entire sequence at once, each position is only allowed to attend to positions that come before it. The future must remain invisible.
This is the Causal Mask from Chapter 15:
Position 0 sees: [The, -, -, -, -, -]
Position 1 sees: [The, agent, -, -, -, -]
Position 2 sees: [The, agent, opened, -, -, -]
...
The mask enforces the same information constraint that exists at inference time. Without it, training would be cheating: position 3 would see its own target and produce trivially low loss.
16.3 Inference in Detail
16.3.1 Autoregressive Generation
At inference time, the model generates one token per forward pass:
Prompt: "The agent opened a pull request"
|
model forward pass
|
Output distribution: [for=18%, to=12%, ...]
|
sample "for"
|
New input: "The agent opened a pull request for"
|
model forward pass
|
Output distribution: [review=34%, approval=15%, ...]
|
sample "review"
This is autoregressive generation: each step's output becomes the next step's input.
16.3.2 Step-by-Step Example
Generating from the prompt "The agent opened a pull request":
Step 1:
Input: "The agent opened a pull request"
Predict: "for"
Step 2:
Input: "The agent opened a pull request for"
Predict: "review"
Step 3:
Input: "The agent opened a pull request for review"
Predict: "."
Step 4:
Input: "The agent opened a pull request for review."
Predict: <end-of-sequence> or next sentence begins
This continues until the model generates an end-of-sequence token or hits max_new_tokens.
16.3.3 Context Length and Truncation
Every model has a maximum context length, for example context_length = 2048. If the running sequence exceeds that limit, the model truncates:
- keep only the most recent context_length tokens
- discard earlier tokens
This is the "context window" limit you encounter in every LLM API. The model has not "forgotten" earlier tokens in any psychological sense — they simply were never in the input to this forward pass.
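Truncation is a single slice over the token dimension. A sketch using an illustrative context_length of 8 rather than a realistic 2048:

```python
import torch

context_length = 8  # illustrative; real models use 2048 or more

# a running sequence that has grown past the limit (12 tokens)
generated = torch.arange(12).unsqueeze(0)  # shape [1, 12]

# keep only the most recent context_length tokens; earlier tokens are
# simply absent from the next forward pass
input_ids = generated[:, -context_length:]
print(input_ids.shape)       # torch.Size([1, 8])
print(input_ids[0].tolist()) # [4, 5, 6, 7, 8, 9, 10, 11]
```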
16.4 Inference Code
16.4.1 Basic Inference Loop
```python
# inference sketch
import torch

def generate(model, prompt_ids, max_new_tokens=50,
             context_length=2048, eos_token_id=None):
    """
    Autoregressive text generation.

    Args:
        model: the GPT model
        prompt_ids: initial prompt token IDs [1, seq_len]
        max_new_tokens: maximum number of new tokens to generate
        context_length: maximum context length the model supports
        eos_token_id: optional end-of-sequence token ID
    """
    model.eval()  # switch to inference mode (disables Dropout)
    generated = prompt_ids.clone()
    for _ in range(max_new_tokens):
        # truncate if over max context length
        input_ids = generated[:, -context_length:]
        # forward pass
        with torch.no_grad():  # no gradient computation
            logits = model(input_ids)
        # take last position logits only
        next_token_logits = logits[:, -1, :]  # [1, vocab_size]
        # greedy: pick the highest probability token
        next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
        # append to sequence
        generated = torch.cat([generated, next_token], dim=1)
        # stop at end-of-sequence
        if eos_token_id is not None and next_token.item() == eos_token_id:
            break
    return generated
```
16.4.2 Key Points
- model.eval(): puts the model in evaluation mode, disabling Dropout so output is stable
- torch.no_grad(): skips gradient tracking and saves memory
- logits[:, -1, :]: only the final position's logits are needed to predict the next token
- Loop: each new token extends the sequence by one position
16.5 Padding and Batch Inference
16.5.1 What Is Padding?
Production systems run multiple requests in parallel. Different prompts have different lengths, so sequences are padded to the same length:
Request 1: "The agent opened a pull request" (6 tokens)
Request 2: "Summarize" (1 token)
After padding to length 6:
Request 1: "The agent opened a pull request"
Request 2: "<pad> <pad> <pad> <pad> <pad> Summarize"
16.5.2 Handling Padding
The attention computation must ignore pad positions:
```python
# attention_mask: 1 = real token, 0 = padding
attention_mask = (input_ids != pad_token_id).long()
# pad positions receive zero attention weight
```
The attention computation uses this mask to force zero weight on padding positions, and the output at pad positions is discarded.
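One common way to wire the mask into attention is to set scores at pad key positions to -inf before softmax, the same trick the causal mask uses. A minimal sketch with a hypothetical pad_token_id of 0 and made-up token IDs for the two requests above:

```python
import torch
import torch.nn.functional as F

pad_token_id = 0  # hypothetical pad ID

# batch of two left-padded sequences, IDs are illustrative
input_ids = torch.tensor([
    [11, 12, 13, 14, 15, 16],  # "The agent opened a pull request"
    [0,  0,  0,  0,  0,  21],  # "<pad> <pad> <pad> <pad> <pad> Summarize"
])

# 1 = real token, 0 = padding
attention_mask = (input_ids != pad_token_id).long()

# toy attention scores: [batch, query, key]
scores = torch.randn(2, 6, 6)

# set scores at pad key positions to -inf so softmax gives them zero weight
scores = scores.masked_fill(attention_mask[:, None, :] == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)

# request 2: every query puts all its weight on the single real token
print(weights[1, :, -1])  # all ones
```

In a real model this padding mask is combined with the causal mask; the sketch isolates the padding part only.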
16.6 Training vs Inference: Computation Comparison
16.6.1 Data Flow
During training:
Full sequence [seq_len]
| one forward pass
All-position predictions [seq_len, vocab]
| compare with targets
Loss
| backprop
Update parameters
During inference:
Prompt [n]
| forward pass
Predict position n+1
| sample token
New sequence [n+1]
| forward pass
Predict position n+2
| sample token
... (loop until done)
16.6.2 Efficiency Comparison
| Aspect | Training | Inference |
|---|---|---|
| Forward passes per sequence | 1 | N (N = generation length) |
| Parallelism | high (all positions at once) | low (serial by definition) |
| Bottleneck | memory (storing gradients) | latency (repeated forward passes) |
This is why inference needs optimization techniques like KV Cache — which Chapter 22 covers in detail.
16.6.3 Dropout Behavior
| Aspect | Training | Inference |
|---|---|---|
| Dropout | active (random drop) | disabled |
| Why | prevents overfitting | ensures stable, deterministic output |
```python
model.train()  # Dropout active
model.eval()   # Dropout disabled — do not forget this before sampling
```
I have seen bugs where the model produced different outputs on every call because someone left it in training mode. model.eval() is a one-liner that costs nothing.
16.7 Decoding Strategies
16.7.1 How to Choose the Next Token
Given a probability distribution over the vocabulary, how do we select the next token?
Greedy Decoding:
```python
next_token = torch.argmax(probs, dim=-1)  # always pick the highest
```
- Deterministic, fast
- Tends toward repetitive, safe text
Sampling:
```python
next_token = torch.multinomial(probs, num_samples=1)  # sample from distribution
```
- More creative and varied
- Can occasionally produce incoherent sequences
Top-K Sampling:
```python
# sample only from the K highest-probability tokens
top_k_probs, top_k_indices = torch.topk(probs, k=50)
next_token = top_k_indices[torch.multinomial(top_k_probs, 1)]
```
Top-P (Nucleus) Sampling:
```python
# sample from the smallest set of tokens whose cumulative probability >= P
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumsum = torch.cumsum(sorted_probs, dim=-1)
mask = cumsum <= 0.9           # P = 0.9
mask[..., 0] = True            # always keep at least the top token
sorted_probs[~mask] = 0.0      # zero out everything outside the nucleus
idx = torch.multinomial(sorted_probs, 1)  # sample within the mask
next_token = sorted_indices[idx]
```
16.7.2 Temperature
Temperature controls distribution sharpness:
```python
probs = F.softmax(logits / temperature, dim=-1)
```
- T < 1: more concentrated, more deterministic
- T = 1: standard distribution
- T > 1: flatter distribution, more randomness
Most production LLM APIs expose temperature as a parameter. The default is usually 1.0 or close to it.
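A quick demonstration of how temperature reshapes the same distribution; the logits below are made up for illustration:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

for temperature in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")

# T < 1 concentrates mass on the top token; T > 1 spreads it out
```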
16.8 Why Autoregressive?
16.8.1 Language Is Ordered
"The agent approved the pull request" and "The pull request approved the agent" use the same words but mean entirely different things. Language depends on order.
Autoregressive generation preserves that:
- each token is conditioned on all previous tokens
- the generated text is coherent with what came before
- the model can adapt mid-generation as the context grows
16.8.2 Non-Autoregressive Models
Some models attempt to generate all tokens in parallel. They generally produce lower-quality output than autoregressive models, especially on open-ended generation tasks.
The reason is strong token-to-token dependencies. Generating "the" tells you something specific about what "agent" means in this context. Parallel generation misses that dependency.
GPT-4, Claude, LLaMA, Gemini — all autoregressive. The pattern has proven robust enough that the field has not abandoned it despite the latency cost.
16.9 Chapter Summary
16.9.1 Core Comparison
| Aspect | Training | Inference |
|---|---|---|
| Targets known? | yes | no |
| Processing | parallel | serial |
| Forward passes | 1 per sequence | N per sequence |
| Dropout | on | off |
| Parameter updates | yes | no |
16.9.2 Autoregressive Generation
Prompt -> predict token 1 -> append -> predict token 2 -> append -> ...
Each step depends on all previous tokens. That dependency is what makes output coherent, and what makes inference slow.
16.9.3 Core Insight
Training can parallelize because the answers are known; inference must be serial because each new token depends on every token before it. That serial constraint is the main inference latency bottleneck, and it is exactly why KV Cache matters.
Chapter Checklist
After this chapter you should be able to:
- Explain the core training vs inference difference in one sentence.
- Describe autoregressive generation step by step.
- Explain why inference is slower per sequence than training.
- Name at least three decoding strategies and when to use each.
See You in the Next Chapter
That is training vs inference. If you can explain why training is parallel but inference is serial without looking at the diagram, you are ready for Chapter 17.
Autoregressive generation means running the forward pass N times per output — efficiency matters. But before we get to optimization techniques like KV Cache, there is one more training concept to solidify: learning rate. Chapter 17 explains what it does, why it is the most important hyperparameter in practice, and how to configure it without guessing.