If you can answer these questions clearly, you understand the core of the book.
C.1 Basic Concepts
Q1: GPT vs. Transformer — what is the difference?
GPT (Generative Pre-trained Transformer) is a specific model built on the Transformer architecture. Transformer is the underlying engine design — a general-purpose neural network architecture published by Google in 2017. GPT is one car built using that engine: OpenAI took the Transformer's decoder half, stacked many copies of it, and trained it on large amounts of text. The analogy holds further: BERT is a different car using the encoder half of the same engine, and T5 uses both halves together. When people say "I'm using a Transformer," they usually mean a specific product like GPT or LLaMA. When engineers say "the Transformer architecture," they mean the abstract blueprint that all of these models share.
Q2: Why is Attention called "attention"? Does the model actually pay attention?
The name comes from an analogy to human selective attention — the way you focus on certain words in a sentence when answering a question about it. The mechanism does something structurally similar: for each token, it computes a weighted sum over all other tokens, where the weights reflect relevance. High-weight tokens "get attended to" more. Whether that constitutes understanding or awareness in any cognitive sense is a separate question, and the honest answer is: no. Attention is a differentiable routing operation. It does not imply the model comprehends meaning, has intentions, or forms concepts. The name is evocative and pedagogically useful, but do not let the name suggest more than the math warrants. Chapter 3 covers the full mechanism.
Q3: What is a token? Is it a word, a character, or something else?
A token is the unit the model actually processes — and it is none of the above, exactly. Most modern LLMs use a byte-pair encoding (BPE) tokenizer, which splits text into subword pieces. The word "unbelievable" might become three tokens: ["un", "believ", "able"]. A common English word like "the" is usually one token. A rare word or a word in a low-resource language might become four or five tokens. Chinese text typically produces one or two tokens per character. The practical implication: when a model advertises a 128k context window, that is 128,000 tokens, not 128,000 words — a difference of roughly 1.3× for English prose. Token count is the right unit for thinking about cost, speed, and context limits. Chapter 2 covers tokenization in detail.
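The subword idea is easy to see with a toy longest-match tokenizer. This is an illustration only, not real BPE: actual vocabularies are learned from data by iteratively merging frequent pairs, and the VOCAB set here is hand-picked for the example.

```python
# Toy longest-match subword tokenizer. Illustrative only: real BPE
# learns its vocabulary from data rather than using a hand-picked set.
VOCAB = {"un", "believ", "able", "the"}

def tokenize(word: str) -> list[str]:
    """Greedily match the longest vocabulary piece at each position."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("the"))           # ['the']
```

Common words come out as one token; rare words fragment into several, which is exactly why token counts exceed word counts for typical English prose.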
Q4: Why generate tokens one at a time during inference? Why not all at once?
Because of the causal dependency: the probability of the next token depends on all previously generated tokens. Formally, the model is computing P(x_t | x_1, ..., x_{t-1}) at each step. To generate x_t, you must already have x_1 through x_{t-1}. You cannot generate them all simultaneously without breaking this dependency. This is the autoregressive contract: the model eats its own output as future input. KV Cache (see Q15) makes each step faster by avoiding redundant recomputation, but the sequential structure is fundamental, not an implementation choice.
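The loop structure can be sketched in a few lines. The `next_token_dist` function here is a hypothetical stand-in for the model; a real LLM would produce this distribution with a full forward pass.

```python
import random

def next_token_dist(context: list[str]) -> dict[str, float]:
    # Stand-in for the model: P(next token | context). A real LLM computes
    # this with a forward pass; this toy distribution is for illustration.
    if context and context[-1] == "the":
        return {"cat": 0.6, "dog": 0.4}
    return {"the": 0.9, "a": 0.1}

def generate(prompt: list[str], n_tokens: int, seed: int = 0) -> list[str]:
    """Autoregressive generation: each new token is sampled from a
    distribution conditioned on everything generated so far."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_tokens):
        dist = next_token_dist(tokens)  # depends on ALL prior tokens
        choices, weights = zip(*dist.items())
        tokens.append(rng.choices(choices, weights=weights)[0])
    return tokens

print(generate(["the"], 3))
```

Note that the loop body cannot be parallelized across steps: each iteration's input includes the previous iteration's output.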
C.2 Attention
Q5: Why three matrices Q, K, V — not just two or one?
The short answer: separating Q, K, and V gives the model more expressive power than any two-matrix scheme. Here is the intuition. Q ("what am I looking for?") and K ("what do I offer?") are used together to compute a relevance score between positions. V ("what do I actually contribute?") is used separately to produce the output. If you collapsed K and V into one matrix, you would be forced to use the same representation for "being found" and "contributing content." That constraint is too tight — a token might be highly relevant to a query (high attention weight) but contribute something very different from what made it relevant. The three-matrix design lets those two things vary independently. One-matrix attention exists in the literature but consistently performs worse.
Q6: Why divide by √d_k in attention scores?
To prevent softmax saturation. When you compute a dot product between two vectors of dimension d_k, the variance of the result grows proportionally to d_k — assuming the elements of Q and K are roughly standard normal. For large d_k (say, 128), the dot products can be very large in magnitude, pushing the softmax into regions where its gradient is nearly zero. That kills learning. Dividing by √d_k brings the variance back to 1, keeping softmax in a numerically healthy range. The original "Attention Is All You Need" paper derives this in one paragraph. It is one of the more elegant small decisions in the architecture.
Q7: Why multi-head? Is one big Attention head not enough?
Multiple heads let different heads specialize on different types of relationships simultaneously. One head might track syntactic dependencies (subject and verb across a clause), another might track coreference (a pronoun and its antecedent), another might track local proximity. With a single large head, all of that has to compete for the same parameter budget. The multi-head design also has a practical benefit: the total compute is similar to one large head (you split d_model across heads), but the representational diversity is higher. The number of heads is a hyperparameter — Chapter 5 covers common configurations. The rule of thumb is head_dim = 64 or 128, with n_heads = d_model / head_dim.
Q8: What does the causal mask actually do?
It prevents tokens from attending to future tokens during training. Without the mask, a token at position t could see the token it is supposed to be predicting — which would make the training loss trivially small but the model useless at inference time. The mask is implemented by setting the attention logits for future positions to −∞ before the softmax, which drives those attention weights to zero. The result is that each token can only attend to itself and tokens to its left. This is what makes the architecture "autoregressive" — each position attends causally. Chapter 5 covers this in detail, including why you want the mask applied before softmax rather than after.
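The −∞ trick is two lines of code. A sketch on a 3×3 logit matrix, with every logit set to zero so the effect of the mask is easy to read off:

```python
import math

def causal_mask(logits: list[list[float]]) -> list[list[float]]:
    """Set logits for future positions (j > i) to -inf, so that after
    softmax each position attends only to itself and positions to its left."""
    n = len(logits)
    return [[logits[i][j] if j <= i else -math.inf for j in range(n)]
            for i in range(n)]

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) == 0.0
    s = sum(exps)
    return [e / s for e in exps]

masked = causal_mask([[0.0] * 3 for _ in range(3)])
for row in masked:
    print(softmax(row))
# Row 0 puts all weight on position 0; row 2 spreads it over 0, 1, 2.
```

Masking before the softmax is what guarantees the excluded positions receive exactly zero weight and the remaining weights still sum to 1; zeroing weights after the softmax would break the normalization.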
Q9: Why is Attention computation O(n²) in sequence length?
Because every token attends to every other token. For a sequence of length n, you compute an n × n attention matrix — one score per pair of positions. Both the compute and memory scale as O(n²). For n = 1,000, that is a million entries. For n = 100,000, it is ten billion entries — which is why naive Attention does not scale to very long contexts. FlashAttention addresses the memory side by computing the matrix in tiles without materializing it fully. Sparse Attention approximations address the compute side by skipping many pairs. For most practical purposes up to ~8k tokens, standard Attention is fine.
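The quadratic growth is worth seeing as arithmetic. A back-of-the-envelope sketch, assuming one half-precision (2-byte) score per pair, for a single attention matrix (one head, one layer):

```python
def attention_matrix_bytes(n_tokens: int, bytes_per_entry: int = 2) -> int:
    """Memory for one full n x n attention matrix in half precision."""
    return n_tokens * n_tokens * bytes_per_entry

for n in (1_000, 8_000, 100_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n:>7,}: {n*n:>15,} entries, {gib:8.2f} GiB (one head, one layer)")
```

Multiply by the number of heads and layers and the problem at long context becomes obvious, and it is what FlashAttention's tiling avoids materializing.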
C.3 Training
Q10: Why teacher forcing during training? Why not just sample from the model?
Teacher forcing feeds the model the correct previous tokens at each position during training, rather than the model's own predictions. This has two large practical benefits. First, errors do not compound — if the model predicts a wrong token at step 5, that mistake does not cascade into positions 6 through 100. Second, it enables full parallelism: because the inputs are fixed (the ground truth), you can compute the loss at all positions simultaneously with a single forward pass and a causal mask. The downside is exposure bias: at inference time the model sees its own (potentially wrong) outputs, which is a distribution it never saw during training. This gap is real but manageable in practice — teacher forcing is used in almost all LLM training because the benefits so far outweigh the costs.
Q11: What is a loss function intuitively? What is cross-entropy loss?
The loss function is the number the training process tries to minimize — a measure of how wrong the model currently is. For language modeling, the standard choice is cross-entropy loss:

L = -(1/N) Σ_t log p(x_t | x_<t)

The intuition: if the model assigns probability 0.9 to the correct token, the loss is small (−log 0.9 ≈ 0.1). If it assigns probability 0.01, the loss is large (−log 0.01 ≈ 4.6). Averaged over all positions in a batch, the cross-entropy loss also converts directly to perplexity: perplexity = e^loss. A lower perplexity means the model is less surprised by the data. Chapter 8 covers training objectives in detail.
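These numbers are a one-liner to verify. A toy calculation over hand-picked probabilities, not tied to any particular model:

```python
import math

def cross_entropy(p_correct: list[float]) -> float:
    """Average negative log-probability assigned to the correct tokens."""
    return -sum(math.log(p) for p in p_correct) / len(p_correct)

confident = [0.9, 0.8, 0.95]     # model usually right
uncertain = [0.01, 0.02, 0.05]   # model usually surprised

for probs in (confident, uncertain):
    loss = cross_entropy(probs)
    print(f"loss = {loss:.3f}, perplexity = {math.exp(loss):.1f}")
```

The confident model's perplexity lands near 1 ("barely surprised"); the uncertain one's is in the dozens.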
Q12: How much training data is enough?
The Chinchilla paper (Hoffmann et al., 2022) gave a practical answer: for a compute-optimal run, use roughly 20 tokens of training data per model parameter. A 7B-parameter model, by this heuristic, wants ~140B tokens. Modern practice often exceeds this — LLaMA 3 trained on 15T tokens, far beyond Chinchilla-optimal for its size, because data is cheap relative to the inference cost savings of a smaller model. The practical constraint is usually not how much data you have, but how much compute you can afford for training. If you are fine-tuning rather than pre-training from scratch, the numbers are dramatically smaller — a few hundred thousand examples can meaningfully shift behavior.
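The heuristic is one line of arithmetic, which makes it easy to sanity-check model announcements against:

```python
def chinchilla_tokens(n_params: float) -> float:
    """Compute-optimal training tokens per the ~20 tokens/parameter heuristic
    (Hoffmann et al., 2022). A rule of thumb, not a law."""
    return 20 * n_params

for params in (1e9, 7e9, 70e9):
    print(f"{params/1e9:>4.0f}B params -> ~{chinchilla_tokens(params)/1e9:,.0f}B tokens")
```

By this rule a 70B model "wants" ~1.4T tokens; LLaMA 3's 15T-token run is roughly an order of magnitude beyond that, deliberately.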
Q13: Why warm up the learning rate? What happens without it?
At the start of training, model parameters are initialized randomly, and the gradient signal is noisy and poorly conditioned. If you start with your full learning rate immediately, the optimizer can make very large updates in poorly-constrained directions, destabilizing the loss or even causing NaN values. Warmup — linearly increasing the learning rate from near zero to the target over the first few thousand steps — gives the model a chance to establish reasonable gradient directions before the optimizer starts taking large steps. Without warmup, runs with large learning rates frequently diverge in the first few hundred steps. With warmup, the same learning rate is stable. The cost is a slightly slower start; the benefit is much higher reliability on large-scale runs.
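A common concrete recipe is linear warmup followed by cosine decay. The hyperparameters below (3e-4 peak, 2,000 warmup steps) are illustrative defaults, not prescriptions:

```python
import math

def lr_schedule(step: int, max_lr: float = 3e-4,
                warmup_steps: int = 2_000, total_steps: int = 100_000) -> float:
    """Linear warmup to max_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps            # linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

print(lr_schedule(0))        # tiny first step
print(lr_schedule(2_000))    # peak learning rate
print(lr_schedule(100_000))  # decayed to ~0
```

The first few updates happen at a learning rate thousands of times smaller than the peak, which is exactly the protection against early divergence described above.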
C.4 Inference
Q14: Why does generating one token at a time feel slow?
Three reasons compound. First, the autoregressive structure means you cannot start generating token t+1 until token t is done. Second, each step requires a full forward pass through all layers of the model — for a 70B-parameter model, that is substantial compute. Third, large models are often memory-bandwidth-bound rather than compute-bound: the bottleneck is reading the model parameters from GPU memory, not the arithmetic itself. KV Cache (Q15) removes one large source of redundant work, but it does not change the sequential structure. Batch inference (processing multiple requests together) improves GPU utilization but does not help latency for a single user. Speculative decoding is a recent technique that addresses the sequential bottleneck by drafting multiple tokens with a small model and verifying in parallel — it can give a 2–3× speedup on suitable hardware.
Q15: What is KV Cache and when should I worry about it?
At each generation step, Attention needs the keys and values for all previous tokens. Without a cache, you recompute them from scratch every step — generating the 100th token means redundantly recomputing K and V for all 99 previous tokens. KV Cache stores the keys and values from previous steps and appends new ones, reducing each step to computing only the new token's K and V. This changes the per-step K/V cost from O(t) to O(1) in the number of previously generated tokens. The cost: memory. A single layer's KV Cache is 2 × n_heads × head_dim × context_length values. For a long-context model with many layers, KV Cache can consume tens of gigabytes of GPU memory. You should worry about it when you are running inference at scale, choosing between MHA and GQA/MQA, or reasoning about why your server runs out of memory under load. Chapter 9 covers this in detail.
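The cache itself is structurally trivial, which is part of why it is universal. A minimal sketch with toy placeholder vectors standing in for real key/value tensors:

```python
# Minimal KV Cache sketch: at each step, only the NEW token's key and
# value are computed; everything earlier is reused from the cache.
class KVCache:
    def __init__(self):
        self.keys: list[list[float]] = []
        self.values: list[list[float]] = []

    def append(self, k: list[float], v: list[float]) -> None:
        self.keys.append(k)
        self.values.append(v)

def decode_step(cache: KVCache, new_k: list[float], new_v: list[float]) -> int:
    """One generation step: O(1) new K/V work instead of recomputing
    K and V for the entire prefix. Returns the cached sequence length."""
    cache.append(new_k, new_v)
    # Attention would now read cache.keys / cache.values for all positions.
    return len(cache.keys)

cache = KVCache()
for t in range(5):
    n = decode_step(cache, [float(t)], [float(t)])
print(n)  # 5 cached positions after 5 steps
```

The memory cost is visible here too: the lists grow linearly with generated length, per layer, per head, which is exactly the tens-of-gigabytes figure at long context.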
Q16: What does "temperature" actually do?
Temperature scales the logits before the softmax:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)
At T = 1, you get the raw model distribution. At T < 1, the distribution sharpens — the highest-probability tokens get even more weight, and the output becomes more deterministic. At T > 1, the distribution flattens — lower-probability tokens get more weight, and the output becomes more varied and sometimes incoherent. At T → 0, you get greedy decoding: always pick the highest-probability token. In practice: use low temperature (0.2–0.5) for code or math where correctness matters; use higher temperature (0.7–1.0) for creative writing or open-ended dialogue. Temperature does not add creativity out of nowhere — it only adjusts how much the model leans into its own uncertainty.
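The sharpening and flattening are easy to see numerically on a hand-made three-token logit vector:

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def apply_temperature(logits: list[float], T: float) -> list[float]:
    """Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens."""
    return softmax([z / T for z in logits])

logits = [2.0, 1.0, 0.1]
for T in (0.2, 1.0, 2.0):
    probs = apply_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

At T = 0.2 the top token takes nearly all the mass; at T = 2.0 the three probabilities move toward uniform. The logits themselves never change, only how decisively they are converted to probabilities.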
Q17: What is the difference between Top-K, Top-P, and beam search?
These are three different strategies for choosing the next token from the model's probability distribution. Greedy decoding picks the single most likely token — fast but often repetitive. Top-K sampling restricts sampling to the K highest-probability tokens and renormalizes. Top-P (nucleus) sampling restricts to the smallest set of tokens whose cumulative probability exceeds P — so the effective K adapts to the model's confidence. Beam search maintains multiple candidate sequences simultaneously, expanding and pruning at each step to find high-probability full sequences. The right choice depends on the task: beam search works well for translation and summarization where you want accuracy; Top-P sampling works well for dialogue and creative writing where diversity matters. Most deployed LLM APIs (including ChatGPT and Claude) use sampling with Top-P, not beam search.
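Top-P is the least obvious of the three to implement, so here is a minimal sketch: sort by probability, keep tokens until the cumulative mass exceeds P, renormalize. (The toy distribution is hand-made for the example.)

```python
def top_p_filter(probs: dict[str, float], p: float = 0.9) -> dict[str, float]:
    """Nucleus filtering: keep the smallest set of tokens whose cumulative
    probability reaches p, then renormalize so the kept set sums to 1."""
    kept, cum = {}, 0.0
    for tok, pr in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[tok] = pr
        cum += pr
        if cum >= p:
            break
    total = sum(kept.values())
    return {tok: pr / total for tok, pr in kept.items()}

dist = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
print(top_p_filter(dist, p=0.9))  # 'zzz' is dropped; the rest renormalized
```

Note the adaptivity: on a confident distribution (one token at 0.95) the nucleus is a single token, while on a flat distribution it may include dozens. Top-K with a fixed K cannot do that.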
C.5 Architecture Details
Q18: LayerNorm vs. BatchNorm — why LayerNorm in Transformers?
BatchNorm normalizes across the batch dimension — it computes statistics over all examples in a batch at a given position. That works well for image models with fixed-size inputs, but it creates two problems for Transformers. First, sequence lengths vary, so there is no stable "same position" across batch elements to normalize over. Second, large model training often uses very small per-device batch sizes, making BatchNorm's statistics noisy and unstable. LayerNorm normalizes across the feature dimension for each token independently — no batch statistics needed, no sequence-length constraint. It also behaves identically at training and inference time, which matters for correctness. LayerNorm has been standard in Transformers since the original paper; the main modern refinement is RMSNorm (used by LLaMA and Mistral), which drops the mean-centering step for a small efficiency gain while keeping the same per-token normalization idea.
Q19: Pre-Norm vs. Post-Norm — which is better?
The original 2017 Transformer used Post-Norm: x = LayerNorm(x + SubLayer(x)). Most modern LLMs use Pre-Norm: x = x + SubLayer(LayerNorm(x)). The difference matters for training stability in deep networks. With Post-Norm, the residual stream is normalized at the output of each layer — which helps at shallow depth but makes gradient flow through many layers harder. With Pre-Norm, the residual stream is never normalized on the main path, allowing cleaner gradient flow through the entire stack. Pre-Norm is now the default in almost every serious model (GPT-2, LLaMA, Mistral). If you are implementing a Transformer from scratch, use Pre-Norm. The original paper's Post-Norm required careful warmup and learning rate scheduling to avoid instability at depth.
Q20: Why does FFN expand by 4×?
It was an empirical choice in the original paper that has stuck because it works. The FFN in each Transformer block projects the d_model-dimensional residual to 4 × d_model, applies a nonlinearity, and projects back. The 4× expansion gives the FFN enough capacity to store and retrieve factual associations — the FFN is sometimes described as a key-value memory store that complements what Attention does. Wider is generally better up to a point, and 4× became a convention. Some architectures deviate: LLaMA uses a SwiGLU activation with roughly 8/3× expansion to achieve similar capacity with a gating mechanism. The specific ratio matters less than having enough capacity; 4× is a reasonable starting point.
Q21: What is a residual connection, and why is it necessary?
A residual connection (also called a skip connection) adds the layer's input directly to its output: output = x + SubLayer(x). The intuition: rather than forcing each layer to learn the full transformation, you let it learn only the incremental change (the "residual"). This has two critical effects. First, it creates a highway for gradients to flow directly from the loss back to early layers without passing through every nonlinearity — dramatically improving training of deep networks. Second, at initialization, each layer contributes something close to zero to the residual (because weights are small), so the initial computation is nearly the identity function, which is a stable starting point. Without residual connections, a 100-layer Transformer is nearly untrainable. With them, it trains almost as reliably as a shallow model.
C.6 Practical Issues
Q22: How do I fine-tune a model without GPUs?
The honest answer is: it depends on the model size and task. For models up to ~3B parameters, you can fine-tune on a modern CPU with enough RAM and patience — it will be slow but possible. More practically: LoRA (Low-Rank Adaptation) dramatically reduces the number of trainable parameters, making fine-tuning feasible on consumer GPUs or even Colab-grade hardware. QLoRA adds 4-bit quantization on top of LoRA, bringing 7B model fine-tuning within reach of a single 16GB GPU. For no-GPU setups, cloud providers (Google Colab, Lambda, Vast.ai) offer hourly GPU rentals that are often cheaper than you expect for short fine-tuning runs. The bigger constraint is usually data quality, not compute — a few thousand high-quality examples fine-tuned with LoRA often outperform a hundred thousand noisy examples.
Q23: My model output is repetitive. What do I fix?
Repetition is almost always a temperature and sampling issue, not a model architecture issue. First check: is temperature set too low? Very low temperature (below 0.3) causes the model to loop because the highest-probability continuation is often literally the same token it just produced. Try raising temperature to 0.7–1.0. Second: add a repetition penalty (repetition_penalty = 1.1 to 1.3 in most inference libraries), which discounts the logits of recently generated tokens. Third: switch from greedy or low-temperature sampling to Top-P sampling (top_p = 0.9). If repetition persists after these adjustments, the underlying model may have been trained on repetitive data — a data quality problem that requires either more fine-tuning or a different base model.
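The repetition penalty in step two is simple to sketch. This follows the common scheme (divide positive logits by the penalty, multiply negative ones), applied to a hand-made toy logit dictionary; real inference libraries operate on tensors but do the same thing:

```python
def apply_repetition_penalty(logits: dict[str, float],
                             recent_tokens: list[str],
                             penalty: float = 1.2) -> dict[str, float]:
    """Discount logits of recently generated tokens. The common scheme:
    divide positive logits by the penalty, multiply negative ones by it,
    so the token is always made LESS likely."""
    out = dict(logits)
    for tok in set(recent_tokens):
        if tok in out:
            z = out[tok]
            out[tok] = z / penalty if z > 0 else z * penalty
    return out

logits = {"the": 3.0, "cat": 1.0, "sat": -0.5}
print(apply_repetition_penalty(logits, ["the", "the", "sat"]))
```

The asymmetric positive/negative handling is the subtle part: naively dividing a negative logit by a penalty greater than 1 would make the token more likely, not less.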
Q24: My model output is incoherent. What do I fix?
Incoherence has a different root cause than repetition: the model is sampling from high-entropy regions of its distribution. Check temperature first — above 1.2 the output often becomes word-soup. Lower it to 0.7–1.0. If using Top-K, try switching to Top-P with top_p = 0.9, which is more adaptive. If the problem persists at reasonable temperature, check whether the prompt is giving the model enough context — very short or ambiguous prompts produce high-variance outputs. Also check whether the model is actually appropriate for your task: a base model (not instruction-tuned) requires a very specific prompt format to produce coherent dialogue. If you are fine-tuning and see incoherence, check whether training loss is still high — the model may simply not be trained enough yet.
Q25: How do I estimate GPU memory for a given model?
A rough formula for inference in half precision (BF16/FP16): model parameters × 2 bytes. A 7B-parameter model needs about 14 GB just for parameters. Add KV Cache on top: 2 × n_layers × n_heads × head_dim × context_length × batch_size × 2 bytes. For a 7B model with 32 layers, 32 heads, head_dim 128, context 4096, batch size 1: roughly 2 GB additional. For training, multiply the parameter memory by about 12–18× to account for gradients, optimizer states (Adam uses two additional tensors per parameter), and activations. LoRA + 4-bit quantization can bring a 7B model's training footprint down to ~8–10 GB, which fits on a consumer GPU. When in doubt, add 20% margin to your estimate — memory fragmentation and framework overhead are real.
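The two formulas above can be wrapped into a rough calculator. This is the same back-of-the-envelope estimate, not a substitute for measuring actual usage (fragmentation and framework overhead are on top):

```python
def inference_memory_gib(n_params: float, n_layers: int, n_heads: int,
                         head_dim: int, context: int, batch: int = 1,
                         bytes_per_value: int = 2) -> tuple[float, float]:
    """Rough half-precision estimate: (parameter GiB, KV Cache GiB)."""
    param_bytes = n_params * bytes_per_value
    # 2 tensors (K and V), per layer, per head, per position, per batch item.
    kv_bytes = 2 * n_layers * n_heads * head_dim * context * batch * bytes_per_value
    return param_bytes / 2**30, kv_bytes / 2**30

params_gib, kv_gib = inference_memory_gib(7e9, n_layers=32, n_heads=32,
                                          head_dim=128, context=4096)
print(f"parameters ~{params_gib:.1f} GiB, KV Cache ~{kv_gib:.1f} GiB")
```

Scaling `context` to 128k or `batch` to 32 in this formula shows immediately why KV Cache, not parameters, dominates memory on busy long-context servers.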
C.7 Debugging
Q26: Training loss is NaN. What happened?
NaN loss almost always has one of a small set of causes. The most common: learning rate too high, causing parameter updates that overflow float16. Try reducing learning rate by 10× or switching to BF16 (which has a wider dynamic range than FP16). Second most common: a bad batch — a sequence with zero valid tokens, or all tokens masked, can produce a 0/0 in the loss computation. Third: gradient explosion — add gradient clipping (max_norm = 1.0) if you have not already. Fourth: a NaN in the input data itself, which propagates forward. The debugging procedure: add a check for NaN in the loss and log the batch index when it first appears; inspect that batch for anomalies. If you see NaN in the first step, it is almost certainly initialization or data; if it appears many steps in, it is usually learning rate or a particularly difficult batch.
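The "check and log the batch index" step is a few lines worth having in every training loop. A minimal sketch (the message format is illustrative):

```python
import math

def check_loss(loss: float, step: int, batch_idx: int) -> None:
    """Fail loudly the moment the loss goes non-finite, recording which
    batch triggered it so that exact batch can be inspected."""
    if math.isnan(loss) or math.isinf(loss):
        raise RuntimeError(f"non-finite loss at step {step}, batch {batch_idx}")

check_loss(2.31, step=9, batch_idx=2)   # fine: returns silently
try:
    check_loss(float("nan"), step=10, batch_idx=3)
except RuntimeError as e:
    print(e)
```

Failing at the first non-finite value, rather than letting NaN propagate through the optimizer for thousands of steps, is what makes the step-1 vs. many-steps-in diagnosis above possible.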
Q27: Training loss decreases but the model generates garbage. Why?
The loss is measuring the right thing (predicting the next token from ground truth), but that does not guarantee generation quality. Several failure modes fit this description. One: the model has memorized the training distribution but cannot generalize — check with a held-out validation set. Two: temperature is too high during evaluation; try temperature 0.7 or greedy. Three: tokenizer mismatch — if you are encoding inputs with one tokenizer and decoding with another, the output will be nonsense even with perfect logits. Four: the model is an instruction-tuned checkpoint being prompted as a base model, or vice versa. Five (less common): numerical precision issues during inference — ensure you are using the same dtype as training. The most reliable diagnostic is to run the model with teacher forcing on a known example and check whether the per-token probabilities look reasonable.
Q28: Same prompt, different outputs at different times. Bug or feature?
Feature, intentionally. If you are using any temperature above zero or any sampling strategy other than pure greedy, the model samples from a probability distribution — different runs produce different draws. This is by design: deterministic greedy outputs are often repetitive and less useful. If you need reproducibility, set a fixed random seed in your inference library and use temperature 0 (greedy). If you need consistency across API calls, look for a seed parameter — most modern inference APIs expose one. The underlying model is deterministic given the same input and random seed; the apparent non-determinism comes from the sampling step. On some hardware (multi-GPU inference with non-deterministic ops), you can get different results even with the same seed — this is a deeper reproducibility problem specific to distributed inference.
C.8 Concept Disambiguation
Q29: Attention vs. Self-Attention vs. Cross-Attention — what is what?
Attention is the general mechanism: compute weighted sums of values using relevance scores between queries and keys. Self-Attention is Attention where the queries, keys, and values all come from the same sequence — each token attends to all other tokens in its own sequence. Cross-Attention is Attention where the queries come from one sequence and the keys and values come from a different sequence. In GPT-style decoder-only models, every Attention layer is Self-Attention (the model only sees its own sequence). In encoder-decoder models like the original Transformer or T5, the decoder has both Self-Attention (attending to previously generated tokens) and Cross-Attention (attending to the encoder's output). If someone says "the model uses Attention," they almost certainly mean Self-Attention.
Q30: Encoder-only vs. Decoder-only vs. Encoder-Decoder — when to use which?
Encoder-only models (BERT, RoBERTa) read the full input bidirectionally and produce rich representations. They are best for classification, named entity recognition, and other tasks where you need a deep understanding of a fixed input. Decoder-only models (GPT, LLaMA, Mistral) generate text autoregressively and are best for open-ended generation, instruction following, and chat. Encoder-Decoder models (T5, BART, the original Transformer) use an encoder to compress the input and a decoder to generate output — best for tasks with a clear input-output structure like translation or summarization. The field has largely converged on decoder-only for general-purpose LLMs, because scaling laws favor it and instruction tuning can teach it classification and extraction tasks without an encoder. Encoder-only models remain strong for embedding and retrieval tasks.
Q31: Pre-training vs. Fine-tuning vs. Alignment vs. RLHF — how do they relate?
These are four stages of the modern LLM production pipeline, in order. Pre-training: train a model from scratch on a large corpus to predict the next token — this is where the model acquires its general knowledge and language ability. Fine-tuning (also called SFT, supervised fine-tuning): train the pre-trained model on curated examples of the target task or format — this teaches it to follow instructions and produce the right style of output. Alignment: the broad goal of making the model behave helpfully, harmlessly, and honestly — fine-tuning is one tool, but alignment is the broader engineering and research problem. RLHF (Reinforcement Learning from Human Feedback): a specific alignment technique where human raters rank model outputs, a reward model learns those preferences, and the policy (the LLM) is optimized against the reward model using PPO or a similar algorithm. RLHF is the technique that made ChatGPT feel qualitatively different from a raw pre-trained model. The four stages are sequential; skipping fine-tuning and alignment leaves a model that can predict text but does not usefully answer questions.
C.9 Advanced Topics
Q32: What exactly is "emergent capability" and is it real?
Emergence refers to capabilities that appear in larger models but not smaller ones, seemingly discontinuously — for example, multi-step arithmetic, chain-of-thought reasoning, or in-context learning appearing around certain model scales. The empirical observation is real: some tasks show near-zero performance below a threshold and near-human performance above it. The interpretation is contested. One view (Wei et al., 2022): these capabilities genuinely emerge from scale and are not simply latent in smaller models. The counterargument (Schaeffer et al., 2023): the discontinuity is often an artifact of the evaluation metric — using a continuous metric instead of accuracy often reveals smooth scaling. The practical takeaway is genuine: scale matters enormously for certain capabilities, and some things that are impossible at 1B parameters become easy at 70B. Whether to call that "emergence" or "steep scaling" is partly semantic.
Q33: Why do reasoning models like o1 and R1 need "thinking time"?
Standard LLMs generate one token at a time and commit to each token immediately. For problems requiring multi-step reasoning — a math proof, a logic puzzle, a complex coding task — a single forward pass often does not give the model enough computation to arrive at the right answer. Reasoning models address this by generating intermediate "thinking" tokens (sometimes visible as a scratchpad, sometimes hidden) before producing the final answer. This converts a hard single-step problem into a series of easier steps the model can generate autoregressively. The deeper point is that inference compute can substitute for model size and training compute up to a point — a smaller model that "thinks" for longer can match a larger model that generates immediately. This is why the field is increasingly interested in test-time compute scaling alongside training-time compute scaling.
Q34: What makes Mamba potentially better than Transformer? What makes it worse?
Mamba is a selective state-space model (SSM) that processes sequences with linear (rather than quadratic) time and memory complexity. The key mechanism is a learned, input-dependent selection matrix that decides what information to retain and what to discard at each step — analogous to Attention, but operating as a recurrence rather than an all-pairs comparison. The potential advantages: much faster inference for long sequences (no quadratic Attention bottleneck), constant-size memory regardless of context length, and faster training at very long sequence lengths. The disadvantages: because Mamba compresses past context into a fixed-size hidden state, it can lose information that Attention would retain exactly. Empirically, at the sequence lengths common in current LLMs (4k–128k tokens), well-tuned Transformers still match or exceed Mamba on most benchmarks. The architectural competition is ongoing, and hybrid models (alternating SSM and Attention layers) show real promise.
C.10 Learning Advice
Q35: I understand each component, but the whole still feels fuzzy. What do I do?
Implement a tiny GPT from scratch. Not in ten minutes using a library — from the raw matrix operations. Write the embedding lookup, the Q/K/V projections, the scaled dot-product attention, the causal mask, the softmax, the FFN, the residual additions, the LayerNorm, and the output projection. Stack a few layers. Write a training loop. Train it on a small text corpus until the loss drops. Generate a few tokens. This exercise forces the fuzzy pieces to crystallize because bugs in the connections surface immediately. The way I explain this to engineers: reading about riding a bicycle and understanding each component (handlebars, pedals, balance) is not the same as being able to ride. The implementation is the ride. Chapters 3–6 cover each component with code; putting them together is the exercise that makes the book click.
Q36: Should I read the original papers?
Yes, eventually — but not first. The papers assume you already have the intuition and are filling in precise details. If you start with "Attention Is All You Need" before building mental models, you will likely get lost in the notation and miss the architecture for the equations. My recommended order: build intuition from a resource like this book, implement the model yourself, then read the paper to understand the design decisions and ablations. After that, papers become much faster to read because you already know what they are describing. The most important papers to read after this book: "Attention Is All You Need" (Vaswani et al., 2017), "Language Models are Few-Shot Learners" (Brown et al., 2020), "Training language models to follow instructions with human feedback" (Ouyang et al., 2022), and the LLaMA papers for modern engineering decisions.
Q37: What comes next after this book?
Three concrete next steps. First, implement: build a character-level or BPE-level GPT trained on a small corpus (Shakespeare, code, whatever interests you). The Andrej Karpathy nanoGPT codebase is a good reference. Second, go wider: fine-tuning, LoRA, quantization, and inference optimization are now core engineering skills — the Hugging Face ecosystem is the practical on-ramp. Third, go deeper: once you are comfortable with the mechanics, the research frontier (long context, mixture-of-experts, multimodality, reasoning models, alignment) becomes accessible because you have the foundation. The architecture in this book is the foundation that everything else builds on. The field moves fast, but the core ideas have been stable since 2017 — and understanding them deeply puts you in a position to learn the incremental advances quickly rather than chasing each one from scratch.
These are the questions I hear most often. If yours is not here, the book's GitHub is the right place to raise it.