One-sentence summary: Pretraining teaches a model to predict text; RLHF and DPO teach it to predict what you actually wanted.
29.1 Why Alignment Exists
29.1.1 The Original Sin of Pretraining
Imagine spending tens of millions of dollars and thousands of GPU-months training a 175B-parameter language model. It can continue any text fluidly. It has read most of the written internet. Its knowledge is vast.
Then you type: "How do I make fried rice?"
It replies:
How to make fried rice? This is an excellent question. Fried rice has a rich
culinary history spanning several thousand years...
[500 words of history follow]
...and that is why fried rice remains a cornerstone of global cuisine.
Related questions you might consider:
1. How does day-old rice change the texture?
2. What is the caloric content?
3. Which soy sauce is recommended?
The model did not answer your question. It continued internet-style text about your question. That is exactly what next-token prediction optimizes for.
This is the first of three core problems with pretrained-only models.
29.1.2 Three Problems with Unaligned Models
1. Not Helpful
Pretraining optimizes for "predict the next token given the training distribution." That distribution is web text, books, and code — not Q&A transcripts where an expert actually answers the question asked.
A prompt like "write a poem about spring" might produce a paragraph discussing famous spring poems, rather than a poem.
2. Not Harmless
The internet contains harmful content. A model trained on it without filtering learns to reproduce that content in the contexts where it appeared. Ask directly enough, and a raw pretrained model will often comply.
3. Not Honest (Hallucination)
Models are trained to produce fluent, confident-sounding text. They have no explicit mechanism to say "I do not know." The result is hallucination — confident fabrication:
User: When did Einstein win the Nobel Prize in Chemistry?
Model: Einstein won the Nobel Prize in Chemistry in 1925 for his work on
organic reaction mechanisms...
Einstein won the Physics prize in 1921 for the photoelectric effect. The model produced a plausible-sounding wrong answer because confident prose is what training rewarded.
29.1.3 The HHH Goal
Anthropic codified the alignment target as three properties, often called HHH:
| Property | Meaning |
|---|---|
| Helpful | Understand what the user actually wants and answer it |
| Harmless | Refuse genuinely dangerous requests without being paranoid |
| Honest | Acknowledge uncertainty rather than confabulate |
These seem obvious. Making a model satisfy all three simultaneously, without trading one against another, is the hard part.
29.1.4 InstructGPT: The Proof of Concept
In 2022, OpenAI published the InstructGPT paper. The result was counterintuitive: a 1.3B-parameter model fine-tuned with RLHF was preferred by human raters 71% of the time over the 175B GPT-3 responding to the same prompts.
Human preference comparison:
InstructGPT-1.3B preferred 71% of the time
GPT-3-175B preferred 29% of the time
A model 130 times smaller, evaluated as substantially better. The gap was not about knowledge or parameter count — it was about alignment. ChatGPT is essentially GPT-3.5 with this same RLHF pipeline applied.
29.2 The RLHF Pipeline
29.2.1 Three Stages
RLHF has three sequential stages. Each builds on the previous:
Stage 1: Supervised Fine-Tuning (SFT)
Input: (prompt, human-written response) pairs
Output: SFT model that knows the Q&A format
Stage 2: Reward Model (RM) Training
Input: (prompt, response_A, response_B, preference) tuples
Output: a scoring model that predicts human preference
Stage 3: RL Optimization via PPO
Input: prompts + RM feedback signal
Output: the aligned policy model
29.2.2 Stage 1: Supervised Fine-Tuning
The goal here is modest: get the model to respond in a helpful format at all.
Data: human-written demonstrations. OpenAI used about 13,000 pairs for InstructGPT. The annotators were 40 people who had passed a screening process for quality and consistency.
Training: standard cross-entropy fine-tuning on the response tokens:
# SFT training loop (simplified; in practice the loss is usually
# masked so that only the response tokens contribute)
for prompt, response in sft_dataset:
    input_ids = tokenize(f"User: {prompt}\nAssistant: {response}")
    logits = model(input_ids)
    loss = cross_entropy(logits[:-1], input_ids[1:])  # next-token prediction
    loss.backward()
What changes: the model learns to respond in the Q&A format, stay on topic, and follow instructions. The quality is variable and sometimes poor, but it is answering the right kind of question.
Quality over quantity: 13,000 high-quality demonstrations beat 130,000 scraped examples. Annotation guidelines matter more than scale here.
29.2.3 Stage 2: Training the Reward Model
Now we want a model that can automatically judge whether a response is good.
Data collection process:
For each prompt, the SFT model generates K responses (typically K = 4 to 9). Human annotators rank them. That ranking is converted into all pairwise comparisons: K responses yield K(K−1)/2 pairs.
Prompt: "Explain what a neural network is."
Response A: "A neural network is a system of interconnected nodes
inspired by biological neurons that learns from examples..."
Response B: "Neural network = brain simulator lol"
Response C: "A neural network is a mathematical model that consists of
layers of neurons connected by weights. Through training..."
Annotator ranking: A > C > B
Extracted pairs:
(A, C) → A wins
(A, B) → A wins
(C, B) → C wins
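The ranking-to-pairs extraction above can be sketched in a few lines (an illustrative helper, not from any library):

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Convert a ranked list (best first) into (winner, loser) pairs.

    K responses yield K*(K-1)/2 pairwise comparisons.
    """
    return [(winner, loser) for winner, loser in combinations(ranked_responses, 2)]

# Annotator ranking A > C > B from the example above:
pairs = ranking_to_pairs(["A", "C", "B"])
# → [("A", "C"), ("A", "B"), ("C", "B")]
```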
InstructGPT used roughly 33,000 comparison pairs from about 5,000 prompts.
Model architecture: the reward model is typically the SFT model with the language modeling head replaced by a single scalar output. Input is (prompt + response), output is one number representing quality.
Reward Model:
[prompt] [response]
↓
Transformer layers (usually initialized from SFT model)
↓
Linear head
↓
Scalar score r ∈ ℝ
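A minimal PyTorch sketch of this architecture (the backbone interface and all names here are illustrative assumptions, not a specific library's API):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Sketch: transformer backbone + scalar value head.

    `backbone` is assumed to be any module mapping input_ids of shape
    (batch, seq) to hidden states of shape (batch, seq, hidden).
    """
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids)               # (B, T, H)
        last_idx = attention_mask.sum(dim=1) - 1        # index of final real token
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]  # (B, H)
        return self.value_head(last_hidden).squeeze(-1) # (B,) scalar score r
```

The score is usually read off the hidden state of the final non-padding token, since that position has attended to the entire (prompt + response) sequence.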
Training objective — Bradley-Terry model:
We want the probability that response A is better than response B to depend on the gap in their scores:

P(A > B) = σ( r(x, A) − r(x, B) )

where σ is the sigmoid function. The training loss maximizes the log-likelihood of the human rankings:

L_RM = −E_{(x, y_w, y_l)}[ log σ( r(x, y_w) − r(x, y_l) ) ]

where y_w is the preferred (winner) response and y_l is the rejected (loser) response.
In code:
def compute_rm_loss(prompt, chosen, rejected, reward_model):
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    # log P(chosen > rejected) = log σ(r_chosen - r_rejected)
    loss = -torch.log(torch.sigmoid(r_chosen - r_rejected))
    return loss.mean()
A concrete example of the Bradley-Terry scoring:
If r(A) = 5, r(B) = 3:
P(A > B) = σ(5 - 3) = σ(2) ≈ 0.88
If r(A) = 3, r(B) = 5:
P(A > B) = σ(3 - 5) = σ(-2) ≈ 0.12
If r(A) = r(B) = 4:
P(A > B) = σ(0) = 0.50
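These numbers can be verified with a throwaway sketch (standalone, not part of any training pipeline):

```python
import math

def bt_prob(r_a: float, r_b: float) -> float:
    """P(A preferred over B) under Bradley-Terry: sigmoid of the score gap."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

print(round(bt_prob(5, 3), 2))  # σ(2)  ≈ 0.88
print(round(bt_prob(3, 5), 2))  # σ(-2) ≈ 0.12
print(bt_prob(4, 4))            # σ(0)  = 0.5
```

Note that only the gap matters: r(A) = 105, r(B) = 103 gives the same 0.88 as r(A) = 5, r(B) = 3, which is why reward model scores have no absolute scale.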
29.2.4 Stage 3: PPO Optimization
We now have a scorer. The third stage uses that scorer as a reward signal to improve the language model.
Why not just use supervised learning here? Because we do not have a "correct answer" — we have a quality signal. Open-ended generation does not have a unique right answer; it has better and worse outputs. Reinforcement learning is the right tool when reward is defined but ground truth is not.
The optimization objective:

maximize E_{x~D, y~π_policy}[ r(x, y) ] − β · D_KL( π_policy(y|x) ‖ π_SFT(y|x) )

Two terms in tension:
- Maximize reward: generate responses the RM scores highly
- KL penalty: do not drift too far from the SFT reference model
Why the KL penalty? Without it, the policy will find exploits in the reward model. Common reward hacking patterns in practice:
Discovered shortcuts without KL constraint:
- Longer is always better → model becomes verbose
- "As a helpful AI assistant..." opener → model uses it every time
- Confident tone scores higher → model stops expressing uncertainty
With KL constraint:
- Policy can improve but stays near SFT behavior
- Reward exploits are penalized by the KL term
KL divergence — token-level calculation:
import torch.nn.functional as F

def compute_kl(policy_logits, reference_logits):
    policy_probs = F.softmax(policy_logits, dim=-1)
    policy_logps = F.log_softmax(policy_logits, dim=-1)
    reference_logps = F.log_softmax(reference_logits, dim=-1)
    # KL(policy || reference): sum over the vocab, average over tokens
    kl = policy_probs * (policy_logps - reference_logps)
    return kl.sum(dim=-1).mean()
The coefficient β is typically initialized between 0.01 and 0.1. Many systems adjust it dynamically: increase it when KL grows too large, decrease it when the policy is too conservative. The target KL is often set to 6–10 nats.
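One common way to implement the dynamic adjustment is a proportional controller in the style of early RLHF work; this is a sketch with illustrative names and constants, not a specific library's implementation:

```python
class AdaptiveKLController:
    """Sketch: grow beta when observed KL exceeds the target, shrink it otherwise."""
    def __init__(self, init_beta: float = 0.05, target_kl: float = 6.0,
                 horizon: int = 10_000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon  # controls how aggressively beta moves

    def update(self, current_kl: float, n_steps: int) -> float:
        # Proportional error, clipped to [-0.2, 0.2] for stability
        error = max(min(current_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta
```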
Why PPO specifically? The LM action space is enormous (one action = one token from a vocab_size-way categorical). Naive policy gradient is unstable in this regime. PPO stabilizes by clipping the probability ratio between old and new policy:
PPO clip: ensure new_policy(a) / old_policy(a) ∈ [1 - ε, 1 + ε]
This prevents the policy from changing too quickly in a single update. PPO is also simpler to implement than TRPO (which requires the Fisher information matrix).
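The clipped surrogate can be written in a few lines. This is a minimal sketch with assumed tensor shapes, not a full PPO implementation (no value loss, entropy bonus, or advantage estimation):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps: float = 0.2):
    """PPO clipped surrogate objective (sketch).

    logp_new/logp_old: per-token log-probs of the sampled tokens under the
    new and old policies; advantages: estimated advantage per token.
    """
    ratio = torch.exp(logp_new - logp_old)                 # π_new(a) / π_old(a)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Take the pessimistic (smaller) objective, negate to get a loss
    return -torch.min(unclipped, clipped).mean()
```

Clipping removes the incentive to move the ratio outside [1 − ε, 1 + ε]: once a token's probability has shifted that far, pushing further yields no additional gain in the surrogate.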
29.2.5 Full Pipeline Summary
| Stage | Input | Output | Purpose |
|---|---|---|---|
| SFT | (prompt, response) pairs | SFT model | Learn the Q&A format |
| RM training | Preference comparisons | Reward model | Learn to score responses |
| PPO | Prompts + RM signal | Aligned model | Generate better responses |
29.3 Reward Model in Detail
29.3.1 Annotation Consistency
One real problem: annotators disagree. Different people have different standards for what is "helpful" or "appropriate."
Prompt: "Should society support genetic engineering?"
Annotator A (researcher): prefers the nuanced pro-science response
Annotator B (policy focus): prefers the cautious balanced response
Mitigations:
- Write explicit annotation rubrics
- Use majority vote from multiple annotators per pair
- Screen annotators for consistency before the main annotation run
29.3.2 Reward Hacking
Reward models are trained on a finite distribution. The policy will eventually find inputs that score high but are genuinely bad:
Common reward hacking patterns:
- Verbose responses get higher scores → rambling
- Lists look structured → model wraps everything in lists
- Self-confident tone scores higher → drops appropriate uncertainty
Defenses include diverse training data, adding adversarial examples, and refreshing the reward model as the policy drifts.
29.4 DPO: The Simpler Alternative
29.4.1 RLHF's Practical Problems
RLHF works. It is also expensive and fragile. In practice, running it requires:
- Four models simultaneously in GPU memory: the policy being trained, a frozen reference copy of the SFT model, the reward model, and a value (critic) model for PPO
- PPO hyperparameter tuning (learning rate, KL coefficient, GAE lambda)
- A rollout loop that generates responses at inference speed while training
For many teams, this complexity is prohibitive. The compute cost is roughly 3–4x the SFT-only baseline.
29.4.2 DPO's Core Insight
In 2023, the Stanford researchers behind DPO noticed that the RLHF objective has a closed-form solution. The optimal policy satisfies:

π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp( r(x, y) / β )

Rearranging, the reward can be expressed in terms of the policy ratio:

r(x, y) = β · log( π*(y|x) / π_ref(y|x) ) + β · log Z(x)

The reward is just the log probability ratio, scaled by β. The partition term β log Z(x) depends only on the prompt, so it cancels when two responses to the same prompt are compared. This means we can substitute the reward expression directly into the Bradley-Terry comparison probability and get a loss that only involves the policy model and the reference model — no separate reward model needed.
29.4.3 The DPO Loss
Substituting into the preference probability formula gives the DPO loss:

L_DPO = −E[ log σ( β log( π_θ(y_w|x) / π_ref(y_w|x) ) − β log( π_θ(y_l|x) / π_ref(y_l|x) ) ) ]
Intuition:
- Increase the relative probability of the preferred response
- Decrease the relative probability of the rejected response
- Both are measured relative to the reference model, so you do not drift far from the SFT starting point
29.4.4 DPO vs RLHF
| Property | RLHF | DPO |
|---|---|---|
| Pipeline stages | 3 | 1 |
| Models in memory | 4 (policy, ref, RM, critic) | 2 (ref, policy) |
| Training stability | PPO is finicky | Similar to SFT |
| Compute cost | ~3–4× SFT | ~1–1.5× SFT |
| Data needed | Same preference pairs | Same preference pairs |
| Practical quality | Very good | Close to or matching RLHF |
29.4.5 DPO Code
import torch
import torch.nn.functional as F

def compute_dpo_loss(
    policy_model,
    reference_model,
    chosen_input_ids,
    rejected_input_ids,
    beta: float = 0.1,
):
    policy_chosen_logps = get_log_probs(policy_model, chosen_input_ids)
    policy_rejected_logps = get_log_probs(policy_model, rejected_input_ids)

    with torch.no_grad():  # reference model is frozen
        ref_chosen_logps = get_log_probs(reference_model, chosen_input_ids)
        ref_rejected_logps = get_log_probs(reference_model, rejected_input_ids)

    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    logits = beta * (chosen_logratios - rejected_logratios)
    loss = -F.logsigmoid(logits).mean()
    return loss

def get_log_probs(model, input_ids):
    # Sum of per-token log-probs of the sequence under the model.
    # (In practice, prompt and padding tokens are masked out of the sum.)
    outputs = model(input_ids)
    logits = outputs.logits[:, :-1, :]
    labels = input_ids[:, 1:]
    log_probs = F.log_softmax(logits, dim=-1)
    selected = torch.gather(log_probs, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
    return selected.sum(dim=-1)
29.4.6 DPO with TRL
HuggingFace's TRL library ships a production-grade DPO trainer:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
import torch

model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Optional LoRA to reduce VRAM
lora_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Preference dataset: {"prompt": ..., "chosen": ..., "rejected": ...}
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

def format_dataset(example):
    return {
        "prompt": example["prompt"],
        "chosen": example["chosen"][1]["content"],
        "rejected": example["rejected"][1]["content"],
    }

dataset = dataset.map(format_dataset)

training_args = DPOConfig(
    output_dir="./dpo-mistral",
    beta=0.1,
    learning_rate=5e-7,  # smaller than SFT, usually 1e-7 to 5e-7
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    warmup_ratio=0.1,
    bf16=True,
    logging_steps=10,
    save_steps=500,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train_prefs"],  # this dataset's preference splits
    eval_dataset=dataset["test_prefs"],
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./dpo-mistral-final")
29.5 Other Alignment Methods
29.5.1 RLAIF: AI Feedback Instead of Human Feedback
Human annotation is expensive and slow. RLAIF (Reinforcement Learning from AI Feedback) replaces human comparisons with AI comparisons.
Constitutional AI (CAI), from Anthropic, is the best-known RLAIF approach:
- Define a "constitution" — a list of principles, such as: responses should be helpful, responses should not contain harmful content, responses should acknowledge uncertainty, responses should respect privacy.
- Let the AI critique its own initial response according to the constitution.
- Let the AI revise based on its critique.
- Use the AI-generated preference pairs to train a reward model.
Constitutional critique loop:
AI initial response: [possibly problematic answer]
AI self-critique:
"Does this response violate the principle
of not promoting harmful behavior?"
"Yes, because..."
AI revised response: [improved answer]
Advantages: much cheaper than human annotation, fast to iterate, principles are explicit and auditable.
Disadvantages: the AI's judgment inherits its own biases. Subtle value judgments can be wrong in ways that scale.
29.5.2 KTO: Single-Point Feedback
DPO requires paired comparisons: for every rejected response, there must be a chosen one. Collecting this is still costly.
KTO (Kahneman-Tversky Optimization) works with simpler data: just a label saying whether a response was good or bad.
DPO data format: (prompt, chosen_response, rejected_response)
KTO data format: (prompt, response, is_good) # is_good ∈ {0, 1}
KTO is motivated by prospect theory from behavioral economics: people are more sensitive to losses than to equivalent gains. The loss function encodes this asymmetry directly.
This is useful when you have click-through data or simple thumbs-up/thumbs-down signals without paired comparisons.
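A deliberately simplified sketch of the KTO idea follows. The actual KTO loss also subtracts a KL-based reference point z_ref estimated per batch (omitted here), and the λ weights for desirable vs. undesirable examples follow the paper's notation; treat this as an illustration of the asymmetric structure, not a faithful implementation:

```python
import torch

def kto_loss_sketch(policy_logps, ref_logps, is_good,
                    beta: float = 0.1,
                    lambda_d: float = 1.0, lambda_u: float = 1.0):
    """Simplified KTO-style loss on single-point feedback.

    is_good: 1 for thumbs-up responses, 0 for thumbs-down.
    """
    logratio = beta * (policy_logps - ref_logps)
    good = lambda_d * (1 - torch.sigmoid(logratio))    # push good responses up
    bad = lambda_u * (1 - torch.sigmoid(-logratio))    # push bad responses down
    return torch.where(is_good.bool(), good, bad).mean()
```

Setting λ_u > λ_d encodes the prospect-theory asymmetry directly: a bad response costs more than an equally good one earns.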
29.5.3 IPO: Identity Preference Optimization
DPO can overfit in some regimes — the policy log-ratio diverges on pairs where one response is strongly preferred. IPO addresses this with a smoother loss function that prevents the preferred/rejected margin from growing without bound.
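A sketch of the IPO objective, using the same log-ratio quantities as the DPO loss (a minimal illustration of the squared-loss form, not a full implementation):

```python
import torch

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """IPO sketch: a squared loss that pins the preference margin at
    1/(2*beta) instead of letting it grow without bound as in DPO."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    margin = chosen_logratios - rejected_logratios
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()
```

Once the margin reaches 1/(2β), the gradient is zero, so strongly-preferred pairs stop pulling the policy further from the reference.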
29.5.4 Method Comparison
| Method | Needs RM? | Needs RL? | Data format | Complexity |
|---|---|---|---|---|
| RLHF | Yes | Yes (PPO) | Preference pairs | High |
| DPO | No | No | Preference pairs | Low |
| RLAIF / CAI | Yes | Yes | AI-generated pairs | Medium |
| KTO | No | No | Binary signal | Low |
| IPO | No | No | Preference pairs | Low |
29.6 Real-World Practice
29.6.1 The Full Training Stack
A modern aligned LLM goes through roughly four phases:
Phase 1 — Pretraining
Data: Trillions of tokens from the web, books, code
Goal: Next-token prediction
Compute: Thousands of GPUs, months
Output: Base model (capable but unhelpful)
Phase 2 — Supervised Fine-Tuning
Data: Tens of thousands of (instruction, response) pairs
Goal: Learn the Q&A format and follow instructions
Compute: Tens of GPUs, days
Output: Instruct model (helpful, variable quality)
Phase 3 — Alignment (RLHF or DPO)
Data: Hundreds of thousands of preference comparisons
Goal: Increase quality, reduce harm, improve honesty
Compute: Tens of GPUs, days to weeks
Output: Aligned model (ChatGPT, Claude, etc.)
Phase 4 — Continuous iteration
Collect user feedback, identify failure modes, repeat
29.6.2 Open-Source Examples
LLaMA 2 Chat (Meta):
- SFT: ~27,540 high-quality conversations
- RLHF: ~1.4 million preference comparisons
- Five rounds of iterative RLHF
Zephyr (HuggingFace):
- Base: Mistral 7B
- SFT: UltraChat dataset
- DPO: UltraFeedback dataset
- Outcome: outperformed LLaMA 2 70B Chat on several benchmarks with a 7B model
OpenChat / Starling: uses conditioned reward fine-tuning (C-RLFT), mixing SFT and preference learning to approach GPT-3.5-class behavior at 7B.
29.6.3 Alignment Datasets
Commonly used open-source datasets:
| Dataset | Type | Scale | Use |
|---|---|---|---|
| OpenAssistant | SFT | 161K conversations | Multi-turn SFT |
| Dolly | SFT | 15K instructions | Instruction tuning |
| UltraChat | SFT | 1.5M conversations | Multi-turn SFT |
| UltraFeedback | Preference | 64K comparisons | DPO |
| Anthropic HH-RLHF | Preference | ~170K comparisons | RLHF / DPO, safety alignment (helpful + harmless splits) |
29.7 Chapter Summary
29.7.1 Key Concepts
| Concept | Meaning |
|---|---|
| Alignment | Making the model's behavior match human values and intent |
| HHH | Helpful, Harmless, Honest — the three alignment goals |
| RLHF | Train a reward model from human comparisons, then PPO-optimize the policy |
| SFT | Supervised fine-tuning on demonstration data |
| Reward Model | A model that outputs a scalar quality score |
| PPO | Policy gradient RL algorithm used to optimize the language model |
| DPO | Direct Preference Optimization — learns from preference pairs without a RM |
| KL constraint | Prevents the policy from drifting too far from the SFT reference |
29.7.2 Key Formulas
Bradley-Terry preference probability:

P(y_w > y_l | x) = σ( r(x, y_w) − r(x, y_l) )

RLHF objective:

maximize E[ r(x, y) ] − β · D_KL( π_policy(y|x) ‖ π_SFT(y|x) )

DPO loss:

L_DPO = −E[ log σ( β log( π_θ(y_w|x) / π_ref(y_w|x) ) − β log( π_θ(y_l|x) / π_ref(y_l|x) ) ) ]
29.7.3 My Take
The insight I keep coming back to: InstructGPT 1.3B beating GPT-3 175B on human preference is not a fluke. A small model that knows what you want is more useful than a large model that performs a statistical approximation of internet text. Alignment is not just ethics infrastructure — it is the difference between a tool people use and one they abandon.
DPO made this accessible. You can now run a serious preference-learning experiment on a single A100 node, with a public preference dataset, in under a day.
Chapter Checklist
After this chapter, you should be able to:
- Explain why a pretrained model is not automatically helpful (three problems).
- Describe the three stages of RLHF: SFT, reward modeling, and PPO.
- Explain the Bradley-Terry model and what it means for training the reward model.
- Explain why the KL penalty exists in PPO and what reward hacking looks like without it.
- Derive the DPO loss function from the RLHF objective.
- Compare DPO and RLHF on complexity, compute cost, and practical quality.
- Run a DPO training job using the TRL library.
See You in the Next Chapter
The model now behaves the way you want. But it activates every parameter for every token — even when the query is simple and most of that capacity is wasted.
Chapter 30 explains Mixture of Experts: how Mixtral 8x7B achieves LLaMA 70B quality while activating only 12.9B parameters per token, and how DeepSeek-V3 pushed this to 256 fine-grained experts with a single shared expert as a universal backbone.