One-sentence summary: Pretraining teaches a model to predict text; RLHF and DPO teach it to predict what you actually wanted.
29.1 Why Alignment Exists
29.1.1 The Original Sin of Pretraining
Imagine spending tens of millions of dollars and thousands of GPU-months training a 175B-parameter language model. It can continue any text fluidly. It has read most of the written internet. Its knowledge is vast.
Then you type: "How do I make fried rice?"
It replies:
How to make fried rice? This is an excellent question. Fried rice has a rich
culinary history spanning several thousand years...
[500 words of history follow]
...and that is why fried rice remains a cornerstone of global cuisine.
Related questions you might consider:
1. How does day-old rice change the texture?
2. What is the caloric content?
3. Which soy sauce is recommended?
The model did not answer your question. It continued internet-style text about your question. That is exactly what next-token prediction optimizes for.
This is the first of three core problems with pretrained-only models.
29.1.2 Three Problems with Unaligned Models
1. Not Helpful
Pretraining optimizes for "predict the next token given the training distribution." That distribution is web text, books, and code — not Q&A transcripts where an expert actually answers the question asked.
A prompt like "write a poem about spring" might produce a paragraph discussing famous spring poems, rather than a poem.
2. Not Harmless
The internet contains harmful content. A model trained on it without filtering learns to reproduce that content in the contexts where it appeared. Ask directly enough, and a raw pretrained model will often comply.
3. Not Honest (Hallucination)
Models are trained to produce fluent, confident-sounding text. They have no explicit mechanism to say "I do not know." The result is hallucination — confident fabrication:
User: When did Einstein win the Nobel Prize in Chemistry?
Model: Einstein won the Nobel Prize in Chemistry in 1925 for his work on
organic reaction mechanisms...
Einstein won the Physics prize in 1921 for the photoelectric effect. The model produced a plausible-sounding wrong answer because confident prose is what training rewarded.
29.1.3 The HHH Goal
Anthropic codified the alignment target as three properties, often called HHH:
| Property | Meaning |
|---|---|
| Helpful | Understand what the user actually wants and answer it |
| Harmless | Refuse genuinely dangerous requests without being paranoid |
| Honest | Acknowledge uncertainty rather than confabulate |
These seem obvious. Making a model satisfy all three simultaneously, without trading one against another, is the hard part.
29.1.4 InstructGPT: The Proof of Concept
In 2022, OpenAI published the InstructGPT paper. The result was counterintuitive: a 1.3B-parameter model fine-tuned with RLHF was preferred by human raters 71% of the time over the 175B GPT-3 responding to the same prompts.
Human preference comparison:
InstructGPT-1.3B preferred 71% of the time
GPT-3-175B preferred 29% of the time
A model 130 times smaller, evaluated as substantially better. The gap was not about knowledge or parameter count — it was about alignment. ChatGPT is essentially GPT-3.5 with this same RLHF pipeline applied.
29.2 The RLHF Pipeline
29.2.1 Three Stages
RLHF has three sequential stages. Each builds on the previous:
Stage 1: Supervised Fine-Tuning (SFT)
Input: (prompt, human-written response) pairs
Output: SFT model that knows the Q&A format
Stage 2: Reward Model (RM) Training
Input: (prompt, response_A, response_B, preference) tuples
Output: a scoring model that predicts human preference
Stage 3: RL Optimization via PPO
Input: prompts + RM feedback signal
Output: the aligned policy model
29.2.2 Stage 1: Supervised Fine-Tuning
The goal here is modest: get the model to respond in a helpful format at all.
Data: human-written demonstrations. OpenAI used about 13,000 pairs for InstructGPT. The annotators were 40 people who had passed a screening process for quality and consistency.
Training: standard cross-entropy fine-tuning on the response tokens:
# SFT training loop (simplified; in practice the loss is usually
# masked so that only the response tokens contribute)
for prompt, response in sft_dataset:
    input_ids = tokenize(f"User: {prompt}\nAssistant: {response}")
    logits = model(input_ids)
    loss = cross_entropy(logits[:-1], input_ids[1:])  # next-token prediction
    loss.backward()
What changes: the model learns to respond in the Q&A format, stay on topic, and follow instructions. The quality is variable and sometimes poor, but it is answering the right kind of question.
Quality over quantity: 13,000 high-quality demonstrations beat 130,000 scraped examples. Annotation guidelines matter more than scale here.
29.2.3 Stage 2: Training the Reward Model
Now we want a model that can automatically judge whether a response is good.
Data collection process:
For each prompt, the SFT model generates K responses (typically K = 4 to 9). Human annotators rank them. That ranking is converted into all pairwise comparisons: K responses yield K(K−1)/2 pairs.
Prompt: "Explain what a neural network is."
Response A: "A neural network is a system of interconnected nodes
inspired by biological neurons that learns from examples..."
Response B: "Neural network = brain simulator lol"
Response C: "A neural network is a mathematical model that consists of
layers of neurons connected by weights. Through training..."
Annotator ranking: A > C > B
Extracted pairs:
(A, C) → A wins
(A, B) → A wins
(C, B) → C wins
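The ranking-to-pairs extraction above can be sketched in a few lines (an illustrative helper, not from any library):

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Convert a ranked list (best first) into (winner, loser) pairs.

    K responses yield K*(K-1)/2 pairwise comparisons.
    """
    return [(winner, loser) for winner, loser in combinations(ranked_responses, 2)]

# Annotator ranking A > C > B from the example above:
pairs = ranking_to_pairs(["A", "C", "B"])
# → [("A", "C"), ("A", "B"), ("C", "B")]
```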
InstructGPT used roughly 33,000 comparison pairs from about 5,000 prompts.
Model architecture: the reward model is typically the SFT model with the language modeling head replaced by a single scalar output. Input is (prompt + response), output is one number representing quality.
Reward Model:
[prompt] [response]
↓
Transformer layers (usually initialized from SFT model)
↓
Linear head
↓
Scalar score r ∈ ℝ
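A minimal PyTorch sketch of this architecture (the backbone interface and all names here are illustrative assumptions, not a specific library's API):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Sketch: transformer backbone + scalar value head.

    `backbone` is assumed to be any module mapping input_ids of shape
    (batch, seq) to hidden states of shape (batch, seq, hidden).
    """
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids)               # (B, T, H)
        last_idx = attention_mask.sum(dim=1) - 1        # index of final real token
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]  # (B, H)
        return self.value_head(last_hidden).squeeze(-1) # (B,) scalar score r
```

The score is usually read off the hidden state of the final non-padding token, since that position has attended to the entire (prompt + response) sequence.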
Training objective — Bradley-Terry model:
We want the probability that response A is better than response B to depend on the gap in their scores:

P(A > B) = σ( r(x, A) − r(x, B) )

where σ is the sigmoid function. The training loss maximizes the log-likelihood of the human rankings:

L_RM = −E_{(x, y_w, y_l)}[ log σ( r(x, y_w) − r(x, y_l) ) ]

where y_w is the preferred (winner) response and y_l is the rejected (loser) response.
In code:
def compute_rm_loss(prompt, chosen, rejected, reward_model):
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    # log P(chosen > rejected) = log σ(r_chosen - r_rejected)
    loss = -torch.log(torch.sigmoid(r_chosen - r_rejected))
    return loss.mean()
A concrete example of the Bradley-Terry scoring:
If r(A) = 5, r(B) = 3:
P(A > B) = σ(5 - 3) = σ(2) ≈ 0.88
If r(A) = 3, r(B) = 5:
P(A > B) = σ(3 - 5) = σ(-2) ≈ 0.12
If r(A) = r(B) = 4:
P(A > B) = σ(0) = 0.50
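These numbers can be verified with a throwaway sketch (standalone, not part of any training pipeline):

```python
import math

def bt_prob(r_a: float, r_b: float) -> float:
    """P(A preferred over B) under Bradley-Terry: sigmoid of the score gap."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

print(round(bt_prob(5, 3), 2))  # σ(2)  ≈ 0.88
print(round(bt_prob(3, 5), 2))  # σ(-2) ≈ 0.12
print(bt_prob(4, 4))            # σ(0)  = 0.5
```

Note that only the gap matters: r(A) = 105, r(B) = 103 gives the same 0.88 as r(A) = 5, r(B) = 3, which is why reward model scores have no absolute scale.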
29.2.4 Stage 3: PPO Optimization
We now have a scorer. The third stage uses that scorer as a reward signal to improve the language model.
Why not just use supervised learning here? Because we do not have a "correct answer" — we have a quality signal. Open-ended generation does not have a unique right answer; it has better and worse outputs. Reinforcement learning is the right tool when reward is defined but ground truth is not.
The optimization objective:

maximize E_{x~D, y~π_policy}[ r(x, y) ] − β · D_KL( π_policy(y|x) ‖ π_SFT(y|x) )

Two terms in tension:
- Maximize reward: generate responses the RM scores highly
- KL penalty: do not drift too far from the SFT reference model
Why the KL penalty? Without it, the policy will find exploits in the reward model. Common reward hacking patterns in practice:
Discovered shortcuts without KL constraint:
- Longer is always better → model becomes verbose
- "As a helpful AI assistant..." opener → model uses it every time
- Confident tone scores higher → model stops expressing uncertainty
With KL constraint:
- Policy can improve but stays near SFT behavior
- Reward exploits are penalized by the KL term
KL divergence — token-level calculation:
import torch.nn.functional as F

def compute_kl(policy_logits, reference_logits):
    policy_probs = F.softmax(policy_logits, dim=-1)
    policy_logps = F.log_softmax(policy_logits, dim=-1)
    reference_logps = F.log_softmax(reference_logits, dim=-1)
    # KL(policy || reference): sum over the vocab, average over tokens
    kl = policy_probs * (policy_logps - reference_logps)
    return kl.sum(dim=-1).mean()
The coefficient β is typically initialized between 0.01 and 0.1. Many systems adjust it dynamically: increase it when KL grows too large, decrease it when the policy is too conservative. The target KL is often set to 6–10 nats.
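One common way to implement the dynamic adjustment is a proportional controller in the style of early RLHF work; this is a sketch with illustrative names and constants, not a specific library's implementation:

```python
class AdaptiveKLController:
    """Sketch: grow beta when observed KL exceeds the target, shrink it otherwise."""
    def __init__(self, init_beta: float = 0.05, target_kl: float = 6.0,
                 horizon: int = 10_000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon  # controls how aggressively beta moves

    def update(self, current_kl: float, n_steps: int) -> float:
        # Proportional error, clipped to [-0.2, 0.2] for stability
        error = max(min(current_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta
```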
Why PPO specifically? The LM action space is enormous (one action = one token from a vocab_size-way categorical). Naive policy gradient is unstable in this regime. PPO stabilizes by clipping the probability ratio between old and new policy:
PPO clip: ensure new_policy(a) / old_policy(a) ∈ [1 - ε, 1 + ε]
This prevents the policy from changing too quickly in a single update. PPO is also simpler to implement than TRPO (which requires the Fisher information matrix).
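The clipped surrogate can be written in a few lines. This is a minimal sketch with assumed tensor shapes, not a full PPO implementation (no value loss, entropy bonus, or advantage estimation):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps: float = 0.2):
    """PPO clipped surrogate objective (sketch).

    logp_new/logp_old: per-token log-probs of the sampled tokens under the
    new and old policies; advantages: estimated advantage per token.
    """
    ratio = torch.exp(logp_new - logp_old)                 # π_new(a) / π_old(a)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Take the pessimistic (smaller) objective, negate to get a loss
    return -torch.min(unclipped, clipped).mean()
```

Clipping removes the incentive to move the ratio outside [1 − ε, 1 + ε]: once a token's probability has shifted that far, pushing further yields no additional gain in the surrogate.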
29.2.5 Full Pipeline Summary
| Stage | Input | Output | Purpose |
|---|---|---|---|
| SFT | (prompt, response) pairs | SFT model | Learn the Q&A format |
| RM training | Preference comparisons | Reward model | Learn to score responses |
| PPO | Prompts + RM signal | Aligned model | Generate better responses |
29.3 Reward Model in Detail
29.3.1 Annotation Consistency
One real problem: annotators disagree. Different people have different standards for what is "helpful" or "appropriate."
Prompt: "Should society support genetic engineering?"
Annotator A (researcher): prefers the nuanced pro-science response
Annotator B (policy focus): prefers the cautious balanced response
Mitigations:
- Write explicit annotation rubrics
- Use majority vote from multiple annotators per pair
- Screen annotators for consistency before the main annotation run
29.3.2 Reward Hacking
Reward models are trained on a finite distribution. The policy will eventually find inputs that score high but are genuinely bad:
Common reward hacking patterns:
- Verbose responses get higher scores → rambling
- Lists look structured → model wraps everything in lists
- Self-confident tone scores higher → drops appropriate uncertainty
Defenses include diverse training data, adding adversarial examples, and refreshing the reward model as the policy drifts.
29.4 DPO: The Simpler Alternative
29.4.1 RLHF's Practical Problems
RLHF works. It is also expensive and fragile. In practice, running it requires:
- Four models simultaneously in GPU memory: the policy being trained, a frozen reference copy of the SFT model, the reward model, and a value (critic) model for PPO
- PPO hyperparameter tuning (learning rate, KL coefficient, GAE lambda)
- A rollout loop that generates responses at inference speed while training
For many teams, this complexity is prohibitive. The compute cost is roughly 3–4x the SFT-only baseline.
29.4.2 DPO's Core Insight
In 2023, the Stanford researchers behind DPO noticed that the RLHF objective has a closed-form solution. The optimal policy satisfies:

π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp( r(x, y) / β )

Rearranging, the reward can be expressed in terms of the policy ratio:

r(x, y) = β · log( π*(y|x) / π_ref(y|x) ) + β · log Z(x)

The reward is just the log probability ratio, scaled by β. The partition term β log Z(x) depends only on the prompt, so it cancels when two responses to the same prompt are compared. This means we can substitute the reward expression directly into the Bradley-Terry comparison probability and get a loss that only involves the policy model and the reference model — no separate reward model needed.
29.4.3 The DPO Loss
Substituting into the preference probability formula gives the DPO loss:

L_DPO = −E[ log σ( β log( π_θ(y_w|x) / π_ref(y_w|x) ) − β log( π_θ(y_l|x) / π_ref(y_l|x) ) ) ]
Intuition:
- Increase the relative probability of the preferred response
- Decrease the relative probability of the rejected response
- Both are measured relative to the reference model, so you do not drift far from the SFT starting point
29.4.4 DPO vs RLHF
| Property | RLHF | DPO |
|---|---|---|
| Pipeline stages | 3 | 1 |
| Models in memory | 4 (policy, ref, RM, critic) | 2 (ref, policy) |
| Training stability | PPO is finicky | Similar to SFT |
| Compute cost | ~3–4× SFT | ~1–1.5× SFT |
| Data needed | Same preference pairs | Same preference pairs |
| Practical quality | Very good | Close to or matching RLHF |
29.4.5 DPO Code
import torch
import torch.nn.functional as F

def compute_dpo_loss(
    policy_model,
    reference_model,
    chosen_input_ids,
    rejected_input_ids,
    beta: float = 0.1,
):
    policy_chosen_logps = get_log_probs(policy_model, chosen_input_ids)
    policy_rejected_logps = get_log_probs(policy_model, rejected_input_ids)

    with torch.no_grad():  # reference model is frozen
        ref_chosen_logps = get_log_probs(reference_model, chosen_input_ids)
        ref_rejected_logps = get_log_probs(reference_model, rejected_input_ids)

    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    logits = beta * (chosen_logratios - rejected_logratios)
    loss = -F.logsigmoid(logits).mean()
    return loss

def get_log_probs(model, input_ids):
    # Sum of per-token log-probs of the sequence under the model.
    # (In practice, prompt and padding tokens are masked out of the sum.)
    outputs = model(input_ids)
    logits = outputs.logits[:, :-1, :]
    labels = input_ids[:, 1:]
    log_probs = F.log_softmax(logits, dim=-1)
    selected = torch.gather(log_probs, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
    return selected.sum(dim=-1)
29.4.6 DPO with TRL
HuggingFace's TRL library ships a production-grade DPO trainer:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
import torch

model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Optional LoRA to reduce VRAM
lora_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Preference dataset: {"prompt": ..., "chosen": ..., "rejected": ...}
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

def format_dataset(example):
    return {
        "prompt": example["prompt"],
        "chosen": example["chosen"][1]["content"],
        "rejected": example["rejected"][1]["content"],
    }

dataset = dataset.map(format_dataset)

training_args = DPOConfig(
    output_dir="./dpo-mistral",
    beta=0.1,
    learning_rate=5e-7,  # smaller than SFT, usually 1e-7 to 5e-7
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    warmup_ratio=0.1,
    bf16=True,
    logging_steps=10,
    save_steps=500,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train_prefs"],  # this dataset's preference splits
    eval_dataset=dataset["test_prefs"],
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./dpo-mistral-final")
29.5 Other Alignment Methods
29.5.1 RLAIF: AI Feedback Instead of Human Feedback
Human annotation is expensive and slow. RLAIF (Reinforcement Learning from AI Feedback) replaces human comparisons with AI comparisons.
Constitutional AI (CAI), from Anthropic, is the best-known RLAIF approach:
- Define a "constitution" — a list of principles, such as: responses should be helpful, responses should not contain harmful content, responses should acknowledge uncertainty, responses should respect privacy.
- Let the AI critique its own initial response according to the constitution.
- Let the AI revise based on its critique.
- Use the AI-generated preference pairs to train a reward model.
Constitutional critique loop:
AI initial response: [possibly problematic answer]
AI self-critique:
"Does this response violate the principle
of not promoting harmful behavior?"
"Yes, because..."
AI revised response: [improved answer]
Advantages: much cheaper than human annotation, fast to iterate, principles are explicit and auditable.
Disadvantages: the AI's judgment inherits its own biases. Subtle value judgments can be wrong in ways that scale.
29.5.2 KTO: Single-Point Feedback
DPO requires paired comparisons: for every rejected response, there must be a chosen one. Collecting this is still costly.
KTO (Kahneman-Tversky Optimization) works with simpler data: just a label saying whether a response was good or bad.
DPO data format: (prompt, chosen_response, rejected_response)
KTO data format: (prompt, response, is_good) # is_good ∈ {0, 1}
KTO is motivated by prospect theory from behavioral economics: people are more sensitive to losses than to equivalent gains. The loss function encodes this asymmetry directly.
This is useful when you have click-through data or simple thumbs-up/thumbs-down signals without paired comparisons.
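A deliberately simplified sketch of the KTO idea follows. The actual KTO loss also subtracts a KL-based reference point z_ref estimated per batch (omitted here), and the λ weights for desirable vs. undesirable examples follow the paper's notation; treat this as an illustration of the asymmetric structure, not a faithful implementation:

```python
import torch

def kto_loss_sketch(policy_logps, ref_logps, is_good,
                    beta: float = 0.1,
                    lambda_d: float = 1.0, lambda_u: float = 1.0):
    """Simplified KTO-style loss on single-point feedback.

    is_good: 1 for thumbs-up responses, 0 for thumbs-down.
    """
    logratio = beta * (policy_logps - ref_logps)
    good = lambda_d * (1 - torch.sigmoid(logratio))    # push good responses up
    bad = lambda_u * (1 - torch.sigmoid(-logratio))    # push bad responses down
    return torch.where(is_good.bool(), good, bad).mean()
```

Setting λ_u > λ_d encodes the prospect-theory asymmetry directly: a bad response costs more than an equally good one earns.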
29.5.3 IPO: Identity Preference Optimization
DPO can overfit in some regimes — the policy log-ratio diverges on pairs where one response is strongly preferred. IPO addresses this with a smoother loss function that prevents the preferred/rejected margin from growing without bound.
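A sketch of the IPO objective, using the same log-ratio quantities as the DPO loss (a minimal illustration of the squared-loss form, not a full implementation):

```python
import torch

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """IPO sketch: a squared loss that pins the preference margin at
    1/(2*beta) instead of letting it grow without bound as in DPO."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    margin = chosen_logratios - rejected_logratios
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()
```

Once the margin reaches 1/(2β), the gradient is zero, so strongly-preferred pairs stop pulling the policy further from the reference.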
29.5.4 Method Comparison
| Method | Needs RM? | Needs RL? | Data format | Complexity |
|---|---|---|---|---|
| RLHF | Yes | Yes (PPO) | Preference pairs | High |
| DPO | No | No | Preference pairs | Low |
| RLAIF / CAI | Yes | Yes | AI-generated pairs | Medium |
| KTO | No | No | Binary signal | Low |
| IPO | No | No | Preference pairs | Low |
29.6 Real-World Practice
29.6.1 The Full Training Stack
A modern aligned LLM goes through roughly four phases:
Phase 1 — Pretraining
Data: Trillions of tokens from the web, books, code
Goal: Next-token prediction
Compute: Thousands of GPUs, months
Output: Base model (capable but unhelpful)
Phase 2 — Supervised Fine-Tuning
Data: Tens of thousands of (instruction, response) pairs
Goal: Learn the Q&A format and follow instructions
Compute: Tens of GPUs, days
Output: Instruct model (helpful, variable quality)
Phase 3 — Alignment (RLHF or DPO)
Data: Hundreds of thousands of preference comparisons
Goal: Increase quality, reduce harm, improve honesty
Compute: Tens of GPUs, days to weeks
Output: Aligned model (ChatGPT, Claude, etc.)
Phase 4 — Continuous iteration
Collect user feedback, identify failure modes, repeat
29.6.2 Open-Source Examples
LLaMA 2 Chat (Meta):
- SFT: ~27,540 high-quality conversations
- RLHF: ~1.4 million preference comparisons
- Five rounds of iterative RLHF
Zephyr (HuggingFace):
- Base: Mistral 7B
- SFT: UltraChat dataset
- DPO: UltraFeedback dataset
- Outcome: outperformed LLaMA 2 70B Chat on several benchmarks with a 7B model
OpenChat / Starling: uses conditioned reward fine-tuning (C-RLFT), mixing SFT and preference learning to approach GPT-3.5-class behavior at 7B.
29.6.3 Alignment Datasets
Commonly used open-source datasets:
| Dataset | Type | Scale | Use |
|---|---|---|---|
| OpenAssistant | SFT | 161K conversations | Multi-turn SFT |
| Dolly | SFT | 15K instructions | Instruction tuning |
| UltraChat | SFT | 1.5M conversations | Multi-turn SFT |
| UltraFeedback | Preference | 64K comparisons | DPO |
| Anthropic HH-RLHF | Preference | ~170K comparisons | RLHF / DPO, safety alignment (helpful + harmless splits) |
29.7 Chapter Summary
29.7.1 Key Concepts
| Concept | Meaning |
|---|---|
| Alignment | Making the model's behavior match human values and intent |
| HHH | Helpful, Harmless, Honest — the three alignment goals |
| RLHF | Train a reward model from human comparisons, then PPO-optimize the policy |
| SFT | Supervised fine-tuning on demonstration data |
| Reward Model | A model that outputs a scalar quality score |
| PPO | Policy gradient RL algorithm used to optimize the language model |
| DPO | Direct Preference Optimization — learns from preference pairs without a RM |
| KL constraint | Prevents the policy from drifting too far from the SFT reference |
29.7.2 Key Formulas
Bradley-Terry preference probability:

P(y_w > y_l | x) = σ( r(x, y_w) − r(x, y_l) )

RLHF objective:

maximize E[ r(x, y) ] − β · D_KL( π_policy(y|x) ‖ π_SFT(y|x) )

DPO loss:

L_DPO = −E[ log σ( β log( π_θ(y_w|x) / π_ref(y_w|x) ) − β log( π_θ(y_l|x) / π_ref(y_l|x) ) ) ]
29.7.3 My Take
The insight I keep coming back to: InstructGPT 1.3B beating GPT-3 175B on human preference is not a fluke. A small model that knows what you want is more useful than a large model that performs a statistical approximation of internet text. Alignment is not just ethics infrastructure — it is the difference between a tool people use and one they abandon.
DPO made this accessible. You can now run a serious preference-learning experiment on a single A100 node, with a public preference dataset, in under a day.
Chapter Checklist
After this chapter, you should be able to:
- Explain why a pretrained model is not automatically helpful (three problems).
- Describe the three stages of RLHF: SFT, reward modeling, and PPO.
- Explain the Bradley-Terry model and what it means for training the reward model.
- Explain why the KL penalty exists in PPO and what reward hacking looks like without it.
- Derive the DPO loss function from the RLHF objective.
- Compare DPO and RLHF on complexity, compute cost, and practical quality.
- Run a DPO training job using the TRL library.
See You in the Next Chapter
The model now behaves the way you want. But it activates every parameter for every token — even when the query is simple and most of that capacity is wasted.
Chapter 30 explains Mixture of Experts: how Mixtral 8x7B achieves LLaMA 70B quality while activating only 12.9B parameters per token, and how DeepSeek-V3 pushed this to 256 fine-grained experts with a single shared expert as a universal backbone.