One-sentence summary: Decoding strategy is the choice that converts a probability distribution over the entire vocabulary into a single token — greedy picks the peak, sampling rolls the dice, temperature reshapes the curve, and top-k/top-p trim the tails.
B.1 Why Decoding Strategy Matters
B.1.1 From Logits to Token
Every forward pass through a Transformer decoder ends at the same place: a logits vector with one entry per token in the vocabulary.
logits = [2.1, -0.5, 1.3, 0.8, ..., -1.2] # length = vocab_size
Each value is an unnormalized score. Before you can sample, you convert to a probability distribution through Softmax:
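As a minimal sketch (PyTorch, with a toy five-token vocabulary), the conversion is a single call:

```python
import torch
import torch.nn.functional as F

# Toy logits: one unnormalized score per vocabulary entry
logits = torch.tensor([2.1, -0.5, 1.3, 0.8, -1.2])

# Softmax exponentiates each score and normalizes, producing a distribution
probs = F.softmax(logits, dim=-1)

print(probs.sum().item())  # ≈ 1.0 — a valid probability distribution
```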
Step one — Softmax — is fixed. Step two — picking a token from that distribution — is where decoding strategy lives. This appendix covers every practical option.
B.1.2 Strategy Trade-offs at a Glance
| Strategy | Determinism | Diversity | Risk |
|---|---|---|---|
| Greedy | high | low | repetition, monotony |
| Random sampling | low | high | incoherent tokens |
| Top-K | medium | medium | fixed K mismatches distribution shape |
| Top-P | medium | medium | well-balanced in practice |
| Beam search | high | low | expensive, safe but dull |
B.2 Greedy Decoding
B.2.1 How It Works
The simplest strategy: pick the highest-probability token at every step.
```python
import torch

def greedy_decode(logits: torch.Tensor) -> int:
    """
    logits: 1-D tensor of shape (vocab_size,)
    Returns the token ID with the highest logit.
    """
    return torch.argmax(logits).item()
```
You do not even need Softmax here — argmax on logits gives the same answer as argmax on probabilities, because Softmax is monotone.
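That claim is cheap to verify with toy logits:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.1, -0.5, 1.3, 0.8, -1.2])
probs = F.softmax(logits, dim=-1)

# Softmax preserves ordering, so both argmaxes pick the same token
print(torch.argmax(logits).item() == torch.argmax(probs).item())  # True
```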
B.2.2 Worked Example
Prompt: "The agent opened a pull"
Step 1:
logits → softmax → {request: 0.51, comment: 0.22, merge: 0.12, ...}
greedy picks: "request"
Prompt: "The agent opened a pull request"
Step 2:
logits → softmax → {and: 0.34, .": 0.29, to: 0.18, ...}
greedy picks: "and"
Final output: "The agent opened a pull request and ..."
Same input always produces the same output. No randomness.
B.2.3 When to Use Greedy
Good fit:
- Code completion where you need a single correct answer
- Factual lookups and structured extraction
- Any pipeline where downstream code parses the output deterministically
Poor fit:
- Creative generation — greedy loops badly once it falls into a high-probability rut
- Dialogue where response variety matters
- Tasks where the globally best sequence is not the locally best token at each step
B.2.4 Greedy's Failure Mode: Repetition
Because greedy always follows the mode of the distribution, it can spiral into loops like:
"The function returns the value. The function returns the value. The function..."
This is not a bug in the model — it is a bug in the decoding choice. The repetition penalties in section B.9 exist precisely to break this.
B.3 Random Sampling
B.3.1 How It Works
Instead of taking the argmax, draw one token at random from the full probability distribution.
```python
import torch
import torch.nn.functional as F

def random_sample(logits: torch.Tensor) -> int:
    """
    logits: 1-D tensor of shape (vocab_size,)
    Returns a token ID sampled from the softmax distribution.
    """
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```
B.3.2 Worked Example
probs = {request: 0.40, comment: 0.30, review: 0.15, merge: 0.10, issue: 0.05}
Each run might return a different token:
- "request" (40% probability)
- "comment" (30% probability)
- "review" (15% probability)
- "merge" (10% probability)
- "issue" ( 5% probability)
The same prompt can produce different completions on every call.
B.3.3 Trade-offs
Advantages:
- Diverse, varied outputs
- Avoids deterministic repetition loops
- Reflects the model's full uncertainty about what comes next
Disadvantages:
- Low-probability tokens can be selected, producing incoherent text
- Outputs are not reproducible without fixing the random seed
- Unmodified raw sampling is rarely used in production; you almost always pair it with temperature, top-k, or top-p
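One mitigation for the reproducibility point: pass an explicitly seeded generator when sampling. A sketch (the helper name is made up):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])

def sample_with_seed(seed: int) -> int:
    # A dedicated generator avoids touching global RNG state
    gen = torch.Generator().manual_seed(seed)
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=gen).item()

# Same seed → same token on every call
print(sample_with_seed(42) == sample_with_seed(42))  # True
```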
B.4 Temperature
B.4.1 How It Works
Temperature rescales the logits before Softmax, sharpening or flattening the distribution:
```python
import torch
import torch.nn.functional as F

def softmax_with_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """
    temperature > 1 → flatter distribution (more random)
    temperature < 1 → sharper distribution (more deterministic)
    temperature → 0 → equivalent to greedy (argmax)
    """
    scaled_logits = logits / temperature
    return F.softmax(scaled_logits, dim=-1)
```
The math is straightforward: divide every logit by T before exponentiating. High T suppresses differences between logits; low T amplifies them.
B.4.2 Numerical Table
Using a fixed logits array [2.0, 1.0, 0.5]:
| Temperature | Softmax result (approx.) | Character |
|---|---|---|
| 0.1 | [1.00, 0.00, 0.00] | almost deterministic |
| 0.5 | [0.84, 0.11, 0.04] | strong preference for top token |
| 1.0 | [0.63, 0.23, 0.14] | raw model distribution |
| 2.0 | [0.48, 0.29, 0.23] | noticeably flatter |
| 10.0 | [0.36, 0.33, 0.31] | nearly uniform |
Note: values are softmax(logits / T), rounded to two decimal places.
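The table can be reproduced in a few lines (same fixed logits; values match to two decimals):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])

for t in [0.1, 0.5, 1.0, 2.0, 10.0]:
    probs = F.softmax(logits / t, dim=-1)
    print(t, [round(p, 2) for p in probs.tolist()])
# e.g. 1.0 → [0.63, 0.23, 0.14]; 10.0 → [0.36, 0.33, 0.31]
```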
Intuition:
- T < 1: the model becomes more decisive — the likely token gets even more probability mass
- T = 1: no change; the raw distribution is preserved
- T > 1: the model becomes more exploratory — low-probability tokens get a bigger slice
B.4.3 Choosing Temperature
| Temperature | Effect | Typical use |
|---|---|---|
| 0 (or → 0) | greedy, fully deterministic | code generation, structured extraction |
| 0.1 – 0.3 | very confident | factual Q&A, retrieval |
| 0.5 – 0.7 | confident with variation | general assistant dialogue |
| 0.8 – 1.0 | balanced | creative writing with quality guardrails |
| 1.0 – 1.5 | exploratory | brainstorming, story drafts |
| > 1.5 | highly random | experimental only — output often degrades |
B.4.4 Temperature = 0
As T approaches zero, the scaled logits diverge to ±∞, and Softmax collapses to a point mass on the argmax. Many APIs treat temperature=0 as an exact synonym for greedy decoding.
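A common implementation pattern is to special-case zero before dividing; this sketch (the function name is illustrative) mirrors that convention:

```python
import torch
import torch.nn.functional as F

def sample_with_temperature(logits: torch.Tensor, temperature: float) -> int:
    # T = 0 would divide by zero; treat it as greedy, matching API conventions
    if temperature == 0:
        return torch.argmax(logits).item()
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.tensor([2.0, 1.0, 0.5])
print(sample_with_temperature(logits, 0))  # 0 — always the argmax
```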
B.5 Top-K Sampling
B.5.1 How It Works
Keep only the K tokens with the highest logits, zero out the rest, then sample from the survivors:
```python
import torch
import torch.nn.functional as F

def top_k_sample(logits: torch.Tensor, k: int, temperature: float = 1.0) -> int:
    """
    logits: 1-D tensor of shape (vocab_size,)
    k: number of top tokens to keep
    temperature: applied before softmax over the kept tokens
    """
    # 1. Keep the k largest logits and their vocabulary indices
    top_k_values, top_k_indices = torch.topk(logits, k)
    # 2. Apply temperature and normalize over the top-k only
    top_k_probs = F.softmax(top_k_values / temperature, dim=-1)
    # 3. Sample from the reduced distribution
    sampled_pos = torch.multinomial(top_k_probs, num_samples=1).item()
    return top_k_indices[sampled_pos].item()
```
B.5.2 Worked Example
Full distribution:
A: 0.40, B: 0.30, C: 0.15, D: 0.08, E: 0.05, F: 0.02
Top-K with K=3:
Kept: A: 0.40, B: 0.30, C: 0.15
Renormalized: A: 0.47, B: 0.35, C: 0.18
Sample only from {A, B, C} — D, E, F are excluded entirely.
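A quick check of the renormalization arithmetic above:

```python
import torch

probs = torch.tensor([0.40, 0.30, 0.15, 0.08, 0.05, 0.02])  # A..F
kept = probs[:3]                # top-3: A, B, C
renorm = kept / kept.sum()      # divide by the kept mass, 0.85
print([round(p, 2) for p in renorm.tolist()])  # [0.47, 0.35, 0.18]
```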
B.5.3 Choosing K
| K value | Effect |
|---|---|
| 1 | equivalent to greedy |
| 10 – 50 | common production range |
| 100+ | approaches unconstrained sampling |
Rule of thumb: K = 40 to 50 is the default in many open-source inference stacks (LLaMA, Mistral defaults).
B.5.4 Top-K's Limitation
K is a fixed number, but the distribution shape varies wildly across token positions. Consider two extremes:
Peaked distribution:
A: 0.95, B: 0.03, C: 0.01, D: 0.005, ...
K=50 keeps 49 tokens with a combined probability < 0.05.
Those tail tokens should not be candidates at all.
Flat distribution:
A: 0.10, B: 0.09, C: 0.08, D: 0.08, E: 0.07, ...
K=50 may cut off tokens that are perfectly reasonable alternatives.
Fixed K does not adapt to the entropy of the distribution. That is the problem Top-P was designed to solve.
B.6 Top-P (Nucleus) Sampling
B.6.1 How It Works
Top-P keeps the smallest set of tokens whose cumulative probability reaches P, then samples from that set. The candidate set size adapts automatically.
```python
import torch
import torch.nn.functional as F

def top_p_sample(logits: torch.Tensor, p: float, temperature: float = 1.0) -> int:
    """
    logits: 1-D tensor of shape (vocab_size,)
    p: cumulative probability threshold (e.g. 0.9)
    temperature: applied before softmax
    """
    # 1. Apply temperature and compute probabilities
    probs = F.softmax(logits / temperature, dim=-1)
    # 2. Sort descending
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    # 3. Cumulative sum
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
    # 4. Mask tokens whose preceding cumulative mass already reaches p
    #    (subtracting sorted_probs keeps the token that pushes us over p)
    cutoff_mask = (cumulative_probs - sorted_probs) >= p
    sorted_probs[cutoff_mask] = 0.0
    # 5. Renormalize and sample
    sorted_probs = sorted_probs / sorted_probs.sum()
    sampled_pos = torch.multinomial(sorted_probs, num_samples=1).item()
    return sorted_indices[sampled_pos].item()
```
B.6.2 Worked Example
Sorted distribution:
A: 0.40, B: 0.30, C: 0.15, D: 0.08, E: 0.05, F: 0.02
Cumulative:
0.40, 0.70, 0.85, 0.93, 0.98, 1.00
Top-P = 0.90:
Keep A, B, C, D (cumulative 0.93 ≥ 0.90).
Nucleus size = 4.
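The nucleus can be computed directly from the cumulative sums (a verification sketch for this example):

```python
import torch

probs = torch.tensor([0.40, 0.30, 0.15, 0.08, 0.05, 0.02])  # sorted descending
cumulative = torch.cumsum(probs, dim=-1)
# Keep every token whose cumulative total *before* it is still under p
nucleus_mask = (cumulative - probs) < 0.90
print(nucleus_mask.sum().item())  # 4 — A, B, C, D
```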
B.6.3 Top-P Adapts to Distribution Shape
This is what Top-K cannot do:
Peaked distribution:
A: 0.95, B: 0.03, ...
Top-P = 0.90 → nucleus = {A} only (size 1)
Flat distribution:
A: 0.10, B: 0.09, C: 0.08, ...
Top-P = 0.90 → nucleus = ~15 tokens (size adjusts automatically)
The nucleus contracts when the model is confident and expands when the model is uncertain. That is the right behavior.
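The same logic makes the adaptation easy to see numerically — a sketch with two made-up distributions:

```python
import torch

def nucleus_size(probs: torch.Tensor, p: float) -> int:
    sorted_probs, _ = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Count tokens whose preceding cumulative mass is still under p
    return int(((cumulative - sorted_probs) < p).sum().item())

peaked = torch.tensor([0.95, 0.03, 0.01, 0.005, 0.005])
flat = torch.full((25,), 0.04)  # 25 equally likely tokens

print(nucleus_size(peaked, 0.90))  # 1 — only the top token
print(nucleus_size(flat, 0.90))    # 23 — nearly the whole vocabulary
```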
B.6.4 Choosing P
| P value | Effect |
|---|---|
| 0.1 – 0.5 | small nucleus, conservative |
| 0.8 – 0.95 | standard production range |
| 1.0 | equivalent to unconstrained sampling |
Rule of thumb: P = 0.9 or P = 0.95 is the default in most production configurations. The Holtzman et al. 2019 paper that introduced nucleus sampling recommended P = 0.95 for open-ended generation.
B.7 Beam Search
B.7.1 How It Works
Instead of committing to one token at each step, beam search maintains B candidate sequences in parallel — the "beam". At every step, each beam is extended by every possible next token, and only the top-B sequences (by cumulative log-probability) survive.
B.7.2 Algorithm
```python
from typing import List, Tuple

def beam_search(
    model,
    prompt: List[int],
    beam_width: int,
    max_length: int,
    alpha: float = 0.6,  # length normalization exponent
) -> List[int]:
    """
    Returns the highest-scoring sequence under length-normalized beam search.
    model: callable that takes token IDs and returns log-probabilities over vocab.
    Illustrative only: it expands the full vocabulary in Python and omits
    EOS handling; real implementations batch this work on the accelerator.
    """
    # Each beam is (sequence, cumulative_log_prob)
    beams: List[Tuple[List[int], float]] = [(prompt[:], 0.0)]

    for step in range(max_length):
        all_candidates: List[Tuple[List[int], float]] = []
        for seq, score in beams:
            log_probs = model(seq)  # shape: (vocab_size,)
            # Expand: consider every possible next token
            for token_id, lp in enumerate(log_probs):
                all_candidates.append((seq + [token_id], score + lp))

        # Rank by length-normalized score and keep the top beam_width
        def length_normalized_score(candidate):
            seq, score = candidate
            length = len(seq) - len(prompt)
            return score / (max(length, 1) ** alpha)

        all_candidates.sort(key=length_normalized_score, reverse=True)
        beams = all_candidates[:beam_width]

    # Return the top sequence
    return beams[0][0]
```
B.7.3 Length Normalization
Without correction, beam search favors shorter sequences because each step multiplies probabilities (< 1), so longer sequences accumulate lower raw scores. The standard fix:
score(y) = log P(y) / |y|^α

where α = 0.6 is a common default, popularized by Google's Neural Machine Translation system (which used a closely related length-penalty term). Higher α normalizes more aggressively, counteracting the short-sequence bias more strongly; α = 0 disables normalization entirely.
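A tiny numeric illustration with made-up per-token probabilities shows the bias and the fix:

```python
import math

# Made-up candidates: 4 tokens at p=0.5 each vs 10 tokens at p=0.7 each
short_raw = 4 * math.log(0.5)   # ≈ -2.77 — wins on raw score
long_raw = 10 * math.log(0.7)   # ≈ -3.57 — better per token, worse raw total

alpha = 0.6
def normalized(score: float, length: int) -> float:
    return score / (length ** alpha)

print(short_raw > long_raw)                                  # True: raw score prefers short
print(normalized(long_raw, 10) > normalized(short_raw, 4))   # True: normalization prefers long
```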
B.7.4 When Beam Search Is Appropriate
Strong fit:
- Machine translation (beam width 4–6 is standard)
- Abstractive summarization
- Structured generation with a known grammar (constrained decoding)
Poor fit:
- Open-ended chat and instruction following — beam search produces fluent but bland outputs that avoid all risk
- Any task where diversity across multiple generations matters
Modern LLM serving stacks (vLLM, TGI, llama.cpp) do not use beam search for chat models. They use Top-P + Temperature sampling. Beam search remains in NLP pipelines where the output has a single objectively correct form.
B.8 Combination Strategies
B.8.1 The Two Common Combinations
In practice you almost always combine multiple strategies. Here are the two standard patterns:
```python
import torch
import torch.nn.functional as F

# --- Helper: apply Top-K filter ---
def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Mask all logits except the top k to -inf."""
    if k <= 0:
        return logits
    top_k_values = torch.topk(logits, k).values
    threshold = top_k_values[-1]  # k-th largest logit
    return torch.where(logits >= threshold, logits, torch.full_like(logits, float('-inf')))

# --- Helper: apply Top-P filter ---
def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Zero out all probability mass outside the nucleus."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    mask = (cumulative - sorted_probs) >= p
    sorted_probs[mask] = 0.0
    # Scatter back to original order
    filtered = torch.zeros_like(probs)
    filtered.scatter_(0, sorted_indices, sorted_probs)
    return torch.log(filtered + 1e-10)  # back to logit space (approx)

# --- Pattern 1: Top-P + Temperature ---
def decode_top_p_temperature(
    logits: torch.Tensor,
    p: float = 0.9,
    temperature: float = 0.7,
) -> int:
    """Standard open-ended generation config."""
    filtered = top_p_filter(logits, p)
    probs = F.softmax(filtered / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# --- Pattern 2: Top-K + Top-P + Temperature ---
def decode_top_k_top_p_temperature(
    logits: torch.Tensor,
    k: int = 50,
    p: float = 0.9,
    temperature: float = 0.7,
) -> int:
    """Hugging Face generate()-style config."""
    # Narrow to top-k first, then apply the nucleus threshold, then temperature
    filtered = top_k_filter(logits, k)
    filtered = top_p_filter(filtered, p)
    probs = F.softmax(filtered / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```
The ordering matters: Top-K first, then Top-P. This way Top-K sets a hard ceiling on candidates and Top-P can only make the nucleus smaller, never larger.
B.8.2 Default Configurations in the Wild
Different inference stacks ship different defaults. These are the de facto settings you will encounter when you first hit an API:
| Stack / API | temperature | top_p | top_k | Notes |
|---|---|---|---|---|
| OpenAI API (chat) | 1.0 | 1.0 | — | Both defaults are "off"; users expected to tune |
| Anthropic Claude API | 1.0 | — | — | Temperature-only sampling at 1.0; nucleus/top-k left to the user |
| Hugging Face generate() | 1.0 | 0.9 | 50 | Top-K + Top-P both active by default |
| llama.cpp | 0.8 | 0.95 | 40 | Conservative defaults for local use |
| LLaMA reference inference | 0.6 | 0.9 | — | Meta's published defaults for Llama 3 |
| Mistral reference | 0.7 | 1.0 | 50 | Top-K active, Top-P off by default |
B.8.3 Recommended Starting Points by Task
| Task | Temperature | Top-P | Top-K | Notes |
|---|---|---|---|---|
| Code generation | 0.2 | 0.95 | — | Low T, wide nucleus |
| Factual Q&A | 0 or 0.3 | — | — | Greedy or near-greedy |
| Instruction following | 0.7 | 0.9 | — | Standard assistant config |
| Creative writing | 0.9 – 1.0 | 0.9 | — | Let the model explore |
| Brainstorming | 1.0 – 1.2 | 0.95 | — | More temperature, wider nucleus |
| Machine translation | beam_width=4, α=0.6 | — | — | Beam search still preferred |
B.9 Repetition Penalty
B.9.1 The Problem
LLMs can fall into loops:
"The agent opened the pull request. The agent opened the pull request.
The agent opened the pull request. The agent..."
This is a greedy decoding failure mode, but it also appears in sampling when the high-probability token happens to be the one that was just generated. The fix is to penalize tokens that have already appeared.
B.9.2 Implementation
```python
import torch

def apply_repetition_penalty(
    logits: torch.Tensor,
    generated_token_ids: list,
    penalty: float = 1.2,
) -> torch.Tensor:
    """
    For each token that has already been generated, reduce its logit.
    Tokens with positive logits are divided by penalty (reducing them).
    Tokens with negative logits are multiplied by penalty (pushing them more negative).
    penalty = 1.0 → no change.
    penalty = 1.2 → common default (shrinks a positive logit by roughly 17%).
    """
    logits = logits.clone()
    for token_id in set(generated_token_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits
```
The set() deduplication means frequency does not matter — seeing a token once applies the same penalty as seeing it ten times. For frequency-sensitive behavior, see B.11.
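Applying the penalty to a toy logits vector makes the arithmetic concrete (the update rule is inlined so the snippet stands alone):

```python
import torch

logits = torch.tensor([2.4, 1.0, -0.5])
penalty = 1.2
seen = {0, 2}  # tokens 0 and 2 were already generated

penalized = logits.clone()
for t in seen:
    if penalized[t] > 0:
        penalized[t] /= penalty   # 2.4 → 2.0
    else:
        penalized[t] *= penalty   # -0.5 → -0.6

print([round(v, 2) for v in penalized.tolist()])  # [2.0, 1.0, -0.6]
```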
B.10 Presence Penalty
B.10.1 What It Does
Presence penalty applies a constant subtraction to every token that has appeared anywhere in the generated text, regardless of how many times it appeared.
```python
import torch

def apply_presence_penalty(
    logits: torch.Tensor,
    generated_token_ids: list,
    presence_penalty: float = 0.6,
) -> torch.Tensor:
    """
    Subtract presence_penalty from the logit of every token seen at least once.
    presence_penalty = 0 → no effect
    presence_penalty = 0.6 → common default (OpenAI API)
    presence_penalty = 2.0 → strong avoidance of any repeated token
    """
    logits = logits.clone()
    for token_id in set(generated_token_ids):
        logits[token_id] -= presence_penalty
    return logits
```
This is the OpenAI API presence_penalty parameter. Presence penalty encourages the model to introduce new topics — it does not scale with repetition count.
B.11 Frequency Penalty
B.11.1 What It Does
Frequency penalty applies a penalty proportional to how many times each token has already appeared. The more a token repeats, the more it is suppressed.
```python
import torch
from collections import Counter

def apply_frequency_penalty(
    logits: torch.Tensor,
    generated_token_ids: list,
    frequency_penalty: float = 0.5,
) -> torch.Tensor:
    """
    Subtract frequency_penalty * count[token] from each token's logit.
    frequency_penalty = 0 → no effect
    frequency_penalty = 0.5 → common default
    frequency_penalty = 2.0 → strong suppression of repeated tokens
    """
    logits = logits.clone()
    token_counts = Counter(generated_token_ids)
    for token_id, count in token_counts.items():
        logits[token_id] -= frequency_penalty * count
    return logits
```
This is the OpenAI API frequency_penalty parameter. Unlike presence penalty, frequency penalty keeps subtracting the more times a token appears — a token seen ten times gets ten times the penalty of a token seen once.
B.11.2 Penalty Parameter Reference
| Parameter | Typical range | Neutral value | Common default |
|---|---|---|---|
| repetition_penalty | 1.0 – 1.5 | 1.0 (multiplicative) | 1.2 |
| presence_penalty | 0 – 2.0 | 0 | 0.6 |
| frequency_penalty | 0 – 2.0 | 0 | 0.5 |
All three can be combined. In practice, presence penalty and frequency penalty are more common in API-facing products; repetition penalty is more common in open-source local inference stacks.
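As a sketch of combining the two additive penalties in one pass (the helper name is made up; parameter semantics follow the definitions above):

```python
import torch
from collections import Counter

def apply_penalties(
    logits: torch.Tensor,
    generated: list,
    presence_penalty: float = 0.6,
    frequency_penalty: float = 0.5,
) -> torch.Tensor:
    """Additive presence + frequency penalties in a single pass."""
    logits = logits.clone()
    counts = Counter(generated)
    for token_id, count in counts.items():
        # Presence: flat hit for appearing at all; frequency: scales with count
        logits[token_id] -= presence_penalty + frequency_penalty * count
    return logits

logits = torch.tensor([3.0, 2.0, 1.0, 0.0])
generated = [0, 0, 0, 1]  # token 0 seen three times, token 1 once

out = apply_penalties(logits, generated)
print([round(v, 2) for v in out.tolist()])  # [0.9, 0.9, 1.0, 0.0]
```

Note how the frequently repeated token 0 loses far more logit mass (3.0 → 0.9) than token 1, which appeared only once (2.0 → 0.9 from a lower starting point).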
B.12 Appendix Summary
| Strategy | Core idea | When to use |
|---|---|---|
| Greedy | argmax every step | code, deterministic tasks |
| Random sampling | multinomial draw | rarely alone; base for the rest |
| Temperature | scale logits by 1/T before softmax | always; tune first |
| Top-K | keep top K candidates | paired with Top-P for belt-and-suspenders |
| Top-P | keep nucleus at cumulative P | preferred over Top-K alone |
| Beam search | track B best sequences | translation, constrained generation |
| Repetition penalty | multiplicative penalty on seen tokens | local inference defaults |
| Presence penalty | flat subtract for any seen token | API-level topic diversity |
| Frequency penalty | count-scaled subtract for seen tokens | API-level repetition control |
Further Reading
- The Curious Case of Neural Text Degeneration (Holtzman et al., 2019) — the paper that introduced nucleus (Top-P) sampling and named the degeneration problem
- Hierarchical Neural Story Generation (Fan et al., 2018) — original Top-K sampling applied to story generation
- CTRL: A Conditional Transformer Language Model for Controllable Generation (Keskar et al., 2019) — discusses repetition penalty in the context of controllable generation
Decoding decisions are a product surface. The defaults your serving stack ships with will quietly shape how your model feels to users. Appendix C answers the questions that come up most often.