One-sentence summary: Decoding strategy is the choice that converts a probability distribution over the entire vocabulary into a single token — greedy picks the peak, sampling rolls the dice, temperature reshapes the curve, and top-k/top-p trim the tails.


B.1 Why Decoding Strategy Matters

B.1.1 From Logits to Token

Every forward pass through a Transformer decoder ends at the same place: a logits vector with one entry per token in the vocabulary.

logits = [2.1, -0.5, 1.3, 0.8, ..., -1.2]   # length = vocab_size

Each value is an unnormalized score. Before you can sample, you convert to a probability distribution through Softmax:

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}

Step one — Softmax — is fixed. Step two — picking a token from that distribution — is where decoding strategy lives. This appendix covers every practical option.
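To make the two steps concrete, here is a minimal sketch with a toy five-token vocabulary (the logit values are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.1, -0.5, 1.3, 0.8, -1.2])  # toy 5-token vocab

# Step one: fixed -- normalize the logits to a probability distribution.
probs = F.softmax(logits, dim=-1)

# Step two: the decoding strategy -- here, two of the options covered below.
greedy_token = torch.argmax(probs).item()                       # always 0
sampled_token = torch.multinomial(probs, num_samples=1).item()  # varies per call
```

Everything in this appendix is a variation on that second line.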

B.1.2 Strategy Trade-offs at a Glance

Strategy         Deterministic  Diversity  Risk
Greedy           high           low        repetition, monotony
Random sampling  low            high       incoherent tokens
Top-K            medium         medium     fixed K mismatches distribution shape
Top-P            medium         medium     well-balanced in practice
Beam search      high           low        expensive, safe but dull

B.2 Greedy Decoding

B.2.1 How It Works

The simplest strategy: pick the highest-probability token at every step.

import torch
import torch.nn.functional as F

def greedy_decode(logits: torch.Tensor) -> int:
    """
    logits: 1-D tensor of shape (vocab_size,)
    Returns the token ID with the highest logit.
    """
    return torch.argmax(logits).item()

You do not even need Softmax here — argmax on logits gives the same answer as argmax on probabilities, because Softmax is monotone.
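A quick sanity check of that claim, using toy logits:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.1, -0.5, 1.3, 0.8, -1.2])

# Softmax is strictly increasing, so it cannot change which entry is largest.
assert torch.argmax(logits).item() == torch.argmax(F.softmax(logits, dim=-1)).item()
```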

B.2.2 Worked Example

Prompt: "The agent opened a pull"
Step 1:
  logits → softmax → {request: 0.51, comment: 0.22, merge: 0.12, ...}
  greedy picks: "request"

Prompt: "The agent opened a pull request"
Step 2:
  logits → softmax → {and: 0.34, .": 0.29, to: 0.18, ...}
  greedy picks: "and"

Final output: "The agent opened a pull request and ..."

Same input always produces the same output. No randomness.

B.2.3 When to Use Greedy

Good fit:

  • Code completion where you need a single correct answer
  • Factual lookups and structured extraction
  • Any pipeline where downstream code parses the output deterministically

Poor fit:

  • Creative generation — greedy loops badly once it falls into a high-probability rut
  • Dialogue where response variety matters
  • Tasks where the globally best sequence is not the locally best token at each step

B.2.4 Greedy's Failure Mode: Repetition

Because greedy always follows the mode of the distribution, it can spiral into loops like:

"The function returns the value. The function returns the value. The function..."

This is not a bug in the model — it is a bug in the decoding choice. The repetition penalties in section B.9 exist precisely to break this.


B.3 Random Sampling

B.3.1 How It Works

Instead of taking the argmax, draw one token at random from the full probability distribution.

import torch
import torch.nn.functional as F

def random_sample(logits: torch.Tensor) -> int:
    """
    logits: 1-D tensor of shape (vocab_size,)
    Returns a token ID sampled from the softmax distribution.
    """
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

B.3.2 Worked Example

probs = {request: 0.40, comment: 0.30, review: 0.15, merge: 0.10, issue: 0.05}

Each run might return a different token:
  - "request"  (40% probability)
  - "comment"  (30% probability)
  - "review"   (15% probability)
  - "merge"    (10% probability)
  - "issue"    ( 5% probability)

The same prompt can produce different completions on every call.

B.3.3 Trade-offs

Advantages:

  • Diverse, varied outputs
  • Avoids deterministic repetition loops
  • Reflects the model's full uncertainty about what comes next

Disadvantages:

  • Low-probability tokens can be selected, producing incoherent text
  • Outputs are not reproducible without fixing the random seed
  • Unmodified raw sampling is rarely used in production; you almost always pair it with temperature, top-k, or top-p
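The reproducibility point is worth demonstrating: fixing the generator seed makes sampling repeatable (a minimal sketch using PyTorch's global seeding; real serving code typically uses a per-request `torch.Generator` instead):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

def sample(seed: int) -> int:
    torch.manual_seed(seed)  # reseed before every draw
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Same seed, same draw; different seeds may differ.
assert sample(0) == sample(0)
assert sample(42) == sample(42)
```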

B.4 Temperature

B.4.1 How It Works

Temperature rescales the logits before Softmax, sharpening or flattening the distribution:

import torch
import torch.nn.functional as F

def softmax_with_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """
    temperature > 1  →  flatter distribution (more random)
    temperature < 1  →  sharper distribution (more deterministic)
    temperature → 0  →  equivalent to greedy (argmax)
    """
    scaled_logits = logits / temperature
    return F.softmax(scaled_logits, dim=-1)

The math is straightforward: divide every logit by T before exponentiating. High T suppresses differences between logits; low T amplifies them.

B.4.2 Numerical Table

Using a fixed logits array [2.0, 1.0, 0.5]:

Temperature  Softmax result (approx.)  Character
0.1          [1.00, 0.00, 0.00]        almost deterministic
0.5          [0.84, 0.11, 0.04]        strong preference for top token
1.0          [0.63, 0.23, 0.14]        raw model distribution
2.0          [0.48, 0.29, 0.23]        noticeably flatter
10.0         [0.36, 0.33, 0.31]        nearly uniform

Note: values are softmax(logits / T), rounded to two decimal places.
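The table can be reproduced directly from the formula:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])

for t in [0.1, 0.5, 1.0, 2.0, 10.0]:
    probs = F.softmax(logits / t, dim=-1)
    print(t, [round(p, 2) for p in probs.tolist()])
# At T=1.0 this prints [0.63, 0.23, 0.14]; at T=10.0, [0.36, 0.33, 0.31].
```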

Intuition:

  • T < 1: the model becomes more decisive — the likely token gets even more probability mass
  • T = 1: no change; the raw distribution is preserved
  • T > 1: the model becomes more exploratory — low-probability tokens get a bigger slice

B.4.3 Choosing Temperature

Temperature  Effect                       Typical use
0 (or → 0)   greedy, fully deterministic  code generation, structured extraction
0.1 – 0.3    very confident               factual Q&A, retrieval
0.5 – 0.7    confident with variation     general assistant dialogue
0.8 – 1.0    balanced                     creative writing with quality guardrails
1.0 – 1.5    exploratory                  brainstorming, story drafts
> 1.5        highly random                experimental only; output often degrades

B.4.4 Temperature = 0

As T approaches zero, the scaled logits diverge to ±∞, and Softmax collapses to a point mass on the argmax. Many APIs treat temperature=0 as an exact synonym for greedy decoding.
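Since dividing by zero is undefined, practical implementations special-case it. A sketch of that convention (not any particular API's code):

```python
import torch
import torch.nn.functional as F

def sample_with_temperature(logits: torch.Tensor, temperature: float) -> int:
    if temperature == 0.0:
        # Convention: T = 0 means greedy decoding.
        return torch.argmax(logits).item()
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

assert sample_with_temperature(torch.tensor([2.0, 1.0, 0.5]), 0.0) == 0
```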


B.5 Top-K Sampling

B.5.1 How It Works

Keep only the K tokens with the highest logits, zero out the rest, then sample from the survivors:

import torch
import torch.nn.functional as F

def top_k_sample(logits: torch.Tensor, k: int, temperature: float = 1.0) -> int:
    """
    logits:      1-D tensor of shape (vocab_size,)
    k:           number of top tokens to keep
    temperature: applied before softmax over the kept tokens
    """
    # 1. Take the k largest logits and their original vocabulary indices
    top_k_values, top_k_indices = torch.topk(logits, k)

    # 2. Apply temperature and normalize over the top-k only
    top_k_probs = F.softmax(top_k_values / temperature, dim=-1)

    # 3. Sample from the reduced distribution
    sampled_pos = torch.multinomial(top_k_probs, num_samples=1).item()
    return top_k_indices[sampled_pos].item()

B.5.2 Worked Example

Full distribution:
  A: 0.40, B: 0.30, C: 0.15, D: 0.08, E: 0.05, F: 0.02

Top-K with K=3:
  Kept:        A: 0.40, B: 0.30, C: 0.15
  Renormalized: A: 0.47, B: 0.35, C: 0.18

Sample only from {A, B, C}; D, E, F are excluded entirely.
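The renormalization is just division by the kept probability mass:

```python
import torch

probs = torch.tensor([0.40, 0.30, 0.15, 0.08, 0.05, 0.02])
kept = probs[:3]              # top-3: A, B, C
renorm = kept / kept.sum()    # divide by the kept mass, 0.85
print([round(p, 2) for p in renorm.tolist()])  # [0.47, 0.35, 0.18]
```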

B.5.3 Choosing K

K value   Effect
1         equivalent to greedy
10 – 50   common production range
100+      approaches unconstrained sampling

Rule of thumb: K = 40 to 50 is the default in many open-source inference stacks (LLaMA, Mistral defaults).

B.5.4 Top-K's Limitation

K is a fixed number, but the distribution shape varies wildly across token positions. Consider two extremes:

Peaked distribution:

A: 0.95, B: 0.03, C: 0.01, D: 0.005, ...
K=50 keeps 49 tokens with a combined probability < 0.05.
Those tail tokens should not be candidates at all.

Flat distribution:

A: 0.10, B: 0.09, C: 0.08, D: 0.08, E: 0.07, ...
K=50 may cut off tokens that are perfectly reasonable alternatives.

Fixed K does not adapt to the entropy of the distribution. That is the problem Top-P was designed to solve.


B.6 Top-P (Nucleus) Sampling

B.6.1 How It Works

Top-P keeps the smallest set of tokens whose cumulative probability reaches P, then samples from that set. The candidate set size adapts automatically.

import torch
import torch.nn.functional as F

def top_p_sample(logits: torch.Tensor, p: float, temperature: float = 1.0) -> int:
    """
    logits:      1-D tensor of shape (vocab_size,)
    p:           cumulative probability threshold (e.g. 0.9)
    temperature: applied before softmax
    """
    # 1. Apply temperature and compute probabilities
    probs = F.softmax(logits / temperature, dim=-1)

    # 2. Sort descending
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)

    # 3. Cumulative sum
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # 4. Find the first index where cumulative >= p
    #    Shift by one so we always include the token that pushes us over p
    cutoff_mask = (cumulative_probs - sorted_probs) >= p
    sorted_probs[cutoff_mask] = 0.0

    # 5. Renormalize and sample
    sorted_probs = sorted_probs / sorted_probs.sum()
    sampled_pos = torch.multinomial(sorted_probs, num_samples=1).item()
    return sorted_indices[sampled_pos].item()

B.6.2 Worked Example

Sorted distribution:
  A: 0.40, B: 0.30, C: 0.15, D: 0.08, E: 0.05, F: 0.02

Cumulative:
  0.40,   0.70,   0.85,   0.93,   0.98,   1.00

Top-P = 0.90:
  Keep A, B, C, D (cumulative 0.93 ≥ 0.90).
  Nucleus size = 4.
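The nucleus boundary can be computed mechanically from the cumulative sums (same toy distribution):

```python
import torch

sorted_probs = torch.tensor([0.40, 0.30, 0.15, 0.08, 0.05, 0.02])
cumulative = torch.cumsum(sorted_probs, dim=-1)

# Keep every token whose *preceding* cumulative mass is still below p,
# so the token that crosses the threshold is included.
p = 0.90
keep = (cumulative - sorted_probs) < p
print(int(keep.sum()))  # 4  → nucleus = {A, B, C, D}
```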

B.6.3 Top-P Adapts to Distribution Shape

This is what Top-K cannot do:

Peaked distribution:

A: 0.95, B: 0.03, ...
Top-P = 0.90 → nucleus = {A} only (size 1)

Flat distribution:

A: 0.10, B: 0.09, C: 0.08, ...
Top-P = 0.90 → nucleus = ~15 tokens (size adjusts automatically)

The nucleus contracts when the model is confident and expands when the model is uncertain. That is the right behavior.

B.6.4 Choosing P

P value     Effect
0.1 – 0.5   small nucleus, conservative
0.8 – 0.95  standard production range
1.0         equivalent to unconstrained sampling

Rule of thumb: P = 0.9 or P = 0.95 is the default in most production configurations. The Holtzman et al. 2019 paper that introduced nucleus sampling recommended P = 0.95 for open-ended generation.


B.7 Beam Search

B.7.1 How It Works

Instead of committing to one token at each step, beam search maintains B candidate sequences in parallel — the "beam". At every step, each beam is extended by every possible next token, and only the top-B sequences (by cumulative log-probability) survive.

B.7.2 Algorithm

from typing import List, Tuple

def beam_search(
    model,
    prompt: List[int],
    beam_width: int,
    max_length: int,
    alpha: float = 0.6,        # length normalization exponent
) -> List[int]:
    """
    Returns the highest-scoring sequence under length-normalized beam search.
    model: callable that takes token IDs and returns log-probabilities over vocab.
    """
    # Each beam is (sequence, cumulative_log_prob).
    # (Sketch only: no end-of-sequence handling; every beam keeps extending.)
    beams: List[Tuple[List[int], float]] = [(prompt[:], 0.0)]

    for step in range(max_length):
        all_candidates: List[Tuple[List[int], float]] = []

        for seq, score in beams:
            log_probs = model(seq)   # shape: (vocab_size,)

            # Expand: consider every possible next token
            for token_id, lp in enumerate(log_probs):
                new_seq   = seq + [token_id]
                new_score = score + lp
                all_candidates.append((new_seq, new_score))

        # Apply length normalization before ranking
        def length_normalized_score(candidate):
            seq, score = candidate
            length = len(seq) - len(prompt)
            return score / (max(length, 1) ** alpha)

        all_candidates.sort(key=length_normalized_score, reverse=True)
        beams = all_candidates[:beam_width]

    # Return the top sequence
    return beams[0][0]

B.7.3 Length Normalization

Without correction, beam search favors shorter sequences because each step multiplies probabilities (< 1), so longer sequences accumulate lower raw scores. The standard fix:

\text{score}_{\text{norm}} = \frac{\log P(\text{sequence})}{\text{length}^\alpha}

where \alpha \approx 0.6 is a common default (used in Google's original Neural Machine Translation system). Higher α corrects the short-sequence bias more aggressively, shifting preference toward longer outputs.
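A numeric example of the effect, with illustrative log-probabilities and α = 0.6:

```python
alpha = 0.6

def normalized(log_p: float, length: int) -> float:
    # Length-normalized score: divide the cumulative log-probability
    # by length^alpha (log_p is negative, so dividing shrinks the penalty).
    return log_p / (length ** alpha)

# Sequence A: 5 tokens, log P = -6.0. Sequence B: 10 tokens, log P = -9.0.
raw_a, raw_b = -6.0, -9.0
norm_a = normalized(raw_a, 5)    # ≈ -2.28
norm_b = normalized(raw_b, 10)   # ≈ -2.26

assert raw_a > raw_b    # on raw score, the shorter sequence wins
assert norm_b > norm_a  # after normalization, the longer sequence wins
```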

B.7.4 When Beam Search Is Appropriate

Strong fit:

  • Machine translation (beam width 4–6 is standard)
  • Abstractive summarization
  • Structured generation with a known grammar (constrained decoding)

Poor fit:

  • Open-ended chat and instruction following — beam search produces fluent but bland outputs that avoid all risk
  • Any task where diversity across multiple generations matters

Modern LLM serving stacks (vLLM, TGI, llama.cpp) do not use beam search by default for chat models; they use Top-P + Temperature sampling. Beam search remains in NLP pipelines where the output has a single objectively correct form.


B.8 Combination Strategies

B.8.1 The Two Common Combinations

In practice you almost always combine multiple strategies. Here are the two standard patterns:

import torch
import torch.nn.functional as F

# --- Helper: apply Top-K filter ---
def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Zero out all logits except the top k."""
    if k <= 0:
        return logits
    top_k_values = torch.topk(logits, k).values
    threshold = top_k_values[-1]
    return torch.where(logits >= threshold, logits, torch.full_like(logits, float('-inf')))

# --- Helper: apply Top-P filter ---
def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Zero out all logits outside the nucleus."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    mask = (cumulative - sorted_probs) >= p
    sorted_probs[mask] = 0.0
    # Scatter back to original order
    filtered = torch.zeros_like(probs)
    filtered.scatter_(0, sorted_indices, sorted_probs)
    return torch.log(filtered + 1e-10)  # back to logit space (approx)

# --- Pattern 1: Top-P + Temperature ---
def decode_top_p_temperature(
    logits: torch.Tensor,
    p: float = 0.9,
    temperature: float = 0.7,
) -> int:
    """Standard open-ended generation config."""
    filtered = top_p_filter(logits, p)
    probs = F.softmax(filtered / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# --- Pattern 2: Top-K + Top-P + Temperature ---
def decode_top_k_top_p_temperature(
    logits: torch.Tensor,
    k: int = 50,
    p: float = 0.9,
    temperature: float = 0.7,
) -> int:
    """Hugging Face generate() default-style config."""
    # First narrow to top-k, then apply nucleus threshold, then temperature
    filtered = top_k_filter(logits, k)
    filtered = top_p_filter(filtered, p)
    probs = F.softmax(filtered / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

The ordering matters: Top-K first, then Top-P. This way Top-K sets a hard ceiling on candidates and Top-P can only make the nucleus smaller, never larger.

B.8.2 Default Configurations in the Wild

Different inference stacks ship different defaults. These are the de facto settings you will encounter when you first hit an API:

Stack / API                temperature  top_p  top_k  Notes
OpenAI API (chat)          1.0          1.0    –      both defaults are "off"; users expected to tune
Anthropic Claude API       1.0          –      –      temperature defaults to 1.0; tune per task
Hugging Face generate()    1.0          0.9    50     Top-K and Top-P both active by default
llama.cpp                  0.8          0.95   40     conservative defaults for local use
LLaMA reference inference  0.6          0.9    –      Meta's published defaults for Llama 3
Mistral reference          0.7          1.0    50     Top-K active, Top-P off by default
Recommended per-task settings:

Task                   Temperature  Top-P  Top-K  Notes
Code generation        0.2          0.95   –      low T, wide nucleus
Factual Q&A            0 or 0.3     –      –      greedy or near-greedy
Instruction following  0.7          0.9    –      standard assistant config
Creative writing       0.9 – 1.0    0.9    –      let the model explore
Brainstorming          1.0 – 1.2    0.95   –      more temperature, wider nucleus
Machine translation    –            –      –      beam search still preferred (beam_width=4, α=0.6)

B.9 Repetition Penalty

B.9.1 The Problem

LLMs can fall into loops:

"The agent opened the pull request. The agent opened the pull request.
The agent opened the pull request. The agent..."

This is a greedy decoding failure mode, but it also appears in sampling when the high-probability token happens to be the one that was just generated. The fix is to penalize tokens that have already appeared.

B.9.2 Implementation

import torch

def apply_repetition_penalty(
    logits: torch.Tensor,
    generated_token_ids: list,
    penalty: float = 1.2,
) -> torch.Tensor:
    """
    For each token that has already been generated, reduce its logit.

    Tokens with positive logits are divided by penalty (reducing them).
    Tokens with negative logits are multiplied by penalty (pushing them more negative).
    penalty = 1.0 → no change.
    penalty = 1.2 → common default (cuts a positive logit by ~17%; the effect
                    on probability depends on the other logits).
    """
    logits = logits.clone()
    for token_id in set(generated_token_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits

The set() deduplication means frequency does not matter — seeing a token once applies the same penalty as seeing it ten times. For frequency-sensitive behavior, see B.11.
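The positive/negative asymmetry is easy to check on toy logits (a standalone restatement of the loop above):

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5])
penalty = 1.2
seen = [0, 1]  # token IDs 0 and 1 were already generated

out = logits.clone()
for tid in set(seen):
    if out[tid] > 0:
        out[tid] /= penalty   # 2.0 → 1.667: positive logit shrinks
    else:
        out[tid] *= penalty   # -1.0 → -1.2: negative logit gets more negative

print([round(v, 3) for v in out.tolist()])  # [1.667, -1.2, 0.5]
```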


B.10 Presence Penalty

B.10.1 What It Does

Presence penalty applies a constant subtraction to every token that has appeared anywhere in the generated text, regardless of how many times it appeared.

import torch

def apply_presence_penalty(
    logits: torch.Tensor,
    generated_token_ids: list,
    presence_penalty: float = 0.6,
) -> torch.Tensor:
    """
    Subtract presence_penalty from the logit of every token seen at least once.

    presence_penalty = 0   → no effect
    presence_penalty = 0.6 → common default (OpenAI API)
    presence_penalty = 2.0 → strong avoidance of any repeated token
    """
    logits = logits.clone()
    for token_id in set(generated_token_ids):
        logits[token_id] -= presence_penalty
    return logits

This is the OpenAI API presence_penalty parameter. Presence penalty encourages the model to introduce new topics — it does not scale with repetition count.


B.11 Frequency Penalty

B.11.1 What It Does

Frequency penalty applies a penalty proportional to how many times each token has already appeared. The more a token repeats, the more it is suppressed.

import torch
from collections import Counter

def apply_frequency_penalty(
    logits: torch.Tensor,
    generated_token_ids: list,
    frequency_penalty: float = 0.5,
) -> torch.Tensor:
    """
    Subtract frequency_penalty * count[token] from each token's logit.

    frequency_penalty = 0   → no effect
    frequency_penalty = 0.5 → common default
    frequency_penalty = 2.0 → strong suppression of repeated tokens
    """
    logits = logits.clone()
    token_counts = Counter(generated_token_ids)
    for token_id, count in token_counts.items():
        logits[token_id] -= frequency_penalty * count
    return logits

This is the OpenAI API frequency_penalty parameter. Unlike presence penalty, frequency penalty keeps subtracting the more times a token appears — a token seen ten times gets ten times the penalty of a token seen once.
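The contrast with presence penalty shows up as soon as a token repeats (toy logits; 0.6 and 0.5 are the defaults quoted above):

```python
import torch
from collections import Counter

logits = torch.tensor([1.0, 1.0])
generated = [0, 0, 0]  # token 0 appeared three times, token 1 never

# Presence penalty: flat subtraction, once per distinct token seen.
presence = logits.clone()
for tid in set(generated):
    presence[tid] -= 0.6            # 1.0 → 0.4, regardless of count

# Frequency penalty: subtraction scales with the count.
frequency = logits.clone()
for tid, count in Counter(generated).items():
    frequency[tid] -= 0.5 * count   # 1.0 → 1.0 - 1.5 = -0.5

print([round(v, 2) for v in presence.tolist()])   # [0.4, 1.0]
print([round(v, 2) for v in frequency.tolist()])  # [-0.5, 1.0]
```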

B.11.2 Penalty Parameter Reference

Parameter           Range      Neutral value         Common default
repetition_penalty  1.0 – 1.5  1.0 (multiplicative)  1.2
presence_penalty    0 – 2.0    0 (additive)          0.6
frequency_penalty   0 – 2.0    0 (additive)          0.5

All three can be combined. In practice, presence penalty and frequency penalty are more common in API-facing products; repetition penalty is more common in open-source local inference stacks.


B.12 Appendix Summary

Strategy            Core idea                              When to use
Greedy              argmax every step                      code, deterministic tasks
Random sampling     multinomial draw                       rarely alone; base for the rest
Temperature         scale logits by 1/T before softmax     always; tune first
Top-K               keep top K candidates                  paired with Top-P for belt-and-suspenders
Top-P               keep nucleus at cumulative P           preferred over Top-K alone
Beam search         track B best sequences                 translation, constrained generation
Repetition penalty  multiplicative penalty on seen tokens  local inference defaults
Presence penalty    flat subtract for any seen token       API-level topic diversity
Frequency penalty   count-scaled subtract for seen tokens  API-level repetition control

Further Reading

  • The Curious Case of Neural Text Degeneration (Holtzman et al., 2019) — the paper that introduced nucleus (Top-P) sampling and named the degeneration problem
  • Hierarchical Neural Story Generation (Fan et al., 2018) — original Top-K sampling applied to story generation
  • CTRL: A Conditional Transformer Language Model for Controllable Generation (Keskar et al., 2019) — discusses repetition penalty in the context of controllable generation

Decoding decisions are a product surface. The defaults your serving stack ships with will quietly shape how your model feels to users. Appendix C answers the questions that come up most often.

Cite this page
Zhang, Wayland (2026). Appendix B: Decoding Strategies. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/appendix-b-decoding-strategies
@incollection{zhang2026transformer_appendix_b_decoding_strategies,
  author = {Zhang, Wayland},
  title = {Appendix B: Decoding Strategies},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/appendix-b-decoding-strategies}
}