One-sentence summary: Decoding strategy is the choice that converts a probability distribution over the entire vocabulary into a single token — greedy picks the peak, sampling rolls the dice, temperature reshapes the curve, and top-k/top-p trim the tails.


B.1 Why Decoding Strategy Matters

B.1.1 From Logits to Token

Every forward pass through a Transformer decoder ends at the same place: a logits vector with one entry per token in the vocabulary.

logits = [2.1, -0.5, 1.3, 0.8, ..., -1.2]   # length = vocab_size

Each value is an unnormalized score. Before you can sample, you convert to a probability distribution through Softmax:

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}

Step one — Softmax — is fixed. Step two — picking a token from that distribution — is where decoding strategy lives. This appendix covers every practical option.
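To make the two steps concrete, here is a minimal sketch with a toy five-token vocabulary (the logit values are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.1, -0.5, 1.3, 0.8, -1.2])  # toy 5-token vocab

# Step one: fixed -- normalize the logits to a probability distribution.
probs = F.softmax(logits, dim=-1)

# Step two: the decoding strategy -- here, two of the options covered below.
greedy_token = torch.argmax(probs).item()                       # always 0
sampled_token = torch.multinomial(probs, num_samples=1).item()  # varies per call
```

Everything in this appendix is a variation on that second line.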

B.1.2 Strategy Trade-offs at a Glance

Strategy         Deterministic  Diversity  Risk
Greedy           high           low        repetition, monotony
Random sampling  low            high       incoherent tokens
Top-K            medium         medium     fixed K mismatches distribution shape
Top-P            medium         medium     well-balanced in practice
Beam search      high           low        expensive, safe but dull

B.2 Greedy Decoding

B.2.1 How It Works

The simplest strategy: pick the highest-probability token at every step.

import torch
import torch.nn.functional as F

def greedy_decode(logits: torch.Tensor) -> int:
    """
    logits: 1-D tensor of shape (vocab_size,)
    Returns the token ID with the highest logit.
    """
    return torch.argmax(logits).item()

You do not even need Softmax here — argmax on logits gives the same answer as argmax on probabilities, because Softmax is monotone.
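A quick sanity check of that claim, using toy logits:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.1, -0.5, 1.3, 0.8, -1.2])

# Softmax is strictly increasing, so it cannot change which entry is largest.
assert torch.argmax(logits).item() == torch.argmax(F.softmax(logits, dim=-1)).item()
```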

B.2.2 Worked Example

Prompt: "The agent opened a pull"
Step 1:
  logits → softmax → {request: 0.51, comment: 0.22, merge: 0.12, ...}
  greedy picks: "request"

Prompt: "The agent opened a pull request"
Step 2:
  logits → softmax → {and: 0.34, .": 0.29, to: 0.18, ...}
  greedy picks: "and"

Final output: "The agent opened a pull request and ..."

Same input always produces the same output. No randomness.

B.2.3 When to Use Greedy

Good fit:

  • Code completion where you need a single correct answer
  • Factual lookups and structured extraction
  • Any pipeline where downstream code parses the output deterministically

Poor fit:

  • Creative generation — greedy loops badly once it falls into a high-probability rut
  • Dialogue where response variety matters
  • Tasks where the globally best sequence is not the locally best token at each step

B.2.4 Greedy's Failure Mode: Repetition

Because greedy always follows the mode of the distribution, it can spiral into loops like:

"The function returns the value. The function returns the value. The function..."

This is not a bug in the model — it is a bug in the decoding choice. The repetition penalties in section B.9 exist precisely to break this.


B.3 Random Sampling

B.3.1 How It Works

Instead of taking the argmax, draw one token at random from the full probability distribution.

import torch
import torch.nn.functional as F

def random_sample(logits: torch.Tensor) -> int:
    """
    logits: 1-D tensor of shape (vocab_size,)
    Returns a token ID sampled from the softmax distribution.
    """
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

B.3.2 Worked Example

probs = {request: 0.40, comment: 0.30, review: 0.15, merge: 0.10, issue: 0.05}

Each run might return a different token:
  - "request"  (40% probability)
  - "comment"  (30% probability)
  - "review"   (15% probability)
  - "merge"    (10% probability)
  - "issue"    ( 5% probability)

The same prompt can produce different completions on every call.

B.3.3 Trade-offs

Advantages:

  • Diverse, varied outputs
  • Avoids deterministic repetition loops
  • Reflects the model's full uncertainty about what comes next

Disadvantages:

  • Low-probability tokens can be selected, producing incoherent text
  • Outputs are not reproducible without fixing the random seed
  • Unmodified raw sampling is rarely used in production; you almost always pair it with temperature, top-k, or top-p
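The reproducibility point is worth demonstrating: fixing the generator seed makes sampling repeatable (a minimal sketch using PyTorch's global seeding; real serving code typically uses a per-request `torch.Generator` instead):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

def sample(seed: int) -> int:
    torch.manual_seed(seed)  # reseed before every draw
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Same seed, same draw; different seeds may differ.
assert sample(0) == sample(0)
assert sample(42) == sample(42)
```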

B.4 Temperature

B.4.1 How It Works

Temperature rescales the logits before Softmax, sharpening or flattening the distribution:

import torch
import torch.nn.functional as F

def softmax_with_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """
    temperature > 1  →  flatter distribution (more random)
    temperature < 1  →  sharper distribution (more deterministic)
    temperature → 0  →  equivalent to greedy (argmax)
    """
    scaled_logits = logits / temperature
    return F.softmax(scaled_logits, dim=-1)

The math is straightforward: divide every logit by T before exponentiating. High T suppresses differences between logits; low T amplifies them.

B.4.2 Numerical Table

Using a fixed logits array [2.0, 1.0, 0.5]:

Temperature  Softmax result (approx.)  Character
0.1          [1.00, 0.00, 0.00]        almost deterministic
0.5          [0.84, 0.11, 0.04]        strong preference for top token
1.0          [0.63, 0.23, 0.14]        raw model distribution
2.0          [0.48, 0.29, 0.23]        noticeably flatter
10.0         [0.36, 0.33, 0.31]        nearly uniform

Note: values are softmax(logits / T), rounded to two decimal places.
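The table can be reproduced directly from the formula:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])

for t in [0.1, 0.5, 1.0, 2.0, 10.0]:
    probs = F.softmax(logits / t, dim=-1)
    print(t, [round(p, 2) for p in probs.tolist()])
# At T=1.0 this prints [0.63, 0.23, 0.14]; at T=10.0, [0.36, 0.33, 0.31].
```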

Intuition:

  • T < 1: the model becomes more decisive — the likely token gets even more probability mass
  • T = 1: no change; the raw distribution is preserved
  • T > 1: the model becomes more exploratory — low-probability tokens get a bigger slice

B.4.3 Choosing Temperature

Temperature  Effect                       Typical use
0 (or → 0)   greedy, fully deterministic  code generation, structured extraction
0.1 – 0.3    very confident               factual Q&A, retrieval
0.5 – 0.7    confident with variation     general assistant dialogue
0.8 – 1.0    balanced                     creative writing with quality guardrails
1.0 – 1.5    exploratory                  brainstorming, story drafts
> 1.5        highly random                experimental only; output often degrades

B.4.4 Temperature = 0

As T approaches zero, the scaled logits diverge to ±∞, and Softmax collapses to a point mass on the argmax. Many APIs treat temperature=0 as an exact synonym for greedy decoding.
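Since dividing by zero is undefined, practical implementations special-case it. A sketch of that convention (not any particular API's code):

```python
import torch
import torch.nn.functional as F

def sample_with_temperature(logits: torch.Tensor, temperature: float) -> int:
    if temperature == 0.0:
        # Convention: T = 0 means greedy decoding.
        return torch.argmax(logits).item()
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

assert sample_with_temperature(torch.tensor([2.0, 1.0, 0.5]), 0.0) == 0
```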


B.5 Top-K Sampling

B.5.1 How It Works

Keep only the K tokens with the highest logits, zero out the rest, then sample from the survivors:

import torch
import torch.nn.functional as F

def top_k_sample(logits: torch.Tensor, k: int, temperature: float = 1.0) -> int:
    """
    logits:      1-D tensor of shape (vocab_size,)
    k:           number of top tokens to keep
    temperature: applied before softmax over the kept tokens
    """
    # 1. Take the k largest logits and their original vocabulary indices
    top_k_values, top_k_indices = torch.topk(logits, k)

    # 2. Apply temperature and normalize over the top-k only
    top_k_probs = F.softmax(top_k_values / temperature, dim=-1)

    # 3. Sample from the reduced distribution
    sampled_pos = torch.multinomial(top_k_probs, num_samples=1).item()
    return top_k_indices[sampled_pos].item()

B.5.2 Worked Example

Full distribution:
  A: 0.40, B: 0.30, C: 0.15, D: 0.08, E: 0.05, F: 0.02

Top-K with K=3:
  Kept:        A: 0.40, B: 0.30, C: 0.15
  Renormalized: A: 0.47, B: 0.35, C: 0.18

Sample only from {A, B, C}; D, E, F are excluded entirely.
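The renormalization is just division by the kept probability mass:

```python
import torch

probs = torch.tensor([0.40, 0.30, 0.15, 0.08, 0.05, 0.02])
kept = probs[:3]              # top-3: A, B, C
renorm = kept / kept.sum()    # divide by the kept mass, 0.85
print([round(p, 2) for p in renorm.tolist()])  # [0.47, 0.35, 0.18]
```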

B.5.3 Choosing K

K value   Effect
1         equivalent to greedy
10 – 50   common production range
100+      approaches unconstrained sampling

Rule of thumb: K = 40 to 50 is the default in many open-source inference stacks (LLaMA, Mistral defaults).

B.5.4 Top-K's Limitation

K is a fixed number, but the distribution shape varies wildly across token positions. Consider two extremes:

Peaked distribution:

A: 0.95, B: 0.03, C: 0.01, D: 0.005, ...
K=50 keeps 49 tokens with a combined probability < 0.05.
Those tail tokens should not be candidates at all.

Flat distribution:

A: 0.10, B: 0.09, C: 0.08, D: 0.08, E: 0.07, ...
K=50 may cut off tokens that are perfectly reasonable alternatives.

Fixed K does not adapt to the entropy of the distribution. That is the problem Top-P was designed to solve.


B.6 Top-P (Nucleus) Sampling

B.6.1 How It Works

Top-P keeps the smallest set of tokens whose cumulative probability reaches P, then samples from that set. The candidate set size adapts automatically.

import torch
import torch.nn.functional as F

def top_p_sample(logits: torch.Tensor, p: float, temperature: float = 1.0) -> int:
    """
    logits:      1-D tensor of shape (vocab_size,)
    p:           cumulative probability threshold (e.g. 0.9)
    temperature: applied before softmax
    """
    # 1. Apply temperature and compute probabilities
    probs = F.softmax(logits / temperature, dim=-1)

    # 2. Sort descending
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)

    # 3. Cumulative sum
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # 4. Find the first index where cumulative >= p
    #    Shift by one so we always include the token that pushes us over p
    cutoff_mask = (cumulative_probs - sorted_probs) >= p
    sorted_probs[cutoff_mask] = 0.0

    # 5. Renormalize and sample
    sorted_probs = sorted_probs / sorted_probs.sum()
    sampled_pos = torch.multinomial(sorted_probs, num_samples=1).item()
    return sorted_indices[sampled_pos].item()

B.6.2 Worked Example

Sorted distribution:
  A: 0.40, B: 0.30, C: 0.15, D: 0.08, E: 0.05, F: 0.02

Cumulative:
  0.40,   0.70,   0.85,   0.93,   0.98,   1.00

Top-P = 0.90:
  Keep A, B, C, D (cumulative 0.93 ≥ 0.90).
  Nucleus size = 4.
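The nucleus boundary can be computed mechanically from the cumulative sums (same toy distribution):

```python
import torch

sorted_probs = torch.tensor([0.40, 0.30, 0.15, 0.08, 0.05, 0.02])
cumulative = torch.cumsum(sorted_probs, dim=-1)

# Keep every token whose *preceding* cumulative mass is still below p,
# so the token that crosses the threshold is included.
p = 0.90
keep = (cumulative - sorted_probs) < p
print(int(keep.sum()))  # 4  → nucleus = {A, B, C, D}
```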

B.6.3 Top-P Adapts to Distribution Shape

This is what Top-K cannot do:

Peaked distribution:

A: 0.95, B: 0.03, ...
Top-P = 0.90 → nucleus = {A} only (size 1)

Flat distribution:

A: 0.10, B: 0.09, C: 0.08, ...
Top-P = 0.90 → nucleus = ~15 tokens (size adjusts automatically)

The nucleus contracts when the model is confident and expands when the model is uncertain. That is the right behavior.

B.6.4 Choosing P

P value     Effect
0.1 – 0.5   small nucleus, conservative
0.8 – 0.95  standard production range
1.0         equivalent to unconstrained sampling

Rule of thumb: P = 0.9 or P = 0.95 is the default in most production configurations. The Holtzman et al. 2019 paper that introduced nucleus sampling recommended P = 0.95 for open-ended generation.


B.7 Beam Search

B.7.1 How It Works

Instead of committing to one token at each step, beam search maintains B candidate sequences in parallel — the "beam". At every step, each beam is extended by every possible next token, and only the top-B sequences (by cumulative log-probability) survive.

B.7.2 Algorithm

from typing import List, Tuple

def beam_search(
    model,
    prompt: List[int],
    beam_width: int,
    max_length: int,
    alpha: float = 0.6,        # length normalization exponent
) -> List[int]:
    """
    Returns the highest-scoring sequence under length-normalized beam search.
    model: callable that takes token IDs and returns log-probabilities over vocab.
    """
    # Each beam is (sequence, cumulative_log_prob).
    # (Sketch only: no end-of-sequence handling; every beam keeps extending.)
    beams: List[Tuple[List[int], float]] = [(prompt[:], 0.0)]

    for step in range(max_length):
        all_candidates: List[Tuple[List[int], float]] = []

        for seq, score in beams:
            log_probs = model(seq)   # shape: (vocab_size,)

            # Expand: consider every possible next token
            for token_id, lp in enumerate(log_probs):
                new_seq   = seq + [token_id]
                new_score = score + lp
                all_candidates.append((new_seq, new_score))

        # Apply length normalization before ranking
        def length_normalized_score(candidate):
            seq, score = candidate
            length = len(seq) - len(prompt)
            return score / (max(length, 1) ** alpha)

        all_candidates.sort(key=length_normalized_score, reverse=True)
        beams = all_candidates[:beam_width]

    # Return the top sequence
    return beams[0][0]

B.7.3 Length Normalization

Without correction, beam search favors shorter sequences because each step multiplies probabilities (< 1), so longer sequences accumulate lower raw scores. The standard fix:

\text{score}_{\text{norm}} = \frac{\log P(\text{sequence})}{\text{length}^\alpha}

where \alpha \approx 0.6 is a common default (used in Google's original Neural Machine Translation system). Higher α corrects the short-sequence bias more aggressively, shifting preference toward longer outputs.
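A numeric example of the effect, with illustrative log-probabilities and α = 0.6:

```python
alpha = 0.6

def normalized(log_p: float, length: int) -> float:
    # Length-normalized score: divide the cumulative log-probability
    # by length^alpha (log_p is negative, so dividing shrinks the penalty).
    return log_p / (length ** alpha)

# Sequence A: 5 tokens, log P = -6.0. Sequence B: 10 tokens, log P = -9.0.
raw_a, raw_b = -6.0, -9.0
norm_a = normalized(raw_a, 5)    # ≈ -2.28
norm_b = normalized(raw_b, 10)   # ≈ -2.26

assert raw_a > raw_b    # on raw score, the shorter sequence wins
assert norm_b > norm_a  # after normalization, the longer sequence wins
```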

B.7.4 When Beam Search Is Appropriate

Strong fit:

  • Machine translation (beam width 4–6 is standard)
  • Abstractive summarization
  • Structured generation with a known grammar (constrained decoding)

Poor fit:

  • Open-ended chat and instruction following — beam search produces fluent but bland outputs that avoid all risk
  • Any task where diversity across multiple generations matters

Modern LLM serving stacks (vLLM, TGI, llama.cpp) do not use beam search by default for chat models; they use Top-P + Temperature sampling. Beam search remains in NLP pipelines where the output has a single objectively correct form.


B.8 Combination Strategies

B.8.1 The Two Common Combinations

In practice you almost always combine multiple strategies. Here are the two standard patterns:

import torch
import torch.nn.functional as F

# --- Helper: apply Top-K filter ---
def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Zero out all logits except the top k."""
    if k <= 0:
        return logits
    top_k_values = torch.topk(logits, k).values
    threshold = top_k_values[-1]
    return torch.where(logits >= threshold, logits, torch.full_like(logits, float('-inf')))

# --- Helper: apply Top-P filter ---
def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Zero out all logits outside the nucleus."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    mask = (cumulative - sorted_probs) >= p
    sorted_probs[mask] = 0.0
    # Scatter back to original order
    filtered = torch.zeros_like(probs)
    filtered.scatter_(0, sorted_indices, sorted_probs)
    return torch.log(filtered + 1e-10)  # back to logit space (approx)

# --- Pattern 1: Top-P + Temperature ---
def decode_top_p_temperature(
    logits: torch.Tensor,
    p: float = 0.9,
    temperature: float = 0.7,
) -> int:
    """Standard open-ended generation config."""
    filtered = top_p_filter(logits, p)
    probs = F.softmax(filtered / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# --- Pattern 2: Top-K + Top-P + Temperature ---
def decode_top_k_top_p_temperature(
    logits: torch.Tensor,
    k: int = 50,
    p: float = 0.9,
    temperature: float = 0.7,
) -> int:
    """Hugging Face generate() default-style config."""
    # First narrow to top-k, then apply nucleus threshold, then temperature
    filtered = top_k_filter(logits, k)
    filtered = top_p_filter(filtered, p)
    probs = F.softmax(filtered / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

The ordering matters: Top-K first, then Top-P. This way Top-K sets a hard ceiling on candidates and Top-P can only make the nucleus smaller, never larger.

B.8.2 Default Configurations in the Wild

Different inference stacks ship different defaults. These are the de facto settings you will encounter when you first hit an API:

Stack / API                temperature  top_p  top_k  Notes
OpenAI API (chat)          1.0          1.0    –      both defaults are "off"; users expected to tune
Anthropic Claude API       1.0          –      –      temperature defaults to 1.0; tune per task
Hugging Face generate()    1.0          0.9    50     Top-K and Top-P both active by default
llama.cpp                  0.8          0.95   40     conservative defaults for local use
LLaMA reference inference  0.6          0.9    –      Meta's published defaults for Llama 3
Mistral reference          0.7          1.0    50     Top-K active, Top-P off by default
Recommended per-task settings:

Task                   Temperature  Top-P  Top-K  Notes
Code generation        0.2          0.95   –      low T, wide nucleus
Factual Q&A            0 or 0.3     –      –      greedy or near-greedy
Instruction following  0.7          0.9    –      standard assistant config
Creative writing       0.9 – 1.0    0.9    –      let the model explore
Brainstorming          1.0 – 1.2    0.95   –      more temperature, wider nucleus
Machine translation    –            –      –      beam search still preferred (beam_width=4, α=0.6)

B.9 Repetition Penalty

B.9.1 The Problem

LLMs can fall into loops:

"The agent opened the pull request. The agent opened the pull request.
The agent opened the pull request. The agent..."

This is a greedy decoding failure mode, but it also appears in sampling when the high-probability token happens to be the one that was just generated. The fix is to penalize tokens that have already appeared.

B.9.2 Implementation

import torch

def apply_repetition_penalty(
    logits: torch.Tensor,
    generated_token_ids: list,
    penalty: float = 1.2,
) -> torch.Tensor:
    """
    For each token that has already been generated, reduce its logit.

    Tokens with positive logits are divided by penalty (reducing them).
    Tokens with negative logits are multiplied by penalty (pushing them more negative).
    penalty = 1.0 → no change.
    penalty = 1.2 → common default (cuts a positive logit by ~17%; the effect
                    on probability depends on the other logits).
    """
    logits = logits.clone()
    for token_id in set(generated_token_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits

The set() deduplication means frequency does not matter — seeing a token once applies the same penalty as seeing it ten times. For frequency-sensitive behavior, see B.11.
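The positive/negative asymmetry is easy to check on toy logits (a standalone restatement of the loop above):

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5])
penalty = 1.2
seen = [0, 1]  # token IDs 0 and 1 were already generated

out = logits.clone()
for tid in set(seen):
    if out[tid] > 0:
        out[tid] /= penalty   # 2.0 → 1.667: positive logit shrinks
    else:
        out[tid] *= penalty   # -1.0 → -1.2: negative logit gets more negative

print([round(v, 3) for v in out.tolist()])  # [1.667, -1.2, 0.5]
```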


B.10 Presence Penalty

B.10.1 What It Does

Presence penalty applies a constant subtraction to every token that has appeared anywhere in the generated text, regardless of how many times it appeared.

import torch

def apply_presence_penalty(
    logits: torch.Tensor,
    generated_token_ids: list,
    presence_penalty: float = 0.6,
) -> torch.Tensor:
    """
    Subtract presence_penalty from the logit of every token seen at least once.

    presence_penalty = 0   → no effect
    presence_penalty = 0.6 → common default (OpenAI API)
    presence_penalty = 2.0 → strong avoidance of any repeated token
    """
    logits = logits.clone()
    for token_id in set(generated_token_ids):
        logits[token_id] -= presence_penalty
    return logits

This is the OpenAI API presence_penalty parameter. Presence penalty encourages the model to introduce new topics — it does not scale with repetition count.


B.11 Frequency Penalty

B.11.1 What It Does

Frequency penalty applies a penalty proportional to how many times each token has already appeared. The more a token repeats, the more it is suppressed.

import torch
from collections import Counter

def apply_frequency_penalty(
    logits: torch.Tensor,
    generated_token_ids: list,
    frequency_penalty: float = 0.5,
) -> torch.Tensor:
    """
    Subtract frequency_penalty * count[token] from each token's logit.

    frequency_penalty = 0   → no effect
    frequency_penalty = 0.5 → common default
    frequency_penalty = 2.0 → strong suppression of repeated tokens
    """
    logits = logits.clone()
    token_counts = Counter(generated_token_ids)
    for token_id, count in token_counts.items():
        logits[token_id] -= frequency_penalty * count
    return logits

This is the OpenAI API frequency_penalty parameter. Unlike presence penalty, frequency penalty keeps subtracting the more times a token appears — a token seen ten times gets ten times the penalty of a token seen once.
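The contrast with presence penalty shows up as soon as a token repeats (toy logits; 0.6 and 0.5 are the defaults quoted above):

```python
import torch
from collections import Counter

logits = torch.tensor([1.0, 1.0])
generated = [0, 0, 0]  # token 0 appeared three times, token 1 never

# Presence penalty: flat subtraction, once per distinct token seen.
presence = logits.clone()
for tid in set(generated):
    presence[tid] -= 0.6            # 1.0 → 0.4, regardless of count

# Frequency penalty: subtraction scales with the count.
frequency = logits.clone()
for tid, count in Counter(generated).items():
    frequency[tid] -= 0.5 * count   # 1.0 → 1.0 - 1.5 = -0.5

print([round(v, 2) for v in presence.tolist()])   # [0.4, 1.0]
print([round(v, 2) for v in frequency.tolist()])  # [-0.5, 1.0]
```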

B.11.2 Penalty Parameter Reference

Parameter           Range      Neutral value         Common default
repetition_penalty  1.0 – 1.5  1.0 (multiplicative)  1.2
presence_penalty    0 – 2.0    0 (additive)          0.6
frequency_penalty   0 – 2.0    0 (additive)          0.5

All three can be combined. In practice, presence penalty and frequency penalty are more common in API-facing products; repetition penalty is more common in open-source local inference stacks.


B.12 Appendix Summary

Strategy            Core idea                              When to use
Greedy              argmax every step                      code, deterministic tasks
Random sampling     multinomial draw                       rarely alone; base for the rest
Temperature         scale logits by 1/T before softmax     always; tune first
Top-K               keep top K candidates                  paired with Top-P for belt-and-suspenders
Top-P               keep nucleus at cumulative P           preferred over Top-K alone
Beam search         track B best sequences                 translation, constrained generation
Repetition penalty  multiplicative penalty on seen tokens  local inference defaults
Presence penalty    flat subtract for any seen token       API-level topic diversity
Frequency penalty   count-scaled subtract for seen tokens  API-level repetition control

Further Reading

  • The Curious Case of Neural Text Degeneration (Holtzman et al., 2019) — the paper that introduced nucleus (Top-P) sampling and named the degeneration problem
  • Hierarchical Neural Story Generation (Fan et al., 2018) — original Top-K sampling applied to story generation
  • CTRL: A Conditional Transformer Language Model for Controllable Generation (Keskar et al., 2019) — discusses repetition penalty in the context of controllable generation

Decoding decisions are a product surface. The defaults your serving stack ships with will quietly shape how your model feels to users. Appendix C answers the questions that come up most often.

Cite this page
Zhang, Wayland (2026). Appendix B: Decoding Strategies. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/appendix-b-decoding-strategies
@incollection{zhang2026transformer_appendix_b_decoding_strategies,
  author = {Zhang, Wayland},
  title = {Appendix B: Decoding Strategies},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/appendix-b-decoding-strategies}
}