One-sentence summary: Positional encoding evolved from "add an absolute vector before Attention" to "rotate Q and K inside Attention" and "penalize Attention by distance," with each generation buying longer context and better relative-position behavior.
25.1 The Big Picture
In Chapter 5 we met the original Sinusoidal encoding from the 2017 paper. It was clever and it worked well enough for 512-token sequences. Fast-forward to 2024 and production models routinely handle 128k tokens. That seven-year gap forced the field to completely rethink how position information flows into the model.
25.1.1 What went wrong with the original scheme
Sinusoidal encoding has two problems that compound each other.
The first is absolute position. The model learns patterns tied to specific slot numbers: "position 37 looks like this." At inference time, the moment you feed a sequence longer than the training length, the model encounters absolute positions it has never seen. Performance collapses.
The second is the injection point. Sinusoidal adds a position vector to the embedding before Attention. By the time Q and K compute their dot product, the position signal has been linearly mixed with the semantic signal through the weight matrices:
Q = (x + PE_m) × W_Q
K = (x + PE_n) × W_K
Q · K = cross-terms that tangle semantic and position
Expanding that dot product yields four terms: one purely semantic, one purely positional, and two cross-terms that mix the two. Position and content are inseparable, which makes it hard for the model to learn clean relative patterns.
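The expansion is easy to verify numerically. The sketch below (random vectors and weight matrices, purely for illustration) splits the additive-PE attention score into its four terms and checks that they sum to the full dot product:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x_m, x_n = rng.normal(size=d), rng.normal(size=d)    # token embeddings
pe_m, pe_n = rng.normal(size=d), rng.normal(size=d)  # position vectors
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))

q = (x_m + pe_m) @ W_Q
k = (x_n + pe_n) @ W_K
full_score = q @ k

# the four terms of the expansion
semantic   = (x_m @ W_Q) @ (x_n @ W_K)    # content-content
cross_1    = (x_m @ W_Q) @ (pe_n @ W_K)   # content-position
cross_2    = (pe_m @ W_Q) @ (x_n @ W_K)   # position-content
positional = (pe_m @ W_Q) @ (pe_n @ W_K)  # position-position

assert np.isclose(full_score, semantic + cross_1 + cross_2 + positional)
```

The two mixed terms are the ones the model cannot cleanly separate: they change whenever either the content or the position changes.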
Traditional learned relative encodings (T5-style) fixed the relative problem but broke KV Cache compatibility. Every new token would require recomputing attention over the full history because the relative position table changes as the sequence grows.
The question the field kept asking: can we get relative position behavior and keep KV Cache working?
25.1.2 The five mainstream schemes
| Method | Full name | Representative models |
|---|---|---|
| Sinusoidal | Sine/Cosine Position Embedding | original Transformer |
| T5 Relative | Learned Relative Embeddings | T5, mT5 |
| RoPE | Rotary Position Embedding | LLaMA, GPT-NeoX, Mistral |
| YaRN | Yet another RoPE extensioN | Code Llama, Qwen |
| ALiBi | Attention with Linear Biases | BLOOM, MPT |
This chapter covers RoPE, ALiBi, and YaRN in depth. The others appear where needed for contrast.
25.2 Sinusoidal: The Additive Baseline
Quick recap before we move forward.
25.2.1 The formula
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Each position gets a deterministic vector with the same dimension as the token embedding. The model adds this vector to the embedding table lookup before anything else happens.
Embedding vector (semantic content):
| Token | dim 1 | dim 2 | dim 3 | dim 4 |
|---|---|---|---|---|
| "agent" | 0.62 | -0.51 | 0.09 | 0.85 |
| "opened" | 0.07 | 0.23 | -0.40 | 0.11 |
Position vector (slot information):
| Position | dim 1 | dim 2 | dim 3 | dim 4 |
|---|---|---|---|---|
| 1 | 0.00 | 1.00 | 0.00 | 1.00 |
| 2 | 0.84 | 0.54 | 0.68 | 0.73 |
| 3 | 0.90 | -0.41 | 0.99 | 0.07 |
Input to Transformer = Embedding + Position.
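The formula above is only a few lines of code. A minimal sketch (function name and shapes are mine, not from any particular library):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)
    pos = np.arange(max_len)[:, None]            # [max_len, 1]
    two_i = np.arange(0, d_model, 2)[None, :]    # [1, d_model/2]
    angles = pos / (10000 ** (two_i / d_model))  # [max_len, d_model/2]
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(8, 4)
# the first position is sin(0) = 0 in even dims and cos(0) = 1 in odd dims,
# matching the first row of the position table above
assert np.allclose(pe[0], [0.0, 1.0, 0.0, 1.0])
```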
25.2.2 Why "before Attention" is the weak point
Once you add position before Attention and pass through W_Q and W_K, relative distance becomes implicit. The model can learn to extract it, but it has to work harder. And when you ask the model to process position 5000 and it was only trained to 4096, it is encountering absolute slots it has never seen. There is no graceful degradation, just confusion.
Early GPT models had strict context limits for exactly this reason. You get what you train on and nothing beyond.
25.3 RoPE: Rotary Position Embedding
RoPE was proposed by Su Jianlin in 2021 and quickly became the dominant method for decoder-only LLMs. LLaMA 1, LLaMA 2, Mistral, GPT-NeoX, and many others use it.
25.3.1 The revolutionary idea: from addition to rotation
Instead of adding a position vector to the token embedding, RoPE rotates the Q and K vectors inside Attention using a rotation matrix that depends on position.
The contrast in one sentence:
- Sinusoidal: embedding + position_vector → multiply through W_Q and W_K
- RoPE: multiply through W_Q and W_K → rotate by position angle
This sounds like a small difference. The consequences are large.
25.3.2 2D geometric intuition
Start in 2D. You have two vectors w1 and w2 in the plane. If you rotate both by the same angle theta, their relative angle is unchanged. And since dot product depends only on the angle between vectors and their magnitudes:
w1 · w2 = |w1| |w2| cos(angle)
rotating both by the same theta leaves the dot product the same. Rotating by different amounts changes the dot product in a way that depends on the difference of the rotation angles.
That is the key geometric fact. If you rotate Q at position m by angle m * theta and rotate K at position n by angle n * theta, their dot product ends up depending on (n - m) * theta---the relative distance.
The 2D rotation matrix is:
[w1'] [cos(theta) -sin(theta)] [w1]
[w2'] = [sin(theta) cos(theta)] [w2]
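A quick numeric check of both geometric facts (a standalone sketch, not model code):

```python
import numpy as np

def rot(theta):
    # 2D rotation matrix
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

w1 = np.array([1.0, 2.0])
w2 = np.array([-0.5, 3.0])

# rotating both vectors by the same angle preserves the dot product
a, b = rot(0.7) @ w1, rot(0.7) @ w2
assert np.isclose(a @ b, w1 @ w2)

# rotating by m*theta and n*theta: the score depends only on (n - m) * theta
theta = 0.3
q1, k1 = rot(2 * theta) @ w1, rot(5 * theta) @ w2    # positions m=2, n=5
q2, k2 = rot(7 * theta) @ w1, rot(10 * theta) @ w2   # positions m=7, n=10
assert np.isclose(q1 @ k1, q2 @ k2)                  # same relative distance 3
```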
25.3.3 Extending to high dimensions
Real query and key vectors have head_dim dimensions (typically 64, 128, or more). RoPE handles this by splitting the vector into head_dim / 2 pairs of dimensions and rotating each pair independently:
- dimensions 1-2: angle m * theta_1
- dimensions 3-4: angle m * theta_2
- ...
- dimensions (head_dim-1) to head_dim: angle m * theta_{head_dim/2}
Each pair uses a different frequency, theta_i = 10000^(-2(i-1)/d), so some pairs rotate quickly (encoding fine, local distances) while others rotate slowly (encoding coarse, long-range position).
For a d=6 vector with illustrative frequencies (chosen for readability, not from the formula), the three pairs get angles:
| Position m | Pair 1 (theta=0.1) | Pair 2 (theta=0.2) | Pair 3 (theta=0.4) |
|---|---|---|---|
| m=0 | 0.0 | 0.0 | 0.0 |
| m=1 | 0.1 | 0.2 | 0.4 |
| m=2 | 0.2 | 0.4 | 0.8 |
| m=3 | 0.3 | 0.6 | 1.2 |
25.3.4 The full rotation matrix
For a d-dimensional vector at position m, RoPE applies a block-diagonal rotation matrix R_m:
[cos(m*θ₁) -sin(m*θ₁) 0 0 ... 0 0 ]
[sin(m*θ₁) cos(m*θ₁) 0 0 ... 0 0 ]
R_m = [ 0 0 cos(m*θ₂) -sin(m*θ₂) ... 0 0 ]
[ 0 0 sin(m*θ₂) cos(m*θ₂) ... 0 0 ]
[ ... ... ... ... ... ... ... ]
[ 0 0 0 0 ... cos(m*θ_{d/2}) -sin(m*θ_{d/2})]
[ 0 0 0 0 ... sin(m*θ_{d/2}) cos(m*θ_{d/2})]
Each 2×2 block is an independent rotation. The matrix is sparse and efficient to apply.
25.3.5 Why relative position emerges automatically
This is the elegant part. Let q_m be the query at position m and k_n be the key at position n.
After applying RoPE:
q_m' = R_m * q_m
k_n' = R_n * k_n
The Attention score becomes:
q_m' · k_n' = (R_m * q_m)ᵀ (R_n * k_n)
= q_mᵀ * R_mᵀ * R_n * k_n
= q_mᵀ * R_{n-m} * k_n
The last step uses the rotation matrix property R_mᵀ * R_n = R_{n-m}: rotating back by m*theta and forward by n*theta is the same as rotating by (n - m)*theta. The Attention score depends only on the relative distance (n - m), not on the absolute positions m and n separately. Relative position emerges from the math without any extra machinery.
KV Cache still works because k_n' is a deterministic function of k_n and position n alone. When you extend the sequence by one token, you compute k_n' for that token and append it to the cache. No recomputation of older keys required.
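The derivation can be checked directly by assembling the block-diagonal matrices by hand (a standalone numpy sketch with made-up frequencies):

```python
import numpy as np

def R(pos, thetas):
    # block-diagonal rotation: one 2x2 rotation of angle pos*theta_i per pair
    d = 2 * len(thetas)
    M = np.zeros((d, d))
    for i, t in enumerate(thetas):
        c, s = np.cos(pos * t), np.sin(pos * t)
        M[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return M

thetas = [1.0, 0.1, 0.01]   # three frequency pairs, d = 6
m, n = 2, 7
# R_m^T R_n = R_{n-m}: only the relative distance survives
assert np.allclose(R(m, thetas).T @ R(n, thetas), R(n - m, thetas))
```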
25.3.6 Long-distance decay
RoPE has one more nice property: as the relative distance increases, the upper bound on the Attention score decreases. This matches the empirical observation that nearby tokens are usually more relevant than distant ones. The model gets a soft locality prior without any hand-engineered falloff.
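You can see the shape of this decay by computing the score between two fixed vectors at growing distances. With q = k = all-ones, the RoPE'd dot product is proportional to the sum of cos(distance * theta_i) over the frequency pairs, so a few lines suffice (a sketch with the standard base of 10000):

```python
import numpy as np

head_dim, base = 64, 10000.0
thetas = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)

def score(distance):
    # q'.k' for q = k = all-ones is proportional to this cosine sum;
    # the constant factor does not affect the shape of the curve
    return np.sum(np.cos(distance * thetas))

# maximal at distance 0 (all cosines equal 1), smaller at any larger distance
assert score(0) == head_dim / 2
assert all(score(d) < score(0) for d in [16, 64, 256, 1024])
```

The curve oscillates rather than decaying monotonically, but its envelope shrinks with distance, which is the soft locality prior described above.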
25.3.7 Efficient implementation with complex numbers
Multiplying dense rotation matrices is expensive. The standard implementation uses complex arithmetic:
import torch

def apply_rope(x, freqs):
    # x: [batch, seq_len, n_heads, head_dim]
    # freqs: [seq_len, head_dim // 2]
    # treat pairs of reals as complex numbers (requires float32/float64)
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    # build e^(i * theta) for each position and dimension pair
    freqs_complex = torch.polar(torch.ones_like(freqs), freqs)
    # reshape to [1, seq_len, 1, head_dim // 2] to broadcast over batch and heads
    freqs_complex = freqs_complex.view(1, freqs.shape[0], 1, freqs.shape[1])
    # complex multiply = rotation
    x_rotated = x_complex * freqs_complex
    # back to reals
    return torch.view_as_real(x_rotated).flatten(-2)
Complex multiplication (a + bi)(c + di) = (ac - bd) + (ad + bc)i is exactly the 2D rotation formula. You get the rotation at the cost of four multiplications and two additions per pair---much cheaper than a full matrix multiply.
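A quick check that multiplying by e^(i*theta) matches the 2×2 rotation matrix (standalone numpy sketch):

```python
import numpy as np

theta = 0.8
v = np.array([0.3, -1.2])   # one (x1, x2) dimension pair

# rotation-matrix version
Rv = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]]) @ v

# complex version: (x1 + i*x2) * e^(i*theta)
z = complex(v[0], v[1]) * np.exp(1j * theta)

assert np.allclose(Rv, [z.real, z.imag])
```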
25.4 ALiBi: Attention with Linear Biases
Press et al. proposed ALiBi in 2021. It takes a completely different philosophy from RoPE.
25.4.1 The idea: penalize distance directly
ALiBi does not touch embeddings or Q/K vectors at all. It adds a penalty to the Attention scores after the Q·K dot product:
Standard: Attention = softmax(Q Kᵀ / sqrt(d)) V
ALiBi: Attention = softmax(Q Kᵀ / sqrt(d) + m * bias) V
The bias matrix encodes relative distance with a simple triangular structure:
bias = [ 0 ] (query at position 1)
[ -1 0 ] (query at position 2)
[ -2 -1 0 ] (query at position 3)
[ -3 -2 -1 0 ] (query at position 4)
[ -4 -3 -2 -1 0 ] (query at position 5)
m is a per-head slope. The slopes are not learned---they are fixed at initialization as powers of 2 spaced across the number of heads.
25.4.2 What this does to Attention
Say the current query is at position 5, attending to positions 1 through 5. With slope m=1:
| Target | Distance | Bias | Effect after softmax |
|---|---|---|---|
| position 1 | 4 | -4 | strongly suppressed |
| position 2 | 3 | -3 | significantly suppressed |
| position 3 | 2 | -2 | moderately suppressed |
| position 4 | 1 | -1 | slightly suppressed |
| position 5 | 0 | 0 | no change |
Nearby tokens get more weight. The bias implements locality without any learned parameters.
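The table's qualitative effect is easy to reproduce: even when the raw Q·K scores are identical, the bias alone tilts the softmax toward recent positions. A sketch with slope m = 1 and artificially equal scores:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.zeros(5)                             # pretend Q.K is equal everywhere
bias = np.array([-4.0, -3.0, -2.0, -1.0, 0.0])   # slope 1, query at position 5
weights = softmax(scores + bias)

# attention mass is strictly increasing toward the query's own position
assert np.all(np.diff(weights) > 0)
assert np.isclose(weights.sum(), 1.0)
```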
25.4.3 Why ALiBi extrapolates well
The bias matrix is deterministic and position-invariant. You can compute it for any sequence length without training data. A model trained at 1024 tokens encounters positions it has never seen at inference time for lengths of 2048 or 4096---but the bias formula is the same. Experiments show ALiBi models trained at 1024 can extrapolate to 2048+ with minimal degradation.
That is the reason BLOOM (176B) and MPT-7B chose ALiBi for their architecture. They wanted aggressive long-context extrapolation without additional fine-tuning.
25.4.4 Implementation
import torch

def alibi_bias(n_heads, seq_len):
    # per-head slopes: 2^(-8/n_heads * 1), 2^(-8/n_heads * 2), ...
    slopes = 2 ** (-8 / n_heads * torch.arange(1, n_heads + 1))
    # relative distance matrix: distances[i, j] = j - i (<= 0 for past positions)
    positions = torch.arange(seq_len)
    distances = positions.unsqueeze(0) - positions.unsqueeze(1)  # [seq, seq]
    # bias = slope * distance, broadcast across heads; the positive upper
    # triangle is irrelevant once the causal mask is applied
    bias = slopes.view(-1, 1, 1) * distances.unsqueeze(0)  # [heads, seq, seq]
    return bias
No extra parameters. No calibration data. The simplicity is the point.
25.4.5 Tradeoffs
Strengths:
- Essentially free to implement
- No additional parameters
- Strong zero-shot extrapolation
Weaknesses:
- Simple linear penalty may be too coarse for tasks needing precise long-range information
- Some retrieval-heavy tasks do better with RoPE
- The slope schedule is a fixed hyperparameter choice, not tunable per task
25.5 YaRN: Extending RoPE Beyond Training Length
RoPE's relative-position behavior is excellent within the training context length. Beyond it, the model encounters rotation angles it has never been trained on, and performance degrades. YaRN (Yet another RoPE extensioN) is designed specifically to fix this.
25.5.1 The extrapolation problem
Suppose a model is trained with a 4k context. At inference with 8k tokens:
- Positions 1-4000: familiar rotation angles
- Positions 4001-8000: rotation angles the model has never produced during training
The Attention patterns for the unfamiliar part of the sequence can collapse.
25.5.2 Position Interpolation (PI): the simple fix
The straightforward approach is to compress longer sequences into the trained range:
f'(x_m, m, theta) = f(x_m, m * L / L', theta)
where L is the training length (4k) and L' is the target length (8k). Position 8000 maps to position 4000. The model sees only familiar angles.
The cost: high-frequency pairs lose precision. Adjacent tokens that differ by one position now differ by half a position unit. Fine-grained local information gets blurred.
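In code, PI is nothing more than rescaling the position index before computing the rotation angles. A self-contained sketch (function name and shapes are mine):

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0, scale=1.0):
    # Position Interpolation: compress positions by scale = L / L'
    # before the per-pair angle computation
    thetas = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions * scale, thetas)

train_len, target_len = 4096, 8192
plain = rope_angles(np.arange(target_len), 64)
interp = rope_angles(np.arange(target_len), 64, scale=train_len / target_len)

# interpolated position 8190 gets exactly the angles of trained position 4095:
# every angle stays inside the range seen during training
assert np.allclose(interp[8190], plain[4095])
```

The blurring cost is also visible here: after interpolation, adjacent positions differ by only half the original angle step.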
25.5.3 NTK-Aware Interpolation: smarter scaling
NTK-Aware interpolation applies different scale factors to different frequency bands:
- Low-frequency pairs (long-range): scale more aggressively. They handle coarse positional identity.
- High-frequency pairs (short-range): scale less. They encode local distinctions that must stay precise.
This is the scheme used in Code Llama, Qwen 7B, and several other models when their context windows were extended.
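One common way to implement NTK-aware scaling (the base-adjustment formula below is the widely circulated original proposal; treat it as an assumption rather than each model's exact recipe) is to leave positions untouched and instead enlarge the RoPE base, which slows the low-frequency pairs far more than the high-frequency ones:

```python
import numpy as np

def ntk_thetas(head_dim, base=10000.0, s=1.0):
    # NTK-aware: scale the base by s^(d/(d-2)) instead of scaling positions
    adjusted_base = base * s ** (head_dim / (head_dim - 2))
    return 1.0 / adjusted_base ** (np.arange(0, head_dim, 2) / head_dim)

orig = ntk_thetas(64)
scaled = ntk_thetas(64, s=8.0)   # 8x context extension

# highest-frequency pair (exponent 0) is exactly unchanged: local precision kept
assert np.isclose(scaled[0], orig[0])
# lowest-frequency pair is slowed by roughly the full factor s = 8
assert scaled[-1] / orig[-1] < 0.2
```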
25.5.4 YaRN's complete formula
YaRN adds one more ingredient beyond NTK-Aware: Attention temperature scaling.
f'(x_m, m, theta) = f(x_m, g(m), h(theta))
The temperature adjustment:
softmax(Q_mᵀ K_n / (t * sqrt(d)))
where: sqrt(1/t) = 0.1 * ln(s) + 1, s = L' / L
As the scale factor s grows (longer context), t adjusts the softmax temperature to keep the Attention distribution from becoming too diffuse across many more tokens.
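The temperature itself is trivial to compute. A sketch of the t(s) relationship (values checked by evaluating the formula, not taken from the paper's tables):

```python
import math

def yarn_temperature(s):
    # sqrt(1/t) = 0.1 * ln(s) + 1  =>  t = 1 / (0.1 * ln(s) + 1)^2
    mscale = 0.1 * math.log(s) + 1.0
    return 1.0 / mscale ** 2

# no extension: s = 1 leaves the softmax untouched
assert math.isclose(yarn_temperature(1.0), 1.0)
# larger extensions give t < 1, which sharpens the logits (dividing by t grows
# the scores) and counteracts the diffusion over many more tokens
assert yarn_temperature(8.0) < yarn_temperature(2.0) < 1.0
```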
The practical win: a model trained on 4k tokens can be YaRN-extended to 32k or 128k with less than 0.1% of the original pretraining token count used for fine-tuning. For example:
- Original: 4k context, trained on 1T tokens
- YaRN extension: 32k context, fine-tuned on ~1B tokens (0.1%)
That is why Code Llama's extended variants and Qwen's long-context versions exist without full retraining.
25.6 Comparison
25.6.1 Technical comparison
| Feature | Sinusoidal | RoPE | ALiBi | YaRN |
|---|---|---|---|---|
| Injection point | embedding | Q and K inside Attention | Attention scores | Q and K inside Attention |
| Position type | absolute | relative | relative | relative |
| Extrapolation | poor | medium | strong | strong |
| KV Cache compatible | yes | yes | yes | yes |
| Extra parameters | none | none | none | none |
| Compute overhead | low | medium | low | medium |
25.6.2 Which model uses what
| Model | Encoding | Context |
|---|---|---|
| GPT-3 | learned absolute | 2048 |
| LLaMA 1 | RoPE | 2048 |
| LLaMA 2 | RoPE | 4096 |
| Code Llama | RoPE + YaRN | 16384 |
| Mistral 7B | RoPE | 8192 |
| BLOOM | ALiBi | 2048 |
| MPT-7B | ALiBi | 65536 |
| Qwen | RoPE + Dynamic NTK | 8192-32768 |
25.6.3 Decision guide
Use RoPE if:
- You need precise local and mid-range position information
- You are working within the training context length
- You want compatibility with the LLaMA/Mistral ecosystem
Use ALiBi if:
- You need strong zero-shot length extrapolation
- You want the simplest possible implementation
- Memory and compute are tight
Use YaRN if:
- You have an existing RoPE model and need to extend context length
- You have a small fine-tuning budget (1B tokens or less)
- Target length is 16k, 32k, or higher
25.7 Implementation Reference
25.7.1 RoPE frequencies
import torch

def precompute_freqs(head_dim, max_seq_len, theta=10000.0):
    # theta_i = 1 / (theta ^ (2i / head_dim))
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len)
    # outer product: [seq_len, head_dim // 2], entry [m, i] = m * theta_i
    return torch.outer(positions, freqs)
25.7.2 ALiBi slopes
import torch

def get_alibi_slopes(n_heads):
    # ratio between adjacent slopes: 2^(-8/n_heads)
    ratio = 2 ** (-8 / n_heads)
    # slopes for each head: ratio, ratio^2, ratio^3, ...
    return ratio ** torch.arange(1, n_heads + 1)
25.8 Chapter Summary
25.8.1 Key concepts
| Concept | Meaning |
|---|---|
| RoPE | Rotates Q and K by position-dependent angle; relative distance falls out of the dot product |
| ALiBi | Adds a linear distance penalty to Attention scores after Q·K |
| YaRN | Rescales RoPE frequency bands to handle context beyond training length |
| Absolute vs relative | Sinusoidal encodes slot number; RoPE and ALiBi encode distance |
| Extrapolation | Behavior when inference length exceeds training length |
25.8.2 The evolution in one diagram
2017: Sinusoidal
| Problem: absolute position, poor extrapolation
v
2021: RoPE (Su Jianlin)
| Problem: degrades beyond training length
v
2021: ALiBi (Press et al.)
| Simple linear falloff, strong extrapolation
v
2023: YaRN (Peng et al.)
| Extends RoPE to 100k+ with minimal retraining
v
Current: 128k+ contexts are standard
25.8.3 Core takeaway
The job of positional encoding is to tell the model who comes before whom. Sinusoidal uses addition to assign absolute addresses. RoPE uses rotation so relative distance appears from inside Attention. ALiBi uses a penalty to make far tokens speak more quietly. None is universally best---choose based on your context length requirements and ecosystem.
Chapter Checklist
After this chapter, you should be able to:
- Explain why Sinusoidal encoding struggles at long context.
- Describe the geometric intuition behind RoPE using 2D rotation.
- Derive why RoPE Attention scores depend on relative position (n-m) rather than absolute positions m and n.
- Explain ALiBi's mechanism: what the bias matrix looks like and why it extrapolates.
- Describe what Position Interpolation and NTK-Aware Interpolation do differently.
- Explain YaRN's temperature adjustment and its role in long-context extension.
- Choose the right encoding scheme for a given model architecture and context requirement.
See You in the Next Chapter
That is enough for position encoding. If you can explain why RoPE Attention scores are a function of (n-m) and not of m and n separately, you have internalized the core idea.
Now we move from architecture choices to adaptation. Chapter 26 covers LoRA and QLoRA---the practical workhorses for fine-tuning large models on commodity hardware.