One-sentence summary: When sequence length jumps from thousands to millions, O(N²) full Attention hits a wall — Sparse Attention selects which tokens to attend to, Linear Attention changes the computation order, and Infini Attention adds a compressed memory that never grows.
24.1 Full Attention Hits a Wall
24.1.1 The Numbers Are Not Subtle
Standard self-Attention must compute a score for every pair of tokens. For a sequence of length N that is an N×N matrix:
| Sequence length | Attention matrix entries | Relative cost |
|---|---|---|
| 1,000 | 1,000,000 | 1× |
| 4,096 | ~16,800,000 | ~17× |
| 8,192 | ~67,000,000 | 67× |
| 100,000 | 10,000,000,000 | 10,000× |
| 1,000,000 | 1,000,000,000,000 | 1,000,000× |
A sequence ten times longer requires one hundred times more computation. That is why O(N²) is called quadratic: the cost grows with the square of the input, not in proportion to it.
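The table above is just N² arithmetic; a short Python loop reproduces the numbers:

```python
# The attention score matrix has N x N entries; relative cost is measured
# against the N = 1,000 baseline (1,000,000 entries).
for n in [1_000, 4_096, 8_192, 100_000, 1_000_000]:
    entries = n * n
    print(f"N = {n:>9,}: {entries:>19,} entries ({entries / 1_000_000:,.0f}x)")
```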
24.1.2 Memory Is the Real Bottleneck
Computation alone could be handled with more time. Memory cannot be stretched. In FP16, a full attention matrix costs:
| Sequence length | Attention matrix | Per-layer VRAM (FP16) |
|---|---|---|
| 4,096 | 4096 × 4096 | 32 MB |
| 8,192 | 8192 × 8192 | 128 MB |
| 32,768 | 32768 × 32768 | 2 GB |
| 131,072 | 131072 × 131072 | 32 GB |
A 32-layer model processing a 128k-token context needs about 1 TB just to hold the attention matrices across all layers. No GPU in production has that.
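The VRAM figures follow directly from two bytes per FP16 entry; a quick sanity check of the 128k case:

```python
def attn_matrix_bytes(n, bytes_per_entry=2):  # FP16 = 2 bytes per entry
    """Bytes needed for one N x N attention score matrix."""
    return n * n * bytes_per_entry

per_layer = attn_matrix_bytes(131_072)
print(per_layer / 2**30)        # 32.0 GiB per layer
print(32 * per_layer / 2**40)   # 1.0 TiB across 32 layers
```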
Flash Attention from Chapter 21 helps by never materializing the full matrix in HBM, but it cannot eliminate the fundamental O(N²) computation — it still processes every pair.
24.1.3 The Demand Is Real
Real applications need much longer contexts than standard training supports:
- Long code reviews: a large pull request with diffs can span tens of thousands of lines
- Document understanding: technical manuals, legal filings, book-length content
- Agent task histories: an agent that has executed hundreds of tool calls accumulates a long trace
- Video: one hour of encoded video frames can produce hundreds of thousands of tokens
This is the problem Sparse Attention, Linear Attention, and Infini Attention were designed to solve.
24.2 Sparse Attention: Not Every Token Needs to See Every Token
24.2.1 The Intuition
When you read page 500 of a technical document, do you need simultaneous access to every word from page 1? No. You need:
- The surrounding context (the last few paragraphs)
- Certain global anchors (section headings, key definitions)
- Occasionally, specific earlier material
Sparse Attention builds this structure into the attention mask. Instead of a dense N×N matrix of connections, it uses a carefully chosen sparse pattern.
24.2.2 Three Basic Sparse Patterns
Random Attention: each token attends to a randomly sampled set of other tokens. This looks unprincipled but has good theoretical properties: random graphs have short average path lengths, so information can propagate between any two tokens in O(log N) hops even though no single token sees the whole sequence.
Sliding Window (Local) Attention: each token attends to the w tokens on either side. This reflects the strong locality of natural language — most syntactic and semantic dependencies are short-range. Complexity: O(N × w).
Global Attention: a small set of special tokens (like [CLS], task-specific tokens, or the first token of each sentence) attend to and are attended by all other tokens. These global "hubs" collect and redistribute sequence-wide information without requiring every token to be connected to every other token.
24.3 Longformer and BigBird
24.3.1 Building a Sparse Attention Matrix Step by Step
Step 1 — Random tokens: add sparse random connections (shown in yellow) to ensure long-range information flow.
Step 2 — Window tokens: add the diagonal band of local connections. Every token attends to its immediate neighbors.
Step 3 — Global tokens: designate a few positions as global hubs. Their full rows and columns are filled in. Everything can reach everything else through at most one global hub.
The final matrix is sparse. Yellow cells are computed; gray cells are skipped entirely.
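A toy version of this three-step construction. All sizes here are chosen for illustration only, and real implementations use block-sparse kernels rather than a boolean mask, but the structure is the same:

```python
import torch

def bigbird_mask(n, window=3, n_global=2, n_random=2, seed=0):
    """Boolean [n, n] mask: True = this attention score is computed."""
    g = torch.Generator().manual_seed(seed)
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Step 1 - random connections for long-range information flow
    for i in range(n):
        mask[i, torch.randint(0, n, (n_random,), generator=g)] = True
    # Step 2 - sliding window: the diagonal band of local connections
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        mask[i, lo:hi] = True
    # Step 3 - global hubs: fill the full rows and columns of the first
    # n_global positions
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

m = bigbird_mask(64)
print(f"density: {m.float().mean().item():.2%}")  # small fraction of the 100% a dense mask would compute
```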
24.3.2 BigBird: Random + Window + Global
BigBird combines all three patterns:
BigBird Attention = Random + Window + Global
This combination is provably Turing complete: BigBird can, in principle, express anything full Attention can, just with fewer operations per token.
Complexity with window size w, g global tokens, and r random connections per token:
O(N × (w + g + r))
When w, g, and r are constants (not growing with N), this is O(N) — linear in sequence length.
24.3.3 Longformer vs BigBird
| Feature | Longformer | BigBird |
|---|---|---|
| Sliding window | Yes | Yes |
| Global attention | Task-specific tokens | Fixed positions + random |
| Random attention | No | Yes |
| Main use | Long document NLP | General long-sequence tasks |
| Complexity | O(N) | O(N) |
Longformer is simpler and task-specific: a QA model designates question tokens as global, so they can attend to the whole document passage. BigBird's random component provides a theoretical guarantee that any two tokens can communicate within a bounded number of hops.
24.4 Sliding Window Attention in Practice
24.4.1 The Locality Assumption
Language has strong local structure:
- Subject-verb agreement spans a handful of words in most sentences
- Pronoun resolution usually points to the nearest candidate noun
- Sentences in the same paragraph discuss the same topic
For the majority of tokens, the most relevant context is nearby. A sliding window of size w captures this for O(N × w) rather than O(N²).
24.4.2 Implementation
import math
import torch

def sliding_window_attention(Q, K, V, window_size):
    """Q, K, V: [seq_len, d_k]. Each token attends only to its local window."""
    n, d_k = Q.shape
    output = []
    for i in range(n):
        start = max(0, i - window_size // 2)
        end = min(n, i + window_size // 2 + 1)
        local_K = K[start:end]
        local_V = V[start:end]
        scores = Q[i] @ local_K.T / math.sqrt(d_k)
        weights = torch.softmax(scores, dim=-1)
        output.append(weights @ local_V)
    return torch.stack(output)
24.4.3 Stacking Layers Expands the Receptive Field
A single sliding-window layer with w = 5 sees 5 tokens. Two stacked layers see 9. In general, after L layers the effective receptive field is:
1 + L × (w - 1)
A 32-layer model with w = 512 reaches a theoretical receptive field of over 16,000 tokens — enough for most practical tasks.
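The receptive-field formula above is easy to check directly:

```python
def receptive_field(num_layers, window):
    """Effective receptive field after stacking sliding-window layers."""
    return 1 + num_layers * (window - 1)

print(receptive_field(2, 5))     # 9
print(receptive_field(32, 512))  # 16353
```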
24.4.4 Mistral 7B's Rolling Buffer Cache
Mistral 7B uses sliding window attention with w = 4096. A useful trick: the KV Cache only needs to store the most recent w tokens, making cache memory constant regardless of conversation length.
Standard KV Cache: stores all history → grows linearly with time
Rolling Buffer: stores only the last w tokens → constant size
This is the difference between a memory that fills forever and one that slides forward as the conversation grows.
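A minimal sketch of the idea. The class name `RollingBufferKVCache` is hypothetical; Mistral's actual implementation stores entries in a ring buffer indexed by position modulo w, but truncating to the last w positions shows the same constant-memory behavior:

```python
import torch

class RollingBufferKVCache:
    """Fixed-size KV cache: keeps only the most recent `window` positions."""
    def __init__(self, window, d_model):
        self.window = window
        self.k = torch.empty(0, d_model)
        self.v = torch.empty(0, d_model)

    def append(self, k_new, v_new):
        # Concatenate the new entries, then drop anything older than the window.
        self.k = torch.cat([self.k, k_new])[-self.window:]
        self.v = torch.cat([self.v, v_new])[-self.window:]

cache = RollingBufferKVCache(window=4, d_model=8)
for _ in range(10):  # ten decoding steps, one new token each
    cache.append(torch.randn(1, 8), torch.randn(1, 8))
print(cache.k.shape)  # torch.Size([4, 8]): constant, no matter how many steps
```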
24.5 Linear Attention: Changing the Computation Order
24.5.1 Where O(N²) Actually Comes From
Standard Attention:
Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V
Computation order:
1. Q @ K^T → N×N matrix (this is the O(N²) step)
2. softmax(...) → N×N matrix
3. result @ V → N×d matrix
The intermediate N×N matrix is the source of both the quadratic computation and the quadratic memory.
24.5.2 The Key Insight: Associativity
Matrix multiplication is associative:
(Q @ K^T) @ V = Q @ (K^T @ V)
If we compute K^T @ V first:
- K^T @ V has shape d × d (not N×N)
- Q @ (K^T @ V) has shape N × d
The intermediate matrix is now d × d instead of N × N. When N >> d (common for long sequences), this is a huge win.
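Without a softmax in the way, the two orderings give numerically identical results; only the intermediate shapes and the cost differ. A quick check in float64:

```python
import torch

torch.manual_seed(0)
N, d = 1000, 64
Q = torch.randn(N, d, dtype=torch.float64)
K = torch.randn(N, d, dtype=torch.float64)
V = torch.randn(N, d, dtype=torch.float64)

left = (Q @ K.T) @ V   # builds an N x N intermediate
right = Q @ (K.T @ V)  # builds only a d x d intermediate
print(torch.allclose(left, right))  # True: same result, very different cost
```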
Complexity comparison:
- Standard: O(N² × d) for the N×N matrix
- Linear (reordered): O(N × d²) via the d×d matrix
At N = 100,000 and d = 64: standard needs ~640 billion operations; linear needs ~410 million. Three orders of magnitude.
24.5.3 The Softmax Problem
There is an obstacle: the softmax breaks associativity.
softmax(Q @ K^T) @ V ≠ Q @ softmax(K^T) @ V
Linear Attention solves this by replacing softmax with a kernel function φ:
Standard Attention: softmax(Q @ K^T) @ V
Linear Attention: φ(Q) @ φ(K)^T @ V = φ(Q) @ (φ(K)^T @ V)
Where φ is typically something like elu(x) + 1 — a function that keeps values positive (needed for a valid attention distribution) but allows the reordering.
24.5.4 Linear Attention Code
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, feature_map=lambda x: F.elu(x) + 1):
    """
    Q, K, V: [batch, seq_len, d_model]
    feature_map: positive kernel, e.g. elu(x) + 1
    """
    Q_prime = feature_map(Q)  # [batch, seq_len, d_model]
    K_prime = feature_map(K)  # [batch, seq_len, d_model]
    # Compute K^T @ V first: [batch, d_model, d_model]
    KV = torch.einsum('bnd,bnm->bdm', K_prime, V)
    # Then Q @ (K^T @ V): [batch, seq_len, d_model]
    output = torch.einsum('bnd,bdm->bnm', Q_prime, KV)
    # Normalize so each query's attention weights sum to 1
    Z = torch.einsum('bnd,bd->bn', Q_prime, K_prime.sum(dim=1))
    return output / Z.unsqueeze(-1)
24.5.5 Why Linear Attention Is Not the Default
Linear Attention is elegant on paper. In practice:
- Expressiveness drops — replacing softmax with a kernel removes the sharp, peaked attention distributions that full Attention can produce. Some tasks need that sharpness.
- Numerical instability — certain kernel choices can cause gradient issues during training.
- Quality gap in benchmarks — on most standard NLP tasks, Linear Attention underperforms full Attention, often by a noticeable margin.
This is why Linear Attention has not displaced full Attention in production models, despite the theoretical appeal. Its contribution is conceptual: the idea of reordering the computation to eliminate the N×N matrix directly inspired Infini Attention.
24.6 Infini Attention: Unbounded Context with Fixed Memory
24.6.1 The Problem KV Cache Cannot Solve
Even with Sliding Window Attention, a growing conversation eventually exceeds the window. The model forgets the beginning.
Imagine an agent that is reading through a long codebase and answering questions. Each successive turn adds more context. Traditional KV Cache grows linearly — after enough turns, it either runs out of memory or the oldest context falls out of the window.
What we want: a memory that captures the full history but uses fixed, bounded storage.
24.6.2 The Core Idea
Google's Infini Attention (2024) gives each attention layer a compressed memory matrix of fixed size d × d. As new segments arrive, information is compressed into this matrix. The matrix never grows — it holds a "rolling summary" of everything seen so far.
Two components per attention layer:
1. Local Attention — standard (or Flash) Attention over the current segment. This captures precise, recent context.
2. Compressed Memory — a fixed d × d matrix that is queried and updated at every segment boundary. This carries summarized information from all past segments.
The final output blends both:
output = β × memory_output + (1 - β) × local_output
where β is a learned gate per layer.
24.6.3 How the Compressed Memory Works
Infini Attention borrows the Linear Attention idea: K^T @ V is a d × d summary of key-value pairs. Instead of discarding this after each segment, Infini Attention keeps it and accumulates:
import torch

class InfiniAttentionLayer:
    def __init__(self, d_model):
        self.memory = torch.zeros(d_model, d_model)  # compressed memory
        self.normalizer = torch.zeros(d_model)
        self.gate = torch.zeros(1)  # a learned parameter in the real model

    def forward(self, Q, K, V):
        eps = 1e-6
        # Retrieve from compressed memory (Linear Attention style)
        memory_out = (Q @ self.memory) / ((Q @ self.normalizer).unsqueeze(-1) + eps)
        # Local attention over the current segment
        # (standard_attention: full softmax attention, as in Chapter 21)
        local_out = standard_attention(Q, K, V)
        # Blend with the learned gate
        beta = torch.sigmoid(self.gate)
        output = beta * memory_out + (1 - beta) * local_out
        # Update compressed memory (incremental, additive)
        self.memory = self.memory + K.T @ V
        self.normalizer = self.normalizer + K.sum(dim=0)
        return output
The memory update is additive — new K/V information is folded in, old information gradually gets "diluted" by newer updates. This is the compression: the matrix always stays d × d, but it no longer represents any specific past token exactly.
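The fixed-size property is easy to see in isolation. This toy loop (dimensions chosen for illustration) folds three segments into the memory; the matrix never changes shape:

```python
import torch

torch.manual_seed(0)
d = 4
memory = torch.zeros(d, d)
for segment in range(3):
    K = torch.randn(8, d)  # keys for one 8-token segment
    V = torch.randn(8, d)  # values for the same segment
    memory = memory + K.T @ V  # additive update: contents blend, shape is fixed
print(memory.shape)  # torch.Size([4, 4]) no matter how many segments arrive
```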
24.6.4 Why This Achieves "Infinite" Context
The memory matrix size is d × d. It does not grow with sequence length or conversation turns. Process 1 million tokens or 10 million tokens — the same fixed matrix holds the summary.
This is analogous to human working memory: you cannot recall every sentence from a book you read last year, but you retain a compressed summary that lets you reason about its contents. The detail is lost; the essence is retained.
24.6.5 Performance Comparison
| Method | Complexity (per step) | Memory footprint | Information loss |
|---|---|---|---|
| Full Attention | O(N²) | O(N²) | None |
| Sparse Attention | O(N) | O(N) | Slight (sparse coverage) |
| Linear Attention | O(N) | O(N) | Moderate (kernel approximation) |
| Infini Attention | O(1)* | O(1)* | Compression loss |
*O(1) relative to total history length. Each segment still requires O(segment²) or O(segment) computation internally.
24.6.6 Connection to Gemini 1.5
Google announced Gemini 1.5 with a 1M-token context window in early 2024, around the same time the Infini Attention paper appeared. The architecture details were not publicly disclosed, but the timing and the technical alignment led many researchers to believe Infini-style memory is part of what makes million-token context feasible in production.
The key tradeoff Infini Attention acknowledges openly: compression means information loss. For tasks that require precise recall of an early detail — exact quote from the beginning of a document, a specific number from turn one of a long conversation — compressed memory may not be sufficient. For tasks that require reasoning over general themes and patterns across a long history, it often works well.
24.7 Combining Techniques: A Practical Toolkit
In production, these techniques are not mutually exclusive. The common combinations:
For 10k–50k tokens:
Flash Attention + GQA + Sliding Window
Mature, well-supported, minimal quality risk.
For 50k–200k tokens:
Flash Attention + Sparse Attention (BigBird/Longformer style) + RoPE or ALiBi
Requires more careful tuning of the sparse pattern; sparse patterns need to be hardware-friendly.
For 200k+ tokens or growing unbounded conversation:
Infini Attention + Flash Attention
Only compressed memory can handle genuinely unbounded growth.
Other levers in the toolkit:
- Small-window pre-training + large-window fine-tuning with RoPE scaling (Chapter 25)
- KV Cache quantization (INT8/FP8) to reduce cache footprint
- PagedAttention (vLLM) — manages KV pages like an OS manages virtual memory
- Sparse MLP routing (Light/Heavy branches) — route simple tokens through cheaper paths
These can be stacked. A real serving system for a million-token context might combine Flash Attention (efficient kernel), GQA (smaller cache), RoPE with NTK scaling (position extrapolation), and paged KV management (cache reuse across requests).
24.8 Worked Examples
24.8.1 Full Attention vs Sliding Window: Compute Count
Sequence length N = 8, dimension d = 4.
Full Attention:
- Q @ K^T: (8×4) × (4×8) → 64 outputs, 4 multiplications each = 256 ops
- result @ V: (8×8) × (8×4) → 32 outputs, 8 multiplications each = 256 ops
- Total: ~512 multiplications
Sliding Window (w = 3):
- Each token attends to 3 neighbors
- Per-token score: 3 × d = 12 multiplications; 8 tokens = 96 ops
- Per-token value sum: 3 × d = 12 multiplications; 8 tokens = 96 ops
- Total: ~192 multiplications, about 62% fewer than full Attention
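The counts above can be verified with a few lines (the sliding-window count treats every token as having a full window of w neighbors, as in the example):

```python
def full_attention_mults(n, d):
    # scores: n*n outputs with d mults each; values: n*d outputs with n mults each
    return n * n * d + n * d * n

def sliding_window_mults(n, d, w):
    # per token: w*d mults for scores + w*d mults for the weighted value sum
    return n * (w * d + w * d)

full = full_attention_mults(8, 4)   # 512
sw = sliding_window_mults(8, 4, 3)  # 192
print(full, sw, f"{1 - sw / full:.0%} fewer")
```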
24.8.2 Linear Attention Advantage at Scale
At N = 1000, d = 64:
| Method | Approximate cost |
|---|---|
| Full Attention | N² × d = 64,000,000 ops |
| Linear Attention | N × d² × 2 = 8,192,000 ops |
Linear Attention uses roughly 87% fewer operations at this scale, and the advantage grows with N.
24.9 Common Misconceptions
"Sparse Attention loses critical information." Well-designed sparse patterns like BigBird provide a graph-theoretic guarantee: any two tokens can communicate within a bounded number of hops via global or random connections. The information can propagate; it just takes multiple layers.
"Linear Attention is equivalent to Standard Attention." It is not. Replacing softmax with a kernel function changes the effective attention distribution. On tasks requiring sharp, peaked attention (like copying a specific token from far away), Linear Attention often underperforms. It is a different inductive bias, not a free approximation.
"Infini Attention perfectly preserves all history." Compression always loses information. Infini Attention is best thought of as a compressed summary, like notes you take while reading. Exact recall of early tokens is not guaranteed.
"These techniques are mutually exclusive." They are designed to stack. Production systems routinely combine Flash Attention (efficient kernel) + GQA (smaller cache) + Sliding Window (reduced computation) + paged memory (cache management). Each addresses a different part of the long-context problem.
24.10 Chapter Summary
| Concept | Key Point |
|---|---|
| Full Attention bottleneck | O(N²) computation and O(N²) memory — infeasible beyond ~32k tokens |
| Sliding Window | Each token attends to w neighbors; O(N × w); receptive field expands with depth |
| Global tokens | A few hub tokens attend everywhere; information reaches all positions through them |
| BigBird | Random + Window + Global = O(N) with theoretical information-flow guarantee |
| Linear Attention | Reorder K^T @ V before Q; avoids N×N intermediate; softmax replaced by kernel |
| Infini Attention | Fixed d×d compressed memory updated per segment; O(1) memory growth |
| Combining techniques | Flash Attention + GQA + Sparse + Paged cache is a typical production stack |
Chapter Checklist
After this chapter, you should be able to:
- Explain why O(N²) becomes infeasible and where the cost comes from.
- Describe Random, Window, and Global sparse attention patterns.
- Explain why BigBird's combination provides an O(N) complexity with information-flow guarantees.
- Explain how Linear Attention reorders computation and what it gives up.
- Describe the Infini Attention compressed memory and why it stays fixed-size.
- Choose an appropriate technique combination for a given context-length target.
Further Reading
- Longformer: The Long-Document Transformer (Beltagy et al., 2020)
- Big Bird: Transformers for Longer Sequences (Zaheer et al., 2020)
- Efficient Attention: Attention with Linear Complexities (Shen et al., 2018) — arXiv 1812.01243
- Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (Google, 2024)
- Mistral 7B Technical Report — Rolling Buffer KV Cache with Sliding Window
See You in the Next Chapter
We have now handled the memory and compute cost of long sequences. But there is a related problem we have been sidestepping: position encoding.
When you extend a model from 4k to 128k tokens, the position embeddings encounter positions they have never seen during training. Chapter 25 explains how positional encoding evolved — from sinusoidal encoding through RoPE, ALiBi, and YaRN — specifically to handle this extrapolation challenge.