One-sentence summary: When sequence length jumps from thousands to millions, O(N²) full Attention hits a wall — Sparse Attention selects which tokens to attend to, Linear Attention changes the computation order, and Infini Attention adds a compressed memory that never grows.
24.1 Full Attention Hits a Wall
24.1.1 The Numbers Are Not Subtle
Standard self-Attention must compute a score for every pair of tokens. For a sequence of length N that is an N×N matrix:
| Sequence length | Attention matrix entries | Relative cost |
|---|---|---|
| 1,000 | 1,000,000 | 1× |
| 4,096 | ~16,800,000 | ~17× |
| 8,192 | ~67,000,000 | 67× |
| 100,000 | 10,000,000,000 | 10,000× |
| 1,000,000 | 1,000,000,000,000 | 1,000,000× |
A sequence ten times longer requires one hundred times more computation. That is why O(N²) is called quadratic: the cost grows with the square of the input, not in proportion to it.
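The table above is just N² arithmetic; a short Python loop reproduces the numbers:

```python
# The attention score matrix has N x N entries; relative cost is measured
# against the N = 1,000 baseline (1,000,000 entries).
for n in [1_000, 4_096, 8_192, 100_000, 1_000_000]:
    entries = n * n
    print(f"N = {n:>9,}: {entries:>19,} entries ({entries / 1_000_000:,.0f}x)")
```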
24.1.2 Memory Is the Real Bottleneck
Computation alone could be handled with more time. Memory cannot be stretched. In FP16, a full attention matrix costs:
| Sequence length | Attention matrix | Per-layer VRAM (FP16) |
|---|---|---|
| 4,096 | 4096 × 4096 | 32 MB |
| 8,192 | 8192 × 8192 | 128 MB |
| 32,768 | 32768 × 32768 | 2 GB |
| 131,072 | 131072 × 131072 | 32 GB |
A 32-layer model processing a 128k-token context needs about 1 TB just to hold the attention matrices across all layers. No GPU in production has that.
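The VRAM figures follow directly from two bytes per FP16 entry; a quick sanity check of the 128k case:

```python
def attn_matrix_bytes(n, bytes_per_entry=2):  # FP16 = 2 bytes per entry
    """Bytes needed for one N x N attention score matrix."""
    return n * n * bytes_per_entry

per_layer = attn_matrix_bytes(131_072)
print(per_layer / 2**30)        # 32.0 GiB per layer
print(32 * per_layer / 2**40)   # 1.0 TiB across 32 layers
```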
Flash Attention from Chapter 21 helps by never materializing the full matrix in HBM, but it cannot eliminate the fundamental O(N²) computation — it still processes every pair.
24.1.3 The Demand Is Real
Real applications need much longer contexts than standard training supports:
- Long code reviews: a large pull request with diffs can span tens of thousands of lines
- Document understanding: technical manuals, legal filings, book-length content
- Agent task histories: an agent that has executed hundreds of tool calls accumulates a long trace
- Video: one hour of encoded video frames can produce hundreds of thousands of tokens
This is the problem Sparse Attention, Linear Attention, and Infini Attention were designed to solve.
24.2 Sparse Attention: Not Every Token Needs to See Every Token
24.2.1 The Intuition
When you read page 500 of a technical document, do you need simultaneous access to every word from page 1? No. You need:
- The surrounding context (the last few paragraphs)
- Certain global anchors (section headings, key definitions)
- Occasionally, specific earlier material
Sparse Attention builds this structure into the attention mask. Instead of a dense N×N matrix of connections, it uses a carefully chosen sparse pattern.
24.2.2 Three Basic Sparse Patterns
Random Attention: each token attends to a randomly sampled set of other tokens. This looks unprincipled but has good theoretical properties: random graphs have short average path lengths, so information can propagate between any two tokens in O(log N) hops even though no single token sees the whole sequence.
Sliding Window (Local) Attention: each token attends to the w tokens on either side. This reflects the strong locality of natural language — most syntactic and semantic dependencies are short-range. Complexity: O(N × w).
Global Attention: a small set of special tokens (like [CLS], task-specific tokens, or the first token of each sentence) attend to and are attended by all other tokens. These global "hubs" collect and redistribute sequence-wide information without requiring every token to be connected to every other token.
24.3 Longformer and BigBird
24.3.1 Building a Sparse Attention Matrix Step by Step
Step 1 — Random tokens: add sparse random connections (shown in yellow) to ensure long-range information flow.
Step 2 — Window tokens: add the diagonal band of local connections. Every token attends to its immediate neighbors.
Step 3 — Global tokens: designate a few positions as global hubs. Their full rows and columns are filled in. Everything can reach everything else through at most one global hub.
The final matrix is sparse. Yellow cells are computed; gray cells are skipped entirely.
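A toy version of this three-step construction. All sizes here are chosen for illustration only, and real implementations use block-sparse kernels rather than a boolean mask, but the structure is the same:

```python
import torch

def bigbird_mask(n, window=3, n_global=2, n_random=2, seed=0):
    """Boolean [n, n] mask: True = this attention score is computed."""
    g = torch.Generator().manual_seed(seed)
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Step 1 - random connections for long-range information flow
    for i in range(n):
        mask[i, torch.randint(0, n, (n_random,), generator=g)] = True
    # Step 2 - sliding window: the diagonal band of local connections
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        mask[i, lo:hi] = True
    # Step 3 - global hubs: fill the full rows and columns of the first
    # n_global positions
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

m = bigbird_mask(64)
print(f"density: {m.float().mean().item():.2%}")  # small fraction of the 100% a dense mask would compute
```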
24.3.2 BigBird: Random + Window + Global
BigBird combines all three patterns:
BigBird Attention = Random + Window + Global
This combination is provably Turing complete: BigBird can, in principle, express anything full Attention can, just with fewer operations per token.
Complexity with window size w, g global tokens, and r random connections per token:
O(N × (w + g + r))
When w, g, and r are constants (not growing with N), this is O(N) — linear in sequence length.
24.3.3 Longformer vs BigBird
| Feature | Longformer | BigBird |
|---|---|---|
| Sliding window | Yes | Yes |
| Global attention | Task-specific tokens | Fixed positions + random |
| Random attention | No | Yes |
| Main use | Long document NLP | General long-sequence tasks |
| Complexity | O(N) | O(N) |
Longformer is simpler and task-specific: a QA model designates question tokens as global, so they can attend to the whole document passage. BigBird's random component provides a theoretical guarantee that any two tokens can communicate within a bounded number of hops.
24.4 Sliding Window Attention in Practice
24.4.1 The Locality Assumption
Language has strong local structure:
- Subject-verb agreement spans a handful of words in most sentences
- Pronoun resolution usually points to the nearest candidate noun
- Sentences in the same paragraph discuss the same topic
For the majority of tokens, the most relevant context is nearby. A sliding window of size w captures this for O(N × w) rather than O(N²).
24.4.2 Implementation
import math
import torch

def sliding_window_attention(Q, K, V, window_size):
    """Q, K, V: [seq_len, d_k]. Each token attends only to its local window."""
    n, d_k = Q.shape
    output = []
    for i in range(n):
        start = max(0, i - window_size // 2)
        end = min(n, i + window_size // 2 + 1)
        local_K = K[start:end]
        local_V = V[start:end]
        scores = Q[i] @ local_K.T / math.sqrt(d_k)
        weights = torch.softmax(scores, dim=-1)
        output.append(weights @ local_V)
    return torch.stack(output)
24.4.3 Stacking Layers Expands the Receptive Field
A single sliding-window layer with w = 5 sees 5 tokens. Two stacked layers see 9. In general, after L layers the effective receptive field is:
1 + L × (w - 1)
A 32-layer model with w = 512 reaches a theoretical receptive field of over 16,000 tokens — enough for most practical tasks.
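The receptive-field formula above is easy to check directly:

```python
def receptive_field(num_layers, window):
    """Effective receptive field after stacking sliding-window layers."""
    return 1 + num_layers * (window - 1)

print(receptive_field(2, 5))     # 9
print(receptive_field(32, 512))  # 16353
```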
24.4.4 Mistral 7B's Rolling Buffer Cache
Mistral 7B uses sliding window attention with w = 4096. A useful trick: the KV Cache only needs to store the most recent w tokens, making cache memory constant regardless of conversation length.
Standard KV Cache: stores all history → grows linearly with time
Rolling Buffer: stores only the last w tokens → constant size
This is the difference between a memory that fills forever and one that slides forward as the conversation grows.
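A minimal sketch of the idea. The class name `RollingBufferKVCache` is hypothetical; Mistral's actual implementation stores entries in a ring buffer indexed by position modulo w, but truncating to the last w positions shows the same constant-memory behavior:

```python
import torch

class RollingBufferKVCache:
    """Fixed-size KV cache: keeps only the most recent `window` positions."""
    def __init__(self, window, d_model):
        self.window = window
        self.k = torch.empty(0, d_model)
        self.v = torch.empty(0, d_model)

    def append(self, k_new, v_new):
        # Concatenate the new entries, then drop anything older than the window.
        self.k = torch.cat([self.k, k_new])[-self.window:]
        self.v = torch.cat([self.v, v_new])[-self.window:]

cache = RollingBufferKVCache(window=4, d_model=8)
for _ in range(10):  # ten decoding steps, one new token each
    cache.append(torch.randn(1, 8), torch.randn(1, 8))
print(cache.k.shape)  # torch.Size([4, 8]): constant, no matter how many steps
```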
24.5 Linear Attention: Changing the Computation Order
24.5.1 Where O(N²) Actually Comes From
Standard Attention:
Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V
Computation order:
1. Q @ K^T → N×N matrix (this is the O(N²) step)
2. softmax(...) → N×N matrix
3. result @ V → N×d matrix
The intermediate N×N matrix is the source of both the quadratic computation and the quadratic memory.
24.5.2 The Key Insight: Associativity
Matrix multiplication is associative:
(Q @ K^T) @ V = Q @ (K^T @ V)
If we compute K^T @ V first:
- K^T @ V has shape d × d (not N×N)
- Q @ (K^T @ V) has shape N × d
The intermediate matrix is now d × d instead of N × N. When N >> d (common for long sequences), this is a huge win.
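Without a softmax in the way, the two orderings give numerically identical results; only the intermediate shapes and the cost differ. A quick check in float64:

```python
import torch

torch.manual_seed(0)
N, d = 1000, 64
Q = torch.randn(N, d, dtype=torch.float64)
K = torch.randn(N, d, dtype=torch.float64)
V = torch.randn(N, d, dtype=torch.float64)

left = (Q @ K.T) @ V   # builds an N x N intermediate
right = Q @ (K.T @ V)  # builds only a d x d intermediate
print(torch.allclose(left, right))  # True: same result, very different cost
```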
Complexity comparison:
- Standard: O(N² × d) for the N×N matrix
- Linear (reordered): O(N × d²) via the d×d matrix
At N = 100,000 and d = 64: standard needs ~640 billion operations; linear needs ~410 million. Three orders of magnitude.
24.5.3 The Softmax Problem
There is an obstacle: the softmax breaks associativity.
softmax(Q @ K^T) @ V ≠ Q @ softmax(K^T) @ V
Linear Attention solves this by replacing softmax with a kernel function φ:
Standard Attention: softmax(Q @ K^T) @ V
Linear Attention: φ(Q) @ φ(K)^T @ V = φ(Q) @ (φ(K)^T @ V)
Where φ is typically something like elu(x) + 1 — a function that keeps values positive (needed for a valid attention distribution) but allows the reordering.
24.5.4 Linear Attention Code
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, feature_map=lambda x: F.elu(x) + 1):
    """
    Q, K, V: [batch, seq_len, d_model]
    feature_map: positive kernel, e.g. elu(x) + 1
    """
    Q_prime = feature_map(Q)  # [batch, seq_len, d_model]
    K_prime = feature_map(K)  # [batch, seq_len, d_model]
    # Compute K^T @ V first: [batch, d_model, d_model]
    KV = torch.einsum('bnd,bnm->bdm', K_prime, V)
    # Then Q @ (K^T @ V): [batch, seq_len, d_model]
    output = torch.einsum('bnd,bdm->bnm', Q_prime, KV)
    # Normalize so each query's attention weights sum to 1
    Z = torch.einsum('bnd,bd->bn', Q_prime, K_prime.sum(dim=1))
    return output / Z.unsqueeze(-1)
24.5.5 Why Linear Attention Is Not the Default
Linear Attention is elegant on paper. In practice:
- Expressiveness drops — replacing softmax with a kernel removes the sharp, peaked attention distributions that full Attention can produce. Some tasks need that sharpness.
- Numerical instability — certain kernel choices can cause gradient issues during training.
- Quality gap in benchmarks — on most standard NLP tasks, Linear Attention underperforms full Attention, often by a noticeable margin.
This is why Linear Attention has not displaced full Attention in production models, despite the theoretical appeal. Its contribution is conceptual: the idea of reordering the computation to eliminate the N×N matrix directly inspired Infini Attention.
24.6 Infini Attention: Unbounded Context with Fixed Memory
24.6.1 The Problem KV Cache Cannot Solve
Even with Sliding Window Attention, a growing conversation eventually exceeds the window. The model forgets the beginning.
Imagine an agent that is reading through a long codebase and answering questions. Each successive turn adds more context. Traditional KV Cache grows linearly — after enough turns, it either runs out of memory or the oldest context falls out of the window.
What we want: a memory that captures the full history but uses fixed, bounded storage.
24.6.2 The Core Idea
Google's Infini Attention (2024) gives each attention layer a compressed memory matrix of fixed size d × d. As new segments arrive, information is compressed into this matrix. The matrix never grows — it holds a "rolling summary" of everything seen so far.
Two components per attention layer:
1. Local Attention — standard (or Flash) Attention over the current segment. This captures precise, recent context.
2. Compressed Memory — a fixed d × d matrix that is queried and updated at every segment boundary. This carries summarized information from all past segments.
The final output blends both:
output = β × memory_output + (1 - β) × local_output
where β is a learned gate per layer.
24.6.3 How the Compressed Memory Works
Infini Attention borrows the Linear Attention idea: K^T @ V is a d × d summary of key-value pairs. Instead of discarding this after each segment, Infini Attention keeps it and accumulates:
import torch

class InfiniAttentionLayer:
    def __init__(self, d_model):
        self.memory = torch.zeros(d_model, d_model)  # compressed memory
        self.normalizer = torch.zeros(d_model)
        self.gate = torch.zeros(1)  # a learned parameter in the real model

    def forward(self, Q, K, V):
        eps = 1e-6
        # Retrieve from compressed memory (Linear Attention style)
        memory_out = (Q @ self.memory) / ((Q @ self.normalizer).unsqueeze(-1) + eps)
        # Local attention over the current segment
        # (standard_attention: full softmax attention, as in Chapter 21)
        local_out = standard_attention(Q, K, V)
        # Blend with the learned gate
        beta = torch.sigmoid(self.gate)
        output = beta * memory_out + (1 - beta) * local_out
        # Update compressed memory (incremental, additive)
        self.memory = self.memory + K.T @ V
        self.normalizer = self.normalizer + K.sum(dim=0)
        return output
The memory update is additive — new K/V information is folded in, old information gradually gets "diluted" by newer updates. This is the compression: the matrix always stays d × d, but it no longer represents any specific past token exactly.
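The fixed-size property is easy to see in isolation. This toy loop (dimensions chosen for illustration) folds three segments into the memory; the matrix never changes shape:

```python
import torch

torch.manual_seed(0)
d = 4
memory = torch.zeros(d, d)
for segment in range(3):
    K = torch.randn(8, d)  # keys for one 8-token segment
    V = torch.randn(8, d)  # values for the same segment
    memory = memory + K.T @ V  # additive update: contents blend, shape is fixed
print(memory.shape)  # torch.Size([4, 4]) no matter how many segments arrive
```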
24.6.4 Why This Achieves "Infinite" Context
The memory matrix size is d × d. It does not grow with sequence length or conversation turns. Process 1 million tokens or 10 million tokens — the same fixed matrix holds the summary.
This is analogous to human working memory: you cannot recall every sentence from a book you read last year, but you retain a compressed summary that lets you reason about its contents. The detail is lost; the essence is retained.
24.6.5 Performance Comparison
| Method | Complexity (per step) | Memory footprint | Information loss |
|---|---|---|---|
| Full Attention | O(N²) | O(N²) | None |
| Sparse Attention | O(N) | O(N) | Slight (sparse coverage) |
| Linear Attention | O(N) | O(N) | Moderate (kernel approximation) |
| Infini Attention | O(1)* | O(1)* | Compression loss |
*O(1) relative to total history length. Each segment still requires O(segment²) or O(segment) computation internally.
24.6.6 Connection to Gemini 1.5
Google announced Gemini 1.5 with a 1M-token context window in early 2024, around the same time the Infini Attention paper appeared. The architecture details were not publicly disclosed, but the timing and the technical alignment led many researchers to believe Infini-style memory is part of what makes million-token context feasible in production.
The key tradeoff Infini Attention acknowledges openly: compression means information loss. For tasks that require precise recall of an early detail — exact quote from the beginning of a document, a specific number from turn one of a long conversation — compressed memory may not be sufficient. For tasks that require reasoning over general themes and patterns across a long history, it often works well.
24.7 Combining Techniques: A Practical Toolkit
In production, these techniques are not mutually exclusive. The common combinations:
For 10k–50k tokens:
Flash Attention + GQA + Sliding Window
Mature, well-supported, minimal quality risk.
For 50k–200k tokens:
Flash Attention + Sparse Attention (BigBird/Longformer style) + RoPE or ALiBi
Requires more careful tuning of the sparse pattern; sparse patterns need to be hardware-friendly.
For 200k+ tokens or growing unbounded conversation:
Infini Attention + Flash Attention
Only compressed memory can handle genuinely unbounded growth.
Other levers in the toolkit:
- Small-window pre-training + large-window fine-tuning with RoPE scaling (Chapter 25)
- KV Cache quantization (INT8/FP8) to reduce cache footprint
- PagedAttention (vLLM) — manages KV pages like an OS manages virtual memory
- Sparse MLP routing (Light/Heavy branches) — route simple tokens through cheaper paths
These can be stacked. A real serving system for a million-token context might combine Flash Attention (efficient kernel), GQA (smaller cache), RoPE with NTK scaling (position extrapolation), and paged KV management (cache reuse across requests).
24.8 Worked Examples
24.8.1 Full Attention vs Sliding Window: Compute Count
Sequence length N = 8, dimension d = 4.
Full Attention:
- Q @ K^T: (8×4) × (4×8) → 64 outputs, 4 multiplications each = 256 ops
- result @ V: (8×8) × (8×4) → 32 outputs, 8 multiplications each = 256 ops
- Total: ~512 multiplications
Sliding Window (w = 3):
- Each token attends to 3 neighbors
- Per-token score: 3 × d = 12 multiplications; 8 tokens = 96 ops
- Per-token value sum: 3 × d = 12 multiplications; 8 tokens = 96 ops
- Total: ~192 multiplications, about 62% fewer than full Attention
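The counts above can be verified with a few lines (the sliding-window count treats every token as having a full window of w neighbors, as in the example):

```python
def full_attention_mults(n, d):
    # scores: n*n outputs with d mults each; values: n*d outputs with n mults each
    return n * n * d + n * d * n

def sliding_window_mults(n, d, w):
    # per token: w*d mults for scores + w*d mults for the weighted value sum
    return n * (w * d + w * d)

full = full_attention_mults(8, 4)   # 512
sw = sliding_window_mults(8, 4, 3)  # 192
print(full, sw, f"{1 - sw / full:.0%} fewer")
```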
24.8.2 Linear Attention Advantage at Scale
At N = 1000, d = 64:
| Method | Approximate cost |
|---|---|
| Full Attention | N² × d = 64,000,000 ops |
| Linear Attention | N × d² × 2 = 8,192,000 ops |
Linear Attention uses roughly 87% fewer operations at this scale, and the advantage grows with N.
24.9 Common Misconceptions
"Sparse Attention loses critical information." Well-designed sparse patterns like BigBird provide a graph-theoretic guarantee: any two tokens can communicate within a bounded number of hops via global or random connections. The information can propagate; it just takes multiple layers.
"Linear Attention is equivalent to Standard Attention." It is not. Replacing softmax with a kernel function changes the effective attention distribution. On tasks requiring sharp, peaked attention (like copying a specific token from far away), Linear Attention often underperforms. It is a different inductive bias, not a free approximation.
"Infini Attention perfectly preserves all history." Compression always loses information. Infini Attention is best thought of as a compressed summary, like notes you take while reading. Exact recall of early tokens is not guaranteed.
"These techniques are mutually exclusive." They are designed to stack. Production systems routinely combine Flash Attention (efficient kernel) + GQA (smaller cache) + Sliding Window (reduced computation) + paged memory (cache management). Each addresses a different part of the long-context problem.
24.10 Chapter Summary
| Concept | Key Point |
|---|---|
| Full Attention bottleneck | O(N²) computation and O(N²) memory — infeasible beyond ~32k tokens |
| Sliding Window | Each token attends to w neighbors; O(N × w); receptive field expands with depth |
| Global tokens | A few hub tokens attend everywhere; information reaches all positions through them |
| BigBird | Random + Window + Global = O(N) with theoretical information-flow guarantee |
| Linear Attention | Reorder K^T @ V before Q; avoids N×N intermediate; softmax replaced by kernel |
| Infini Attention | Fixed d×d compressed memory updated per segment; O(1) memory growth |
| Combining techniques | Flash Attention + GQA + Sparse + Paged cache is a typical production stack |
Chapter Checklist
After this chapter, you should be able to:
- Explain why O(N²) becomes infeasible and where the cost comes from.
- Describe Random, Window, and Global sparse attention patterns.
- Explain why BigBird's combination provides an O(N) complexity with information-flow guarantees.
- Explain how Linear Attention reorders computation and what it gives up.
- Describe the Infini Attention compressed memory and why it stays fixed-size.
- Choose an appropriate technique combination for a given context-length target.
Further Reading
- Longformer: The Long-Document Transformer (Beltagy et al., 2020)
- Big Bird: Transformers for Longer Sequences (Zaheer et al., 2020)
- Efficient Attention: Attention with Linear Complexities (Shen et al., 2018) — arXiv 1812.01243
- Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (Google, 2024)
- Mistral 7B Technical Report — Rolling Buffer KV Cache with Sliding Window
See You in the Next Chapter
We have now handled the memory and compute cost of long sequences. But there is a related problem we have been sidestepping: position encoding.
When you extend a model from 4k to 128k tokens, the position embeddings encounter positions they have never seen during training. Chapter 25 explains how positional encoding evolved — from sinusoidal encoding through RoPE, ALiBi, and YaRN — specifically to handle this extrapolation challenge.