One-sentence summary: Token embeddings say "what this word is," positional encodings say "where this word sits." Transformers combine them by addition rather than concatenation because addition preserves both signals without increasing model width.
14.1 Two Inputs, One Step Before Everything Else
In Chapters 4 and 5, we covered the two components that prepare input for the Transformer:
| Chapter | Component | Purpose |
|---|---|---|
| Chapter 4 | Embedding | Converts token IDs into vectors (semantic information) |
| Chapter 5 | Positional Encoding | Adds location information to each vector |
This chapter goes one level deeper: why are these two signals combined by addition rather than concatenation?
The answer involves a few geometric intuitions, a parameter-count argument, and one honest acknowledgment that neural networks are not spreadsheets.
14.2 The Nature of Each Signal
14.2.1 Token Embeddings: Semantic Information
Recall the embedding lookup table from Chapter 4:
Each token has a corresponding vector in a table of shape [vocab_size, d_model]. The vector encodes the token's semantic meaning:
- "agent" and "workflow" live near each other (both in the agentic computation domain)
- "agent" and "integer" live far apart (different semantic territory)
The embedding answers: what does this token mean?
14.2.2 Positional Encodings: Location Information
The positional encoding from Chapter 5 assigns a unique vector to each position in the sequence:
- Position 0 has its own vector
- Position 1 has a different vector
- Position 15 has yet another
The positional encoding answers: where in the sequence does this token appear?
14.2.3 Why Both Matter
Consider these two prompts:
"The agent reviewed the PR."
"The PR reviewed the agent."
Same tokens. Different order. Completely different meaning.
The Transformer needs both signals simultaneously:
- What each token is (Embedding)
- Where each token sits (Positional Encoding)
The question is how to combine them.
14.3 Addition vs Concatenation
14.3.1 Two Options
There are two intuitive ways to combine two vectors:
Option 1: Concatenation
input = [Embedding; Positional Encoding]
resulting dimension: d_model + d_model = 2 × d_model
Option 2: Addition
input = Embedding + Positional Encoding
resulting dimension: d_model (unchanged)
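Both options can be checked directly on a small toy setup (shapes only, random values):

```python
import torch

d_model = 512
emb = torch.randn(16, d_model)  # 16 tokens, one semantic vector each
pos = torch.randn(16, d_model)  # 16 positions, one position vector each

# Option 1: concatenation widens the representation
concatenated = torch.cat([emb, pos], dim=-1)
print(concatenated.shape)  # torch.Size([16, 1024])

# Option 2: addition keeps the width unchanged
added = emb + pos
print(added.shape)  # torch.Size([16, 512])
```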
The Transformer uses addition. Why?
14.3.2 The Case Against Concatenation
Concatenation doubles the vector width.
If d_model = 512, concatenation produces a 1024-dimensional vector. Every subsequent operation — the Q, K, V projections, the FFN layers, the output projection — would now need to handle 1024-dimensional inputs instead of 512. If the whole stack widens to match, that means:
- Every weight matrix roughly quadruples in parameter count (both its input and output dimensions double)
- Every matrix multiplication becomes roughly four times more expensive (compute scales with the product of input and output dimensions)
- Weight memory quadruples along with the parameter count, and activation memory doubles
You could project back down to 512 afterward, but then you have added an extra linear layer just to undo the concatenation. Net effect: more complexity, more parameters, no demonstrated benefit.
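The parameter-count argument can be made concrete. Assuming a single square projection matrix and ignoring biases:

```python
d_model = 512

# With addition, a Q projection maps d_model -> d_model
params_addition = d_model * d_model     # 262,144 parameters

# If concatenation widened the whole stack to 2 * d_model,
# the same projection would map 1024 -> 1024
params_concat = (2 * d_model) ** 2      # 1,048,576 parameters

print(params_concat // params_addition)  # 4
```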
Addition keeps the contract simple:
[d_model] + [d_model] → [d_model]
Everything downstream sees the same shape it expected.
14.3.3 The Case For Addition
1. Dimensions do not grow
The model width d_model stays fixed from the embedding layer through the entire Transformer stack. Every block, every projection, every norm layer operates on the same shape. This consistency is architecturally clean.
2. Information can coexist in high dimensions
In a 512-dimensional space, two independent signals can be added without necessarily destroying each other. Think of it geometrically: each dimension can carry a component of the semantic signal and a component of the positional signal simultaneously.
A concrete example:
embedding = [0.5, 0.3, -0.2, 0.8, ...] # semantic signal
position = [0.1, 0.0, 0.1, -0.1, ...] # position signal
combined = [0.6, 0.3, -0.1, 0.7, ...] # both present
The combined vector is not the same as either input alone, but both signals have influenced it.
3. The Attention layers can learn to separate them
The learned weight matrices Wq, Wk, Wv project the combined vector into different subspaces. Over training, some dimensions of the projection can learn to be sensitive to semantic patterns, others to positional patterns. The model does not need us to hand it separate signals — it can learn to extract what it needs.
14.3.4 An Analogy
Think of a version-controlled codebase where each commit has:
- Content (what changed in the code)
- Timestamp (when it happened)
You could store these in two separate databases and always look them both up. Or you could design a combined record that encodes both — and train your query system to interpret both from one lookup.
The Transformer chooses the combined record approach. The downstream "query system" (Attention) is powerful enough to make use of it.
14.4 A Concrete Calculation
14.4.1 Step by Step
The diagram traces one specific value through a training step:
Before training (forward pass):
embedding_value = 0.9 # one dimension of the token's embedding
positional_value = 0.1 # same dimension in the position vector
combined_value = 0.9 + 0.1 = 1.0
During training (backpropagation updates the embedding):
new_embedding_value = old_embedding_value - lr * gradient
= 0.9 - 0.1 * (-0.4)
= 0.9 + 0.04
= 0.94
Next forward pass:
new_combined_value = new_embedding_value + positional_value
= 0.94 + 0.1
= 1.04
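The trace above is a plain SGD update; the gradient value -0.4 and learning rate 0.1 are just the illustrative numbers from the trace:

```python
lr = 0.1
embedding_value = 0.9    # trainable: one dimension of the token embedding
positional_value = 0.1   # fixed sinusoidal value for the same dimension
gradient = -0.4          # illustrative gradient w.r.t. the embedding value

combined = embedding_value + positional_value       # 1.0
embedding_value = embedding_value - lr * gradient   # 0.94
new_combined = embedding_value + positional_value   # 1.04
```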
14.4.2 Key Observations
From this trace, three things stand out:
- Token embeddings are trainable — they update via backpropagation at each training step
- Positional encodings are fixed (in the original Transformer's sinusoidal scheme) — they do not change during training
- Addition happens every forward pass — it is not a one-time preprocessing step
14.5 Why This Design Is Sound
14.5.1 The Vector Space Perspective
In a high-dimensional vector space, a direction represents a concept. Adding two vectors can be thought of as:
- The embedding vector points in the "semantic direction" for this token
- The positional encoding vector adds a "positional offset"
The result points somewhere that carries both pieces of information. The Attention mechanism's learned projections then steer each head toward whichever type of information is relevant for its task.
14.5.2 The Orthogonality Intuition
The argument for why addition works rests loosely on an orthogonality assumption: semantic information and positional information tend to live in different "directions" within the high-dimensional space, so they do not cancel each other out when added.
This is not a theorem — it is an empirical observation. But it holds up in practice. Models trained this way do learn to distinguish word identity from word position.
14.5.3 What Attention Does With the Combined Signal
During Attention:
Q = (Embedding + PE) @ Wq
K = (Embedding + PE) @ Wk
Over training, Wq and Wk can develop dimensions that respond primarily to semantic content (so Q matches K based on meaning) and dimensions that respond primarily to position (so Q matches K based on location). A single head might specialize in one; another head might specialize in the other.
This is why separating the signals before Attention is not necessary — Attention can do the separation itself.
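A minimal sketch of this step, with toy dimensions and randomly initialized weights standing in for the learned projections:

```python
import torch
import torch.nn as nn

d_model = 8
combined = torch.randn(5, d_model)  # 5 tokens of (Embedding + PE)

# learned projections; here random stand-ins for trained Wq, Wk
Wq = nn.Linear(d_model, d_model, bias=False)
Wk = nn.Linear(d_model, d_model, bias=False)

Q = Wq(combined)                     # [5, d_model]
K = Wk(combined)                     # [5, d_model]
scores = Q @ K.T / d_model ** 0.5    # [5, 5] attention logits
```

During training, gradients shape Wq and Wk so that some of their output dimensions respond to the semantic component of the sum and others to the positional component.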
14.6 Variants: Different Positional Encoding Methods
14.6.1 Original Method: Fixed Sinusoidal Encoding
input = Embedding(token_ids) + PositionalEncoding(positions)
The original 2017 Transformer used fixed sinusoidal functions. No positional parameters are learned — the encoding is computed deterministically from the position index.
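The sinusoidal encoding alternates sine and cosine across dimensions, with wavelengths forming a geometric progression from 2π to 10000·2π:

```python
import math
import torch

def sinusoidal_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(max_len).float().unsqueeze(1)  # [max_len, 1]
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_encoding(16, 512)
print(pe.shape)        # torch.Size([16, 512])
print(pe.abs().max())  # all values lie in [-1, 1]
```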
14.6.2 Learned Positional Embeddings
GPT-style models use learned positional embeddings: the position vectors are trainable parameters, just like token embeddings.
import torch
import torch.nn as nn

class TransformerInput(nn.Module):
    def __init__(self, vocab_size, d_model, max_len, dropout=0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_len, d_model)  # learned
        self.dropout = nn.Dropout(dropout)
        self.scale = d_model ** 0.5

    def forward(self, x):
        # x: [batch_size, seq_len] token IDs
        token_emb = self.token_embedding(x)           # [batch, seq, d_model]
        token_emb = token_emb * self.scale            # optional scaling
        positions = torch.arange(x.size(1), device=x.device)
        pos_emb = self.position_embedding(positions)  # [seq, d_model]
        combined = token_emb + pos_emb                # [batch, seq, d_model]
        return self.dropout(combined)
Learned position embeddings let the model decide which positional patterns are useful for the task, rather than imposing a fixed structure.
14.6.3 RoPE: Rotary Position Embedding
More recent models use RoPE (Rotary Position Embedding), which takes a different approach. Instead of adding a position vector to the token embedding, RoPE applies a rotation to the Q and K vectors based on position:
Q_rotated = rotate(Q, position)
K_rotated = rotate(K, position)
RoPE advantages:
- Encodes relative position more explicitly (the dot product between two rotated vectors depends on their position difference)
- Generalizes better to sequence lengths not seen during training
- Used by LLaMA, GPT-NeoX, Mistral, and most modern open-source LLMs
We will revisit RoPE in Chapter 25. For now, the important point is that the interface stays the same: the model needs both content and order, and the design question is how to encode order efficiently.
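A minimal sketch of the rotation idea, pairing even and odd dimensions; real implementations differ in how dimensions are paired and fuse the rotation into the attention computation:

```python
import torch

def rotate(x, base=10000.0):
    # x: [seq_len, d] with d even; rotate each (even, odd) dimension
    # pair by a position-dependent angle, one frequency per pair
    seq_len, d = x.shape
    position = torch.arange(seq_len).float().unsqueeze(1)  # [seq, 1]
    freqs = base ** (-torch.arange(0, d, 2).float() / d)   # [d/2]
    angle = position * freqs                               # [seq, d/2]
    cos, sin = angle.cos(), angle.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

Q = torch.randn(16, 64)
Q_rotated = rotate(Q)  # same shape; position is now encoded in the rotation
```

Because each pair is rotated, vector norms are unchanged, and the dot product between a rotated query and a rotated key depends on the positions only through their difference.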
14.6.4 Comparing Approaches
| Type | Example | Advantages | Disadvantages |
|---|---|---|---|
| Fixed sinusoidal | Original Transformer | Can theoretically extrapolate to any length | May not be optimal for all tasks |
| Learned absolute | GPT-2, GPT-3 | Learns task-specific position patterns | Cannot generalize to lengths not seen in training |
| RoPE | LLaMA, Mistral | Better relative position, strong length generalization | Slightly more complex to implement |
| ALiBi | BLOOM | No position embeddings at all; strong length extrapolation | Adds a bias inside Attention scores, a different interface than input addition |
14.7 Dimension Tracking
Input token_ids: [4, 16] # 4 sequences, 16 tokens each
Token Embedding:
lookup: token_embedding([4, 16])
output: [4, 16, 512] # each token → 512-dim vector
Position Embedding:
positions: [0, 1, 2, ..., 15]
lookup: position_embedding([16])
output: [16, 512] # each position → 512-dim vector
broadcast: [4, 16, 512] # extend across batch
Addition:
[4, 16, 512] + [4, 16, 512] = [4, 16, 512]
Final output: [4, 16, 512] # semantic + positional, same shape
The shape never changes. Every component downstream sees [batch, seq, d_model].
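The same shape flow in code, relying on PyTorch broadcasting for the batch dimension:

```python
import torch

batch, seq_len, d_model = 4, 16, 512
token_emb = torch.randn(batch, seq_len, d_model)  # [4, 16, 512]
pos_emb = torch.randn(seq_len, d_model)           # [16, 512]

# [16, 512] broadcasts across the batch dimension to [4, 16, 512]
combined = token_emb + pos_emb
print(combined.shape)  # torch.Size([4, 16, 512])
```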
14.8 Common Questions
14.8.1 Does Addition Lose Information?
Not in practice. The key reasons:
- d_model = 512 (or 768, or 4096) provides enough dimensions that semantic and positional signals do not interfere catastrophically
- Attention's Wq, Wk, Wv projections can learn to extract either signal
- Multiple layers of Transformer blocks can progressively refine how the two signals are used
14.8.2 Why Must Positional Encodings Be Small?
If the positional encoding values are much larger than the token embedding values, they overwhelm the semantic signal:
embedding = [0.5, 0.3, -0.2] # semantic
position = [10, 20, -15] # positional — too large!
combined = [10.5, 20.3, -15.2] # mostly position, semantics drowned out
Sinusoidal encodings produce values in [-1, 1], matching the typical magnitude of embedding vectors. Learned position embeddings are also kept in a similar range through normal training dynamics.
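A quick check of this effect: cosine similarity between an embedding and the combined vector, with a modest versus an over-large positional component (random vectors, fixed seed):

```python
import torch

torch.manual_seed(0)
emb = torch.randn(512)             # semantic direction
pe_small = 0.1 * torch.randn(512)  # position signal at a modest scale
pe_large = 10.0 * torch.randn(512) # position signal far too large

def cosine(a, b):
    return (a @ b / (a.norm() * b.norm())).item()

print(cosine(emb, emb + pe_small))  # near 1: semantics preserved
print(cosine(emb, emb + pe_large))  # near 0: semantics drowned out
```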
14.8.3 Learned vs Fixed: Which Is Better?
| Type | Advantages | Disadvantages |
|---|---|---|
| Fixed (sinusoidal) | Works for any sequence length | Not necessarily optimal |
| Learned | Can learn task-optimal position patterns | Cannot extrapolate beyond training length |
In practice, learned position embeddings tend to perform slightly better on standard benchmarks, which is why GPT and BERT both use them. RoPE and ALiBi are increasingly preferred for long-context models.
14.9 Chapter Summary
14.9.1 Key Concepts
| Signal | Source | Represents | Trainable? |
|---|---|---|---|
| Token embedding | Embedding lookup table | Token semantics | Yes |
| Positional encoding | Position embedding / sinusoidal function | Sequence position | Depends on method |
| Combined input | Addition of both | Semantics + position | — |
14.9.2 Why Addition, Not Concatenation
- Dimension stability: d_model stays fixed throughout the architecture
- Both signals coexist: high-dimensional vectors have room for multiple signals
- Attention can separate them: learned Wq, Wk, Wv projections can specialize per signal
14.9.3 Core Takeaway
Token embeddings and positional encodings are combined by element-wise addition. This design keeps model width constant, lets both signals coexist in a high-dimensional space, and relies on Attention's learned projections to extract whichever signal matters for each relationship. It is a compact, effective design choice — not a limitation.
Chapter Checklist
After this chapter, you should be able to:
- Explain what token embeddings and positional encodings each represent.
- Explain why concatenation increases model width and why that is a problem.
- Explain why addition preserves both signals in a high-dimensional space.
- Describe how Attention uses the combined signal.
- Name at least three positional encoding variants and their tradeoffs.
See You in the Next Chapter
We now have all the pieces:
- Token embeddings (Chapter 4)
- Positional encodings (Chapter 5, revisited here)
- Attention with Q, K, and V (Chapters 9–12)
- Residual connections and Dropout (Chapter 13)
- The addition operation that combines content and position (this chapter)
Chapter 15 assembles these into a complete forward pass — tracing a sequence of raw text all the way through to output probabilities, dimension by dimension.