One-sentence summary: Token embeddings say "what this word is," positional encodings say "where this word sits." Transformers combine them by addition rather than concatenation because addition preserves both signals without increasing model width.
14.1 Two Inputs, One Step Before Everything Else
In Chapters 4 and 5, we covered the two components that prepare input for the Transformer:
| Chapter | Component | Purpose |
|---|---|---|
| Chapter 4 | Embedding | Converts token IDs into vectors (semantic information) |
| Chapter 5 | Positional Encoding | Adds location information to each vector |
This chapter goes one level deeper: why are these two signals combined by addition rather than concatenation?
The answer involves a few geometric intuitions, a parameter-count argument, and one honest acknowledgment that neural networks are not spreadsheets.
14.2 The Nature of Each Signal
14.2.1 Token Embeddings: Semantic Information
Recall the embedding lookup table from Chapter 4:
Each token has a corresponding vector in a table of shape [vocab_size, d_model]. The vector encodes the token's semantic meaning:
- "agent" and "workflow" live near each other (both in the agentic computation domain)
- "agent" and "integer" live far apart (different semantic territory)
The embedding answers: what does this token mean?
14.2.2 Positional Encodings: Location Information
The positional encoding from Chapter 5 assigns a unique vector to each position in the sequence:
- Position 0 has its own vector
- Position 1 has a different vector
- Position 15 has yet another
The positional encoding answers: where in the sequence does this token appear?
14.2.3 Why Both Matter
Consider these two prompts:
"The agent reviewed the PR."
"The PR reviewed the agent."
Same tokens. Different order. Completely different meaning.
The Transformer needs both signals simultaneously:
- What each token is (Embedding)
- Where each token sits (Positional Encoding)
The question is how to combine them.
14.3 Addition vs Concatenation
14.3.1 Two Options
There are two intuitive ways to combine two vectors:
Option 1: Concatenation
input = [Embedding; Positional Encoding]
resulting dimension: d_model + d_model = 2 × d_model
Option 2: Addition
input = Embedding + Positional Encoding
resulting dimension: d_model (unchanged)
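Both options can be checked directly on a small toy setup (shapes only, random values):

```python
import torch

d_model = 512
emb = torch.randn(16, d_model)  # 16 tokens, one semantic vector each
pos = torch.randn(16, d_model)  # 16 positions, one position vector each

# Option 1: concatenation widens the representation
concatenated = torch.cat([emb, pos], dim=-1)
print(concatenated.shape)  # torch.Size([16, 1024])

# Option 2: addition keeps the width unchanged
added = emb + pos
print(added.shape)  # torch.Size([16, 512])
```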
The Transformer uses addition. Why?
14.3.2 The Case Against Concatenation
Concatenation doubles the vector width.
If d_model = 512, concatenation produces a 1024-dimensional vector. Every subsequent operation — the Q, K, V projections, the FFN layers, the output projection — would now need to handle 1024-dimensional inputs instead of 512. If the whole stack widens to match, that means:
- Every weight matrix roughly quadruples in parameter count (both its input and output dimensions double)
- Every matrix multiplication becomes roughly four times more expensive (compute scales with the product of input and output dimensions)
- Weight memory quadruples along with the parameter count, and activation memory doubles
You could project back down to 512 afterward, but then you have added an extra linear layer just to undo the concatenation. Net effect: more complexity, more parameters, no demonstrated benefit.
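The parameter-count argument can be made concrete. Assuming a single square projection matrix and ignoring biases:

```python
d_model = 512

# With addition, a Q projection maps d_model -> d_model
params_addition = d_model * d_model     # 262,144 parameters

# If concatenation widened the whole stack to 2 * d_model,
# the same projection would map 1024 -> 1024
params_concat = (2 * d_model) ** 2      # 1,048,576 parameters

print(params_concat // params_addition)  # 4
```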
Addition keeps the contract simple:
[d_model] + [d_model] → [d_model]
Everything downstream sees the same shape it expected.
14.3.3 The Case For Addition
1. Dimensions do not grow
The model width d_model stays fixed from the embedding layer through the entire Transformer stack. Every block, every projection, every norm layer operates on the same shape. This consistency is architecturally clean.
2. Information can coexist in high dimensions
In a 512-dimensional space, two independent signals can be added without necessarily destroying each other. Think of it geometrically: each dimension can carry a component of the semantic signal and a component of the positional signal simultaneously.
A concrete example:
embedding = [0.5, 0.3, -0.2, 0.8, ...] # semantic signal
position = [0.1, 0.0, 0.1, -0.1, ...] # position signal
combined = [0.6, 0.3, -0.1, 0.7, ...] # both present
The combined vector is not the same as either input alone, but both signals have influenced it.
3. The Attention layers can learn to separate them
The learned weight matrices Wq, Wk, Wv project the combined vector into different subspaces. Over training, some dimensions of the projection can learn to be sensitive to semantic patterns, others to positional patterns. The model does not need us to hand it separate signals — it can learn to extract what it needs.
14.3.4 An Analogy
Think of a version-controlled codebase where each commit has:
- Content (what changed in the code)
- Timestamp (when it happened)
You could store these in two separate databases and always look them both up. Or you could design a combined record that encodes both — and train your query system to interpret both from one lookup.
The Transformer chooses the combined record approach. The downstream "query system" (Attention) is powerful enough to make use of it.
14.4 A Concrete Calculation
14.4.1 Step by Step
The diagram traces one specific value through a training step:
Before training (forward pass):
embedding_value = 0.9 # one dimension of the token's embedding
positional_value = 0.1 # same dimension in the position vector
combined_value = 0.9 + 0.1 = 1.0
During training (backpropagation updates the embedding):
new_embedding_value = old_embedding_value - lr * gradient
= 0.9 - 0.1 * (-0.4)
= 0.9 + 0.04
= 0.94
Next forward pass:
new_combined_value = new_embedding_value + positional_value
= 0.94 + 0.1
= 1.04
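The trace above is a plain SGD update; the gradient value -0.4 and learning rate 0.1 are just the illustrative numbers from the trace:

```python
lr = 0.1
embedding_value = 0.9    # trainable: one dimension of the token embedding
positional_value = 0.1   # fixed sinusoidal value for the same dimension
gradient = -0.4          # illustrative gradient w.r.t. the embedding value

combined = embedding_value + positional_value       # 1.0
embedding_value = embedding_value - lr * gradient   # 0.94
new_combined = embedding_value + positional_value   # 1.04
```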
14.4.2 Key Observations
From this trace, three things stand out:
- Token embeddings are trainable — they update via backpropagation at each training step
- Positional encodings are fixed (in the original Transformer's sinusoidal scheme) — they do not change during training
- Addition happens every forward pass — it is not a one-time preprocessing step
14.5 Why This Design Is Sound
14.5.1 The Vector Space Perspective
In a high-dimensional vector space, a direction represents a concept. Adding two vectors can be thought of as:
- The embedding vector points in the "semantic direction" for this token
- The positional encoding vector adds a "positional offset"
The result points somewhere that carries both pieces of information. The Attention mechanism's learned projections then steer each head toward whichever type of information is relevant for its task.
14.5.2 The Orthogonality Intuition
The argument for why addition works rests loosely on an orthogonality assumption: semantic information and positional information tend to live in different "directions" within the high-dimensional space, so they do not cancel each other out when added.
This is not a theorem — it is an empirical observation. But it holds up in practice. Models trained this way do learn to distinguish word identity from word position.
14.5.3 What Attention Does With the Combined Signal
During Attention:
Q = (Embedding + PE) @ Wq
K = (Embedding + PE) @ Wk
Over training, Wq and Wk can develop dimensions that respond primarily to semantic content (so Q matches K based on meaning) and dimensions that respond primarily to position (so Q matches K based on location). A single head might specialize in one; another head might specialize in the other.
This is why separating the signals before Attention is not necessary — Attention can do the separation itself.
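A minimal sketch of this step, with toy dimensions and randomly initialized weights standing in for the learned projections:

```python
import torch
import torch.nn as nn

d_model = 8
combined = torch.randn(5, d_model)  # 5 tokens of (Embedding + PE)

# learned projections; here random stand-ins for trained Wq, Wk
Wq = nn.Linear(d_model, d_model, bias=False)
Wk = nn.Linear(d_model, d_model, bias=False)

Q = Wq(combined)                     # [5, d_model]
K = Wk(combined)                     # [5, d_model]
scores = Q @ K.T / d_model ** 0.5    # [5, 5] attention logits
```

During training, gradients shape Wq and Wk so that some of their output dimensions respond to the semantic component of the sum and others to the positional component.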
14.6 Variants: Different Positional Encoding Methods
14.6.1 Original Method: Fixed Sinusoidal Encoding
input = Embedding(token_ids) + PositionalEncoding(positions)
The original 2017 Transformer used fixed sinusoidal functions. No positional parameters are learned — the encoding is computed deterministically from the position index.
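The sinusoidal encoding alternates sine and cosine across dimensions, with wavelengths forming a geometric progression from 2π to 10000·2π:

```python
import math
import torch

def sinusoidal_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(max_len).float().unsqueeze(1)  # [max_len, 1]
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_encoding(16, 512)
print(pe.shape)        # torch.Size([16, 512])
print(pe.abs().max())  # all values lie in [-1, 1]
```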
14.6.2 Learned Positional Embeddings
GPT-style models use learned positional embeddings: the position vectors are trainable parameters, just like token embeddings.
import torch
import torch.nn as nn

class TransformerInput(nn.Module):
    def __init__(self, vocab_size, d_model, max_len, dropout=0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_len, d_model)  # learned
        self.dropout = nn.Dropout(dropout)
        self.scale = d_model ** 0.5

    def forward(self, x):
        # x: [batch_size, seq_len] token IDs
        token_emb = self.token_embedding(x)           # [batch, seq, d_model]
        token_emb = token_emb * self.scale            # optional scaling
        positions = torch.arange(x.size(1), device=x.device)
        pos_emb = self.position_embedding(positions)  # [seq, d_model]
        combined = token_emb + pos_emb                # [batch, seq, d_model]
        return self.dropout(combined)
Learned position embeddings let the model decide which positional patterns are useful for the task, rather than imposing a fixed structure.
14.6.3 RoPE: Rotary Position Embedding
More recent models use RoPE (Rotary Position Embedding), which takes a different approach. Instead of adding a position vector to the token embedding, RoPE applies a rotation to the Q and K vectors based on position:
Q_rotated = rotate(Q, position)
K_rotated = rotate(K, position)
RoPE advantages:
- Encodes relative position more explicitly (the dot product between two rotated vectors depends on their position difference)
- Generalizes better to sequence lengths not seen during training
- Used by LLaMA, GPT-NeoX, Mistral, and most modern open-source LLMs
We will revisit RoPE in Chapter 25. For now, the important point is that the interface stays the same: the model needs both content and order, and the design question is how to encode order efficiently.
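A minimal sketch of the rotation idea, pairing even and odd dimensions; real implementations differ in how dimensions are paired and fuse the rotation into the attention computation:

```python
import torch

def rotate(x, base=10000.0):
    # x: [seq_len, d] with d even; rotate each (even, odd) dimension
    # pair by a position-dependent angle, one frequency per pair
    seq_len, d = x.shape
    position = torch.arange(seq_len).float().unsqueeze(1)  # [seq, 1]
    freqs = base ** (-torch.arange(0, d, 2).float() / d)   # [d/2]
    angle = position * freqs                               # [seq, d/2]
    cos, sin = angle.cos(), angle.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

Q = torch.randn(16, 64)
Q_rotated = rotate(Q)  # same shape; position is now encoded in the rotation
```

Because each pair is rotated, vector norms are unchanged, and the dot product between a rotated query and a rotated key depends on the positions only through their difference.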
14.6.4 Comparing Approaches
| Type | Example | Advantages | Disadvantages |
|---|---|---|---|
| Fixed sinusoidal | Original Transformer | Can theoretically extrapolate to any length | May not be optimal for all tasks |
| Learned absolute | GPT-2, GPT-3 | Learns task-specific position patterns | Cannot generalize to lengths not seen in training |
| RoPE | LLaMA, Mistral | Better relative position, strong length generalization | Slightly more complex to implement |
| ALiBi | BLOOM | No position embeddings at all; strong length extrapolation | Adds a bias inside Attention scores, a different interface than input addition |
14.7 Dimension Tracking
Input token_ids: [4, 16] # 4 sequences, 16 tokens each
Token Embedding:
lookup: token_embedding([4, 16])
output: [4, 16, 512] # each token → 512-dim vector
Position Embedding:
positions: [0, 1, 2, ..., 15]
lookup: position_embedding([16])
output: [16, 512] # each position → 512-dim vector
broadcast: [4, 16, 512] # extend across batch
Addition:
[4, 16, 512] + [4, 16, 512] = [4, 16, 512]
Final output: [4, 16, 512] # semantic + positional, same shape
The shape never changes. Every component downstream sees [batch, seq, d_model].
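The same shape flow in code, relying on PyTorch broadcasting for the batch dimension:

```python
import torch

batch, seq_len, d_model = 4, 16, 512
token_emb = torch.randn(batch, seq_len, d_model)  # [4, 16, 512]
pos_emb = torch.randn(seq_len, d_model)           # [16, 512]

# [16, 512] broadcasts across the batch dimension to [4, 16, 512]
combined = token_emb + pos_emb
print(combined.shape)  # torch.Size([4, 16, 512])
```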
14.8 Common Questions
14.8.1 Does Addition Lose Information?
Not in practice. The key reasons:
- d_model = 512 (or 768, or 4096) provides enough dimensions that semantic and positional signals do not interfere catastrophically
- Attention's Wq, Wk, Wv projections can learn to extract either signal
- Multiple layers of Transformer blocks can progressively refine how the two signals are used
14.8.2 Why Must Positional Encodings Be Small?
If the positional encoding values are much larger than the token embedding values, they overwhelm the semantic signal:
embedding = [0.5, 0.3, -0.2] # semantic
position = [10, 20, -15] # positional — too large!
combined = [10.5, 20.3, -15.2] # mostly position, semantics drowned out
Sinusoidal encodings produce values in [-1, 1], matching the typical magnitude of embedding vectors. Learned position embeddings are also kept in a similar range through normal training dynamics.
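A quick check of this effect: cosine similarity between an embedding and the combined vector, with a modest versus an over-large positional component (random vectors, fixed seed):

```python
import torch

torch.manual_seed(0)
emb = torch.randn(512)             # semantic direction
pe_small = 0.1 * torch.randn(512)  # position signal at a modest scale
pe_large = 10.0 * torch.randn(512) # position signal far too large

def cosine(a, b):
    return (a @ b / (a.norm() * b.norm())).item()

print(cosine(emb, emb + pe_small))  # near 1: semantics preserved
print(cosine(emb, emb + pe_large))  # near 0: semantics drowned out
```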
14.8.3 Learned vs Fixed: Which Is Better?
| Type | Advantages | Disadvantages |
|---|---|---|
| Fixed (sinusoidal) | Works for any sequence length | Not necessarily optimal |
| Learned | Can learn task-optimal position patterns | Cannot extrapolate beyond training length |
In practice, learned position embeddings tend to perform slightly better on standard benchmarks, which is why GPT and BERT both use them. RoPE and ALiBi are increasingly preferred for long-context models.
14.9 Chapter Summary
14.9.1 Key Concepts
| Signal | Source | Represents | Trainable? |
|---|---|---|---|
| Token embedding | Embedding lookup table | Token semantics | Yes |
| Positional encoding | Position embedding / sinusoidal function | Sequence position | Depends on method |
| Combined input | Addition of both | Semantics + position | — |
14.9.2 Why Addition, Not Concatenation
- Dimension stability: d_model stays fixed throughout the architecture
- Both signals coexist: high-dimensional vectors have room for multiple signals
- Attention can separate them: learned Wq, Wk, Wv projections can specialize per signal
14.9.3 Core Takeaway
Token embeddings and positional encodings are combined by element-wise addition. This design keeps model width constant, lets both signals coexist in a high-dimensional space, and relies on Attention's learned projections to extract whichever signal matters for each relationship. It is a compact, effective design choice — not a limitation.
Chapter Checklist
After this chapter, you should be able to:
- Explain what token embeddings and positional encodings each represent.
- Explain why concatenation increases model width and why that is a problem.
- Explain why addition preserves both signals in a high-dimensional space.
- Describe how Attention uses the combined signal.
- Name at least three positional encoding variants and their tradeoffs.
See You in the Next Chapter
We now have all the pieces:
- Token embeddings (Chapter 4)
- Positional encodings (Chapter 5, revisited here)
- Attention with Q, K, and V (Chapters 9–12)
- Residual connections and Dropout (Chapter 13)
- The addition operation that combines content and position (this chapter)
Chapter 15 assembles these into a complete forward pass — tracing a sequence of raw text all the way through to output probabilities, dimension by dimension.