One-sentence summary: MoE decouples "how much the model knows" from "how much compute you spend per token" — and that gap is where all the interesting engineering happens.
30.1 The Core Idea
30.1.1 A Counterintuitive Result
When Mistral AI released Mixtral 8x7B in December 2023, the name confused people. Is it 8 models? Is it 56B parameters? What does "8x7B" mean?
The numbers are:
- Total parameters: 46.7B
- Active parameters per token: 12.9B
The model stores the knowledge of ~47B parameters but spends the compute of ~13B on each token. On most benchmarks, it matches or outperforms LLaMA 2 70B, which activates all 70B of its parameters for every token.
Mixtral 8x7B vs LLaMA 2 70B:
| | Mixtral 8x7B | LLaMA 2 70B |
|---|---|---|
| Total parameters | 46.7B | 70B |
| Active parameters | 12.9B | 70B |
| Inference speed | ~6× faster | baseline |
| MMLU | 70.6% | 68.9% |
This result is not magic. It follows from a simple architectural decision: replace the dense FFN with a sparse mixture of experts.
30.1.2 Dense vs Sparse Activation
In a standard Transformer, every token passes through the full FFN. If the FFN has 4B parameters, those 4B parameters run for every single token — whether the token is a period at the end of a sentence or a complex mathematical term.
Dense model:
Input token → [All parameters participate] → output
70B params → 70B activation every token
MoE model:
Input token → [Router selects 2 of 8 experts] → [2 experts compute] → output
46.7B params → 12.9B activation every token
The insight embedded in this: not all knowledge is needed simultaneously. A token that is part of a Python function call does not need the prose-writing specialists active. A token in an English narrative does not need the linear-algebra specialists active.
MoE operationalizes this intuition with a routing mechanism.
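Those activation numbers are just arithmetic. A toy check (the figures come straight from the comparison above):

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Fraction of the model's parameters that run for each token."""
    return active_params / total_params

# Dense: every parameter participates in every token
dense = active_fraction(70e9, 70e9)      # 1.0
# MoE: only the routed experts (plus attention etc.) participate
moe = active_fraction(46.7e9, 12.9e9)    # ≈ 0.276
```

Roughly a quarter of the stored parameters do work on any given token; the other three quarters sit in memory as latent capacity.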
30.1.3 The Routing Analogy
Think of a specialized engineering team. When you file a PR:
- A security-focused reviewer looks at authentication changes
- A performance reviewer looks at hot-path code
- A documentation reviewer looks at API changes
Not every reviewer reads every PR. The team lead (router) reads the PR description and assigns the right reviewers.
The routing team covers more expertise than any single generalist reviewer, but the cost per PR is bounded by how many reviewers actually participate.
MoE is the same: N experts, but only K are activated per token. The token gets specialist treatment; the compute stays bounded.
30.1.4 Brief History
MoE was proposed in 1991 by Jacobs et al. It took three decades to reach the frontier:
Timeline:
1991 Jacobs et al. — original MoE concept
2017 Shazeer et al. — Sparsely-Gated MoE, applied to NLP at scale
2021 Google Switch Transformer — 1.6T parameter MoE model
2022 Google GLaM — 1.2T parameters, competitive with GPT-3
2023 Mixtral 8x7B — open-source MoE, practical deployment
2024 DeepSeek-V3 — 671B total, 37B active, $5.5M training cost
The 2021–2024 acceleration is driven by two forces: inference cost becoming the dominant expense in deployed systems, and hardware becoming fast enough to make the routing overhead negligible.
30.2 MoE Architecture
30.2.1 Where MoE Lives in the Stack
The change from Transformer to MoE is surgical. Only one component changes per block:
Standard Transformer Block:
Input
↓
Self-Attention
↓
FFN (dense) ← this becomes MoE
↓
Output
MoE Transformer Block:
Input
↓
Self-Attention
↓
MoE Layer ← router + N expert FFNs
↓
Output
Everything else — the Attention mechanism, residual connections, LayerNorm, positional encoding — stays the same.
30.2.2 Inside the MoE Layer
An MoE layer has two components:
1. Router: a small linear layer that maps the token's hidden representation to a probability distribution over experts.
2. Expert networks: N independent FFNs, each with its own weights.
MoE Layer internals:
x (hidden_size)
↓
┌──────────────┐
│ Router │ Linear(hidden_size → num_experts) + Softmax
└──────┬───────┘
↓
┌──────────────┐
│ Top-K Gate │ Keep only the K highest-probability experts
└──────┬───────┘
↓
Selected K experts receive x:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ E0 │ E1 │ E2 │ E3 │ E4 │ E5 │ E6 │ E7 │ (8 experts total)
└────┴────┴────┴────┴────┴────┴────┴────┘
↓ ↓
Only selected experts compute.
Output = weighted sum of selected expert outputs
30.2.3 Router Mechanics
The router is a single linear layer followed by softmax:
# Input: x, shape = (batch_size, seq_len, hidden_size)
# Step 1: score each expert
router_logits = Linear(hidden_size, num_experts)(x)
# shape: (batch_size, seq_len, num_experts)
# Step 2: softmax over experts
router_probs = softmax(router_logits, dim=-1)
# e.g., for one token: [0.40, 0.30, 0.10, 0.05, 0.05, 0.03, 0.04, 0.03]
# Step 3: select Top-K
top_k_probs, top_k_indices = topk(router_probs, k=2)
# top_k_probs: [0.40, 0.30]
# top_k_indices: [0, 1] → Expert 0 and Expert 1
# Step 4: renormalize weights
weights = top_k_probs / top_k_probs.sum()
# weights: [0.57, 0.43]
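The four steps can be run end to end without any framework. This is a dependency-free sketch of the same logic (the example logits are made up for illustration):

```python
import math

def route_token(logits, k=2):
    # Step 2: softmax over expert scores (numerically stabilized)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Step 3: indices of the K highest-probability experts
    top_k = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    # Step 4: renormalize the selected probabilities so they sum to 1
    selected = sum(probs[i] for i in top_k)
    weights = [probs[i] / selected for i in top_k]
    return top_k, weights

indices, weights = route_token([2.0, 1.5, 0.3, -0.2, 0.1, -0.5, 0.0, -1.0])
# indices: the two highest-scoring experts; weights always sum to 1
```

In a real model this runs per token, batched as a single matmul plus `torch.topk`.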
The router learns to recognize token patterns and map them to experts. The specialization is not programmed; it emerges from training. Notably, Mistral's published routing analysis found little specialization by topic: ArXiv text, mathematics, and code route surprisingly similarly. The structure that does emerge is syntactic and positional, e.g., indentation tokens in Python code are consistently assigned to the same experts.
30.2.4 Expert Networks
Each expert is a standard FFN:
class Expert(nn.Module):
def __init__(self, hidden_size: int, intermediate_size: int):
super().__init__()
self.w1 = nn.Linear(hidden_size, intermediate_size, bias=False)
self.w2 = nn.Linear(intermediate_size, hidden_size, bias=False)
self.w3 = nn.Linear(hidden_size, intermediate_size, bias=False)
self.act = nn.SiLU()
def forward(self, x):
# SwiGLU gating: element-wise gate from w3 controls w1 output
return self.w2(self.act(self.w1(x)) * self.w3(x))
Each expert is independent — separate weights, separate gradients. Eight experts means eight times the FFN parameters. That is where the extra capacity comes from.
30.2.5 Top-K Selection: Why K=2?
Top-1 (one expert per token):
- Minimum compute
- Gradient flows to only one expert → training instability
- No redundancy if routing is wrong
Top-2 (two experts per token):
- Two experts can complement each other
- Gradient reaches two experts → more stable
- Standard practice for Mixtral and most production systems
Top-K with K > 2:
- Each additional expert reduces sparsity
- When K = N, the model degenerates to a dense FFN
- Compute grows linearly with K
The empirical winner is K = 2. It hits the sweet spot of stability, redundancy, and efficiency.
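The linear-in-K compute claim is easy to make concrete. A rough FLOPs sketch for a SwiGLU-style MoE FFN (dimensions taken from the Mixtral config discussed below; the 2-FLOPs-per-multiply-add convention is an assumption of the estimate):

```python
def moe_ffn_flops_per_token(hidden: int, inter: int, k: int) -> int:
    # A SwiGLU expert multiplies by three weight matrices (w1, w2, w3);
    # each matmul costs about 2 * hidden * inter FLOPs per token.
    per_expert = 2 * 3 * hidden * inter
    return k * per_expert

top2 = moe_ffn_flops_per_token(4096, 14336, k=2)   # Mixtral-like Top-2
dense = moe_ffn_flops_per_token(4096, 14336, k=8)  # K = N: dense-FFN cost
```

Doubling K doubles the FFN compute; at K = N the sparsity advantage vanishes entirely.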
30.2.6 Load Balancing
Left unconstrained, the router will collapse. It finds a small set of "safe" experts and routes almost everything there:
Collapsed routing (pathological case):
Expert 0: 85% of tokens ← bottleneck, gets all gradients
Expert 1: 9%
Expert 2: 3%
...
Expert 7: 0.1% ← nearly unused, parameters wasted
This creates two problems: Expert 0 becomes a computational bottleneck, and Experts 2–7 are barely trained.
Solution: auxiliary load-balancing loss
Add a penalty that fires when routing is uneven:
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_indices, num_experts):
    # router_probs: (batch, seq, num_experts); expert_indices: (batch, seq, top_k)
    # expert_fraction: how often each expert is selected (sum over the top_k slots)
    expert_mask = F.one_hot(expert_indices, num_experts).float()
    expert_fraction = expert_mask.sum(dim=2).mean(dim=(0, 1))  # per-expert selection rate
    # router_fraction: the average routing probability per expert
    router_fraction = router_probs.mean(dim=(0, 1))
    # penalty = num_experts × sum(fraction × probability)
    # minimal when the distribution is uniform
    aux_loss = num_experts * (expert_fraction * router_fraction).sum()
    return aux_loss
The intuition: if Expert 0 gets selected 85% of the time (high expert_fraction) AND the router assigns it high probability (high router_fraction), the product is large and the penalty is strong. This pushes the router toward uniform distribution.
The loss is scaled by a coefficient (typically aux_loss_coef = 0.01) and added to the main training loss.
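A quick numeric check of that intuition, with illustrative frequencies (for simplicity the mean router probabilities are taken equal to the selection frequencies): the penalty bottoms out at 1.0 for a uniform distribution and grows as routing concentrates.

```python
def aux_loss(selection_freq, mean_probs):
    # num_experts * sum(f_i * P_i); minimized when both are uniform
    n = len(selection_freq)
    return n * sum(f * p for f, p in zip(selection_freq, mean_probs))

uniform = [1 / 8] * 8
collapsed = [0.85, 0.09, 0.03, 0.01, 0.005, 0.005, 0.005, 0.005]

balanced = aux_loss(uniform, uniform)      # 1.0, the minimum
skewed = aux_loss(collapsed, collapsed)    # ≈ 5.85, strongly penalized
```

The collapsed distribution from the pathological example above pays nearly a 6× penalty, which the 0.01 coefficient then scales before adding to the main loss.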
30.3 Mixtral 8x7B
30.3.1 Configuration
| Parameter | Value | Note |
|---|---|---|
| Total parameters | 46.7B | All expert weights included |
| Active parameters | 12.9B | Only 2/8 experts active per token |
| Experts per layer | 8 | Each a full SwiGLU FFN |
| Active experts | 2 | Top-2 routing |
| Hidden dimension | 4096 | Same as LLaMA 2 |
| Layers | 32 | 32 Transformer blocks |
| Attention | GQA, 32 Q heads / 8 KV heads | Grouped-query for efficiency |
| Context length | 32K | With RoPE positional encoding |
30.3.2 Parameter Count Derivation
Where do the 46.7B parameters come from?
Embedding layer:
32000 × 4096 ≈ 131M
Per layer:
Self-Attention (GQA):
Q: 4096 × 4096 = 16.8M
K: 4096 × 1024 = 4.2M (8 KV heads × 128 head_dim)
V: 4096 × 1024 = 4.2M
O: 4096 × 4096 = 16.8M
Subtotal: ≈ 42M
MoE layer (8 experts, SwiGLU):
Each expert:
w1: 4096 × 14336 = 58.7M
w2: 14336 × 4096 = 58.7M
w3: 4096 × 14336 = 58.7M (gate)
Subtotal per expert: ≈ 176M
8 experts: ≈ 1,408M = 1.4B
Router: 4096 × 8 ≈ 33K (negligible)
Layer total: 42M + 1,408M ≈ 1.45B
32 layers: 1.45B × 32 ≈ 46.4B
With embedding and LM head: ≈ 46.7B total
Active parameter calculation:
Per token, only 2/8 experts run:
Non-MoE parts (attention × 32 layers): ≈ 1.3B
MoE parts (2/8 experts × 32 layers): ≈ 11.2B
Embedding: ≈ 0.4B
Total active: ≈ 12.9B
This is why Mixtral's per-token compute is comparable to that of a dense 13B model while it retains the knowledge capacity of a 47B model.
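The derivation above can be checked mechanically. This sketch reproduces the per-layer arithmetic (embedding and LM-head terms, roughly another 0.3B, are left out as in the layer-by-layer breakdown):

```python
HIDDEN, INTER, LAYERS = 4096, 14336, 32
N_EXPERTS, TOP_K, KV_DIM = 8, 2, 1024  # KV_DIM: 8 KV heads × 128 head_dim

attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # Q, O projections + K, V (GQA)
expert = 3 * HIDDEN * INTER                         # w1, w2, w3 (SwiGLU)

total = LAYERS * (attn + N_EXPERTS * expert)        # ≈ 46.4B before embeddings
active = LAYERS * (attn + TOP_K * expert)           # ≈ 12.6B before embeddings
```

Adding the embedding and LM-head parameters closes the gap to the quoted 46.7B total and 12.9B active.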
30.3.3 Mixtral vs LLaMA 2 70B
| Metric | Mixtral 8x7B | LLaMA 2 70B |
|---|---|---|
| Total parameters | 46.7B | 70B |
| Active parameters | 12.9B | 70B |
| Inference FLOPs | ~13B equivalent | 70B |
| Tokens/second | ~6× faster | baseline |
| VRAM (FP16) | ~90 GB | ~140 GB |
| MMLU | 70.6% | 68.9% |
| HumanEval (code) | 40.2% | 29.9% |
| Multilingual | Strong | Moderate |
The efficiency difference is significant. For a serving deployment that processes 10M tokens per day, Mixtral spends roughly 1/6th the compute of LLaMA 2 70B for comparable or better quality.
30.3.4 Observed Router Behavior
Mistral's analysis of the trained router found that specialization is real but subtler than one-expert-per-domain:
Experts do not map to topics. Routing distributions across ArXiv text, mathematics, and code look surprisingly similar; no labels were assigned, and no clean domain experts emerged. The structure that does emerge is syntactic.
Position matters: "The" at the start of a sentence may route differently than "the" in the middle. The router is sensitive to syntactic context, not just the token identity.
Adjacent tokens repeat: consecutive tokens are often assigned the same experts, especially in deeper layers, a form of temporal locality that routing-aware serving systems can exploit.
30.4 DeepSeek-V3: Pushing MoE Further
30.4.1 The Cost Story
DeepSeek-V3 (December 2024) trained a 671B-parameter model for $5.5M. GPT-4's estimated training cost exceeds $100M. Same ballpark of capability, 18× cheaper. The gap comes from architecture efficiency: Multi-head Latent Attention (MLA) and fine-grained MoE.
30.4.2 Configuration
| Parameter | DeepSeek-V3 | Note |
|---|---|---|
| Total parameters | 671B | Very large total capacity |
| Active parameters | 37B | Per-token compute stays manageable |
| Routed experts | 256 | Fine-grained specialization |
| Shared expert | 1 | Always active, universal backbone |
| Active routed experts | 8 | Top-8 from 256 |
| Layers | 61 | Deeper than Mixtral |
| Hidden dimension | 7168 | |
| Context length | 128K | Extended after pre-training (YaRN) |
30.4.3 Multi-head Latent Attention (MLA)
For a 128K context window, the KV cache becomes the binding constraint:
Standard MHA KV cache:
Size ∝ num_heads × head_dim × seq_len × num_layers
At 128K: enormous, fills GPU memory
MLA KV cache:
Compress K and V into a low-dimensional latent vector c_KV
Cache c_KV instead of the full K and V
Decompress at attention time
MLA applies a low-rank projection:
Standard MHA path:
x → W_K → K (caches full K)
x → W_V → V (caches full V)
KV Cache ∝ num_heads × head_dim
MLA path:
x → W_DKV → c_KV (compress)
c_KV cached (much smaller)
c_KV → W_UK → K (decompress at compute time)
c_KV → W_UV → V
KV Cache ∝ latent_dim (latent_dim << num_heads × head_dim)
If latent_dim = 0.25 × (num_heads × head_dim), KV cache shrinks by 75%. At 128K context, this is the difference between fitting and not fitting on a single node.
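The cache savings follow directly from the per-token dimensionality. A back-of-the-envelope sketch in FP16, using illustrative dimensions rather than DeepSeek-V3's exact ones (the 512-dim latent and per-layer caching scheme here are simplifying assumptions):

```python
def kv_cache_gib(seq_len, layers, floats_per_token, bytes_per_float=2):
    """Cache size in GiB for `floats_per_token` cached per layer, per token (FP16)."""
    return seq_len * layers * floats_per_token * bytes_per_float / 2**30

# Illustrative: 128K context, 61 layers
mha = kv_cache_gib(128_000, 61, 2 * 4096)  # full K and V vectors per layer
mla = kv_cache_gib(128_000, 61, 512)       # one compressed latent per layer
```

With these numbers the full-MHA cache runs to over a hundred GiB while the latent cache is 16× smaller, which is the difference between fitting and not fitting.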
30.4.4 Fine-Grained MoE
Mixtral uses 8 large experts. DeepSeek-V3 uses 256 small experts. The distinction matters:
Coarse-grained (8 large experts, Top-2):
- Each expert is a full-size FFN
- 2/8 = 25% activation rate
- Routing decisions are coarse
Fine-grained (256 small experts, Top-8):
- Each expert is a fraction of a full FFN
- 8/256 ≈ 3% activation rate
- Much more precise routing
- Better load balancing over more experts
The total compute per token stays similar (8 small experts can equal 2 large experts in FLOPs), but the routing granularity is 32× finer. This means the model can make much more precise decisions about which specialist to use.
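One way to see the granularity gain is to count the distinct expert subsets the router can choose from:

```python
import math

coarse = math.comb(8, 2)    # 28 distinct expert pairs (Mixtral-style Top-2 of 8)
fine = math.comb(256, 8)    # ~4 × 10^14 distinct subsets (DeepSeek-style Top-8 of 256)
```

The routing vocabulary grows from 28 options per token to hundreds of trillions, at similar per-token FLOPs.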
30.4.5 Shared Expert
DeepSeek-V3 adds one expert that is always active for every token:
DeepSeek-V3 MoE:
x
├── Router → selects 8 from 256 routed experts
│ ↓
│ routed_output
│
└── Shared Expert (always active)
↓
shared_output
final_output = routed_output + shared_output
The shared expert handles universal patterns — common grammar, standard reasoning steps, frequent subwords — that every token needs regardless of its domain. The routed experts handle the differentiated, domain-specific computation.
This prevents the routed experts from wasting capacity on common-case patterns.
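In code, the combination is just an extra always-on branch added to the routed sum. A minimal sketch with toy scalar "experts" (the function names and the fixed router here are hypothetical, purely for illustration):

```python
def moe_with_shared(x, shared, experts, router, k=2):
    # router returns (indices, weights) for the top-k routed experts
    idx, w = router(x, k)
    routed = sum(wi * experts[i](x) for i, wi in zip(idx, w))
    return shared(x) + routed   # shared expert is always active

# Toy setup: experts scale their input; the router picks experts 1 and 3 evenly
experts = [lambda x, s=s: s * x for s in range(4)]
router = lambda x, k: ([1, 3], [0.5, 0.5])
out = moe_with_shared(2.0, lambda x: x, experts, router)
```

The shared branch contributes unconditionally; only the routed branch depends on the token.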
30.4.6 What $5.5M Bought
The training cost breakdown:
- FP8 mixed precision: halves memory bandwidth per operation
- MLA: larger batch sizes due to smaller KV cache
- Compute-communication overlap: computation proceeds during gradient all-reduce
- Expert parallelism: 256 experts shard cleanly across the 2048-GPU cluster
- 14.8 trillion training tokens: high-quality data, multi-stage curriculum
The combination achieves GPT-4-class benchmark results with roughly 1/18th the estimated training cost. The architectural choices compound.
30.5 MoE Challenges
30.5.1 Training Instability
MoE models are harder to train than dense models of similar compute:
Router collapse: the router concentrates traffic on a few experts, those experts receive all gradients, other experts stop being trained, and the problem compounds. Defense: load-balancing loss, initialization noise, and expert dropout.
Loss spikes: routing decisions change sharply between batches, causing gradient discontinuities. Defense: gradient clipping, smaller learning rate, larger batch size.
Expert starvation: some experts never receive enough tokens to train properly. Defense: capacity factors that force re-routing to less-used experts.
30.5.2 Load Imbalance in Practice
Even with the auxiliary loss, perfect balance is not guaranteed:
Realistic routing distribution (after training):
Expert 0: 18% ← moderately popular
Expert 1: 14%
Expert 2: 13%
Expert 3: 12%
Expert 4: 11%
Expert 5: 11%
Expert 6: 10%
Expert 7: 11%
vs ideal uniform:
Each expert: 12.5%
This is tolerable. But in a distributed setting, if Expert 0 is on GPU 0 and Expert 7 is on GPU 7, the imbalance translates directly to compute latency.
Capacity factor: a hard cap on how many tokens each expert can handle per batch. Tokens that overflow are dropped or redirected. Common values: 1.0–1.5.
capacity = (total_tokens / num_experts) * capacity_factor
# If expert_queue > capacity: excess tokens are handled by next-best expert
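A first-come-first-served sketch of that cap (the overflow policy, drop versus re-route, varies by implementation; this hypothetical helper just separates the two groups):

```python
def apply_capacity(expert_ids, num_experts, capacity_factor=1.25):
    """Split token positions into (kept, overflowed) under the per-expert cap."""
    capacity = int(len(expert_ids) / num_experts * capacity_factor)
    counts = [0] * num_experts
    kept, overflow = [], []
    for pos, expert in enumerate(expert_ids):
        if counts[expert] < capacity:
            counts[expert] += 1
            kept.append(pos)
        else:
            overflow.append(pos)  # dropped, or re-routed to the next-best expert
    return kept, overflow

# 8 tokens, 4 experts, capacity = int(8/4 * 1.25) = 2 per expert
kept, overflow = apply_capacity([0, 0, 0, 1, 1, 2, 3, 0], num_experts=4)
```

Here expert 0 is over-subscribed (four tokens for a capacity of two), so two of its tokens overflow.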
30.5.3 All-to-All Communication
In distributed training, experts are sharded across GPUs. Token routing crosses GPU boundaries:
Setup: 4 GPUs, 2 experts each
GPU 0: Expert 0, 1
GPU 1: Expert 2, 3
GPU 2: Expert 4, 5
GPU 3: Expert 6, 7
A batch on GPU 0 may route tokens to Expert 5 (GPU 2) and Expert 7 (GPU 3):
GPU 0 → GPU 2: send token activations
GPU 2 → GPU 0: return computed results
(All GPUs do this simultaneously → all-to-all pattern)
All-to-all has real communication cost and occurs twice per MoE layer: once to dispatch token activations to their experts, once to gather the results back. This can dominate wall-clock time if not handled carefully.
Mitigations: compute-communication overlap, group expert placement to minimize inter-node traffic, and batching tokens before dispatch.
30.5.4 Serving Complexity
Dynamic batching is harder: in a dense model, all tokens in a batch follow the same compute path. In an MoE model, different tokens activate different experts. Batching strategies that work for dense models may fragment badly under MoE routing.
Memory profile: all expert weights must reside in memory even though only 2–8 experts are active per token. Mixtral requires ~90 GB VRAM for FP16 inference despite only 12.9B active parameters. The "light compute" benefit does not translate to proportionally reduced VRAM.
30.6 MoE Implementation
30.6.1 Core MoE Layer
import torch
import torch.nn as nn
import torch.nn.functional as F
class Expert(nn.Module):
def __init__(self, hidden_size: int, intermediate_size: int):
super().__init__()
self.w1 = nn.Linear(hidden_size, intermediate_size, bias=False)
self.w2 = nn.Linear(intermediate_size, hidden_size, bias=False)
self.w3 = nn.Linear(hidden_size, intermediate_size, bias=False)
self.act = nn.SiLU()
def forward(self, x):
return self.w2(self.act(self.w1(x)) * self.w3(x))
class MoELayer(nn.Module):
def __init__(
self,
hidden_size: int = 4096,
intermediate_size: int = 14336,
num_experts: int = 8,
top_k: int = 2,
aux_loss_coef: float = 0.01,
):
super().__init__()
self.num_experts = num_experts
self.top_k = top_k
self.aux_loss_coef = aux_loss_coef
self.router = nn.Linear(hidden_size, num_experts, bias=False)
self.experts = nn.ModuleList([
Expert(hidden_size, intermediate_size) for _ in range(num_experts)
])
def forward(self, x):
batch, seq_len, hidden_size = x.shape
# Router: score and select
router_logits = self.router(x)
router_probs = F.softmax(router_logits, dim=-1)
top_k_probs, top_k_indices = torch.topk(router_probs, self.top_k, dim=-1)
top_k_weights = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
# Dispatch: send tokens to selected experts
x_flat = x.view(-1, hidden_size)
output = torch.zeros_like(x_flat)
for expert_idx in range(self.num_experts):
# Which tokens route to this expert?
expert_mask = (top_k_indices == expert_idx).any(dim=-1).view(-1)
if not expert_mask.any():
continue
expert_input = x_flat[expert_mask]
expert_output = self.experts[expert_idx](expert_input)
# Weight and accumulate
weights = torch.where(
top_k_indices == expert_idx, top_k_weights,
torch.zeros_like(top_k_weights),
).sum(dim=-1).view(-1)[expert_mask]
output[expert_mask] += expert_output * weights.unsqueeze(-1)
output = output.view(batch, seq_len, hidden_size)
aux_loss = self._load_balance_loss(router_probs, top_k_indices)
return output, aux_loss
def _load_balance_loss(self, router_probs, expert_indices):
expert_mask = F.one_hot(expert_indices, self.num_experts).float()
expert_fraction = expert_mask.sum(dim=2).mean(dim=(0, 1))
router_fraction = router_probs.mean(dim=(0, 1))
aux_loss = self.num_experts * (expert_fraction * router_fraction).sum()
return aux_loss * self.aux_loss_coef
30.6.2 Noisy Router (Training Stability)
Adding noise during training encourages the router to explore all experts early in training:
class NoisyTopKRouter(nn.Module):
def __init__(self, hidden_size, num_experts, top_k, noise_std=0.1):
super().__init__()
self.top_k = top_k
self.noise_std = noise_std
self.gate = nn.Linear(hidden_size, num_experts, bias=False)
def forward(self, x, training=True):
logits = self.gate(x)
if training and self.noise_std > 0:
logits = logits + torch.randn_like(logits) * self.noise_std
probs = F.softmax(logits, dim=-1)
top_k_probs, top_k_indices = torch.topk(probs, self.top_k, dim=-1)
weights = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
return weights, top_k_indices, probs
30.6.3 Loading Mixtral with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto", # distributes across available GPUs
load_in_4bit=True, # reduces VRAM to ~25 GB for 4-bit
)
# Inspect the MoE structure
moe_layer = model.model.layers[0].block_sparse_moe
print(f"Router: {moe_layer.gate}")
# Router: Linear(in_features=4096, out_features=8, bias=False)
print(f"Expert count: {len(moe_layer.experts)}")
# Expert count: 8
print(f"Expert 0: {moe_layer.experts[0]}")
# MixtralBlockSparseTop2MLP(
# (w1): Linear(4096 → 14336, bias=False)
# (w2): Linear(14336 → 4096, bias=False)
# (w3): Linear(4096 → 14336, bias=False)
# )
30.7 MoE vs Dense: When to Use Which
30.7.1 Parameter and Activation Counts
Model Total params Active params Activation %
────────────────────────────────────────────────────────────
LLaMA 2 70B 70B 70B 100%
Mixtral 8x7B 46.7B 12.9B 27.6%
DeepSeek-V3 671B 37B 5.5%
GPT-4 (rumored) ~1.8T ~110B ~6%
As total parameter count grows, the efficient frontier increasingly favors MoE.
30.7.2 Training Cost
| Model | Estimated training cost | Tokens | Hardware |
|---|---|---|---|
| LLaMA 2 70B | ~$5M | 2T | A100 |
| Mixtral 8x7B | ~$2M (estimated) | undisclosed | undisclosed |
| DeepSeek-V3 | $5.5M | 14.8T | H800 (2048 GPUs) |
| GPT-4 | >$100M (rumored) | 13T+ | A100/H100 |
MoE achieves training efficiency through two mechanisms: fewer FLOPs per token (only active experts compute), and better use of the compute budget (more parameters = more capacity for the same compute).
30.7.3 Inference Efficiency
| Metric | Dense 70B | MoE 8x7B (12.9B active) |
|---|---|---|
| Time to first token | baseline | ~0.2× |
| Throughput | baseline | ~3–4× |
| VRAM (FP16) | ~140 GB | ~90 GB |
| Tokens/second | baseline | ~6× |
The throughput advantage is real. The latency advantage is real as well, though routing and dispatch overhead reduce it in practice. The VRAM advantage is also real but does not scale with active-parameter count: you must load all experts.
30.7.4 When to Choose Dense
- Sequence lengths under 4K tokens
- Memory-constrained deployment (inference VRAM budget is the binding constraint)
- Single-task fine-tuning (MoE's multi-domain knowledge is wasted)
- Simpler serving stack is worth more than the efficiency gain
30.7.5 When to Choose MoE
- High throughput requirements (API serving, search augmentation)
- Multilingual or multi-domain tasks
- Available VRAM exceeds what the dense model needs
- Training budget is constrained but you want more total capacity
30.8 Chapter Summary
30.8.1 Key Concepts
| Concept | Meaning |
|---|---|
| MoE | Mixture of Experts — sparse activation for efficient large models |
| Sparse activation | Only a subset of parameters compute for each token |
| Router | Linear layer that assigns tokens to experts via Top-K selection |
| Expert | An independent FFN network with its own parameters |
| Top-K | Select K highest-scoring experts per token (typically K=2) |
| Load balancing | Auxiliary loss that encourages uniform expert utilization |
| MLA | Multi-head Latent Attention — compresses KV cache via low-rank projection |
| Fine-grained MoE | Many small experts instead of few large ones; lower activation rate |
| Shared expert | One expert always active; handles universal token patterns |
30.8.2 Key Numbers
Mixtral 8x7B:
Total params: 46.7B | Active: 12.9B (27.6%)
Experts: 8 | Active per token: 2
Result: matches LLaMA 2 70B at ~6× faster inference
DeepSeek-V3:
Total params: 671B | Active: 37B (5.5%)
Experts: 256 + 1 | Active per token: 8 + 1
Training cost: $5.5M (GPT-4: >$100M)
30.8.3 Core Formulas
Router computation:
router_logits = Linear(x) # hidden_size → num_experts
router_probs = softmax(router_logits)
top_k_weights, top_k_indices = topk(router_probs, k)
MoE output:
y = Σ_{i ∈ TopK(x)} w_i · E_i(x)
where w_i are the renormalized Top-K router weights and E_i is expert i's FFN.
Load-balancing loss:
L_aux = N · Σ_{i=1}^{N} f_i · P_i
where f_i is expert i's selection frequency and P_i is its mean routing probability.
30.8.4 My Take
MoE is the clearest example of "frontier AI is systems engineering." The algorithm — route each token to K experts, train with a load-balancing penalty — is not complicated. What is hard is making it work at scale: routing tokens across hundreds of GPUs without all-to-all communication becoming the bottleneck, debugging router collapse in a 671B model, and building a serving stack that handles the dynamic batching pathology.
DeepSeek-V3 is important not because it is cheaper per inference, but because the $5.5M training figure proves that frontier capability is no longer exclusively a function of training budget. Architectural efficiency compounds.
Chapter Checklist
After this chapter, you should be able to:
- Explain sparse activation and why it decouples total capacity from per-token compute.
- Describe the MoE layer structure: router, Top-K gating, and expert FFNs.
- Explain why K=2 is the standard choice for Top-K selection.
- Explain load-balancing loss and what happens without it.
- Calculate active and total parameter counts for Mixtral 8x7B.
- Explain MLA and why it matters for long-context MoE models.
- Name at least two MoE failure modes and their mitigations.
See You in the Next Chapter
MoE is about spending compute wisely during training and inference. There is another dimension entirely: spending more compute at inference time to get better answers. Chapter 31 explains the reasoning-model revolution — from GPT-4o's 12% on AIME 2024 to o3's 96.7%, and the open-source story that DeepSeek-R1 made possible.