One-sentence summary: MoE decouples "how much the model knows" from "how much compute you spend per token" — and that gap is where all the interesting engineering happens.
30.1 The Core Idea
30.1.1 A Counterintuitive Result
When Mistral AI released Mixtral 8x7B in December 2023, the name confused people. Is it 8 models? Is it 56B parameters? What does "8x7B" mean?
The numbers are:
- Total parameters: 46.7B
- Active parameters per token: 12.9B
The model stores the knowledge of ~47B parameters but spends the compute of ~13B on each token. On most benchmarks, it matches or outperforms LLaMA 2 70B, which activates all 70B of its parameters for every token.
Mixtral 8x7B vs LLaMA 2 70B:
| | Mixtral 8x7B | LLaMA 2 70B |
|---|---|---|
| Total parameters | 46.7B | 70B |
| Active parameters | 12.9B | 70B |
| Inference speed | ~6× faster | baseline |
| MMLU | 70.6% | 68.9% |
This result is not magic. It follows from a simple architectural decision: replace the dense FFN with a sparse mixture of experts.
30.1.2 Dense vs Sparse Activation
In a standard Transformer, every token passes through the full FFN. If the FFN has 4B parameters, those 4B parameters run for every single token — whether the token is a period at the end of a sentence or a complex mathematical term.
Dense model:
Input token → [All parameters participate] → output
70B params → 70B activation every token
MoE model:
Input token → [Router selects 2 of 8 experts] → [2 experts compute] → output
46.7B params → 12.9B activation every token
The insight embedded in this: not all knowledge is needed simultaneously. A token that is part of a Python function call does not need the prose-writing specialists active. A token in an English narrative does not need the linear-algebra specialists active.
MoE operationalizes this intuition with a routing mechanism.
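Those activation numbers are just arithmetic. A toy check (the figures come straight from the comparison above):

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Fraction of the model's parameters that run for each token."""
    return active_params / total_params

# Dense: every parameter participates in every token
dense = active_fraction(70e9, 70e9)      # 1.0
# MoE: only the routed experts (plus attention etc.) participate
moe = active_fraction(46.7e9, 12.9e9)    # ≈ 0.276
```

Roughly a quarter of the stored parameters do work on any given token; the other three quarters sit in memory as latent capacity.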
30.1.3 The Routing Analogy
Think of a specialized engineering team. When you file a PR:
- A security-focused reviewer looks at authentication changes
- A performance reviewer looks at hot-path code
- A documentation reviewer looks at API changes
Not every reviewer reads every PR. The team lead (router) reads the PR description and assigns the right reviewers.
The routing team covers more expertise than any single generalist reviewer, but the cost per PR is bounded by how many reviewers actually participate.
MoE is the same: N experts, but only K are activated per token. The token gets specialist treatment; the compute stays bounded.
30.1.4 Brief History
MoE was proposed in 1991 by Jacobs et al. It took three decades to reach the frontier:
Timeline:
1991 Jacobs et al. — original MoE concept
2017 Shazeer et al. — Sparsely-Gated MoE, applied to NLP at scale
2021 Google Switch Transformer — 1.6T parameter MoE model
2022 Google GLaM — 1.2T parameters, competitive with GPT-3
2023 Mixtral 8x7B — open-source MoE, practical deployment
2024 DeepSeek-V3 — 671B total, 37B active, $5.5M training cost
The 2021–2024 acceleration is driven by two forces: inference cost becoming the dominant expense in deployed systems, and hardware becoming fast enough to make the routing overhead negligible.
30.2 MoE Architecture
30.2.1 Where MoE Lives in the Stack
The change from Transformer to MoE is surgical. Only one component changes per block:
Standard Transformer Block:
Input
↓
Self-Attention
↓
FFN (dense) ← this becomes MoE
↓
Output
MoE Transformer Block:
Input
↓
Self-Attention
↓
MoE Layer ← router + N expert FFNs
↓
Output
Everything else — the Attention mechanism, residual connections, LayerNorm, positional encoding — stays the same.
30.2.2 Inside the MoE Layer
An MoE layer has two components:
1. Router: a small linear layer that maps the token's hidden representation to a probability distribution over experts.
2. Expert networks: N independent FFNs, each with its own weights.
MoE Layer internals:
x (hidden_size)
↓
┌──────────────┐
│ Router │ Linear(hidden_size → num_experts) + Softmax
└──────┬───────┘
↓
┌──────────────┐
│ Top-K Gate │ Keep only the K highest-probability experts
└──────┬───────┘
↓
Selected K experts receive x:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ E0 │ E1 │ E2 │ E3 │ E4 │ E5 │ E6 │ E7 │ (8 experts total)
└────┴────┴────┴────┴────┴────┴────┴────┘
↓ ↓
Only selected experts compute.
Output = weighted sum of selected expert outputs
30.2.3 Router Mechanics
The router is a single linear layer followed by softmax:
# Input: x, shape = (batch_size, seq_len, hidden_size)
# Step 1: score each expert
router_logits = Linear(hidden_size, num_experts)(x)
# shape: (batch_size, seq_len, num_experts)
# Step 2: softmax over experts
router_probs = softmax(router_logits, dim=-1)
# e.g., for one token: [0.40, 0.30, 0.10, 0.05, 0.05, 0.03, 0.04, 0.03]
# Step 3: select Top-K
top_k_probs, top_k_indices = topk(router_probs, k=2)
# top_k_probs: [0.40, 0.30]
# top_k_indices: [0, 1] → Expert 0 and Expert 1
# Step 4: renormalize weights
weights = top_k_probs / top_k_probs.sum()
# weights: [0.57, 0.43]
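The four steps can be run end to end without any framework. This is a dependency-free sketch of the same logic (the example logits are made up for illustration):

```python
import math

def route_token(logits, k=2):
    # Step 2: softmax over expert scores (numerically stabilized)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Step 3: indices of the K highest-probability experts
    top_k = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    # Step 4: renormalize the selected probabilities so they sum to 1
    selected = sum(probs[i] for i in top_k)
    weights = [probs[i] / selected for i in top_k]
    return top_k, weights

indices, weights = route_token([2.0, 1.5, 0.3, -0.2, 0.1, -0.5, 0.0, -1.0])
# indices: the two highest-scoring experts; weights always sum to 1
```

In a real model this runs per token, batched as a single matmul plus `torch.topk`.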
The router learns to recognize token patterns and map them to experts. The specialization is not programmed; it emerges from training. Notably, Mistral's published routing analysis found little specialization by topic: ArXiv text, mathematics, and code route surprisingly similarly. The structure that does emerge is syntactic and positional, e.g., indentation tokens in Python code are consistently assigned to the same experts.
30.2.4 Expert Networks
Each expert is a standard FFN:
class Expert(nn.Module):
def __init__(self, hidden_size: int, intermediate_size: int):
super().__init__()
self.w1 = nn.Linear(hidden_size, intermediate_size, bias=False)
self.w2 = nn.Linear(intermediate_size, hidden_size, bias=False)
self.w3 = nn.Linear(hidden_size, intermediate_size, bias=False)
self.act = nn.SiLU()
def forward(self, x):
# SwiGLU gating: element-wise gate from w3 controls w1 output
return self.w2(self.act(self.w1(x)) * self.w3(x))
Each expert is independent — separate weights, separate gradients. Eight experts means eight times the FFN parameters. That is where the extra capacity comes from.
30.2.5 Top-K Selection: Why K=2?
Top-1 (one expert per token):
- Minimum compute
- Gradient flows to only one expert → training instability
- No redundancy if routing is wrong
Top-2 (two experts per token):
- Two experts can complement each other
- Gradient reaches two experts → more stable
- Standard practice for Mixtral and most production systems
Top-K with K > 2:
- Each additional expert reduces sparsity
- When K = N, the model degenerates to a dense FFN
- Compute grows linearly with K
The empirical winner is K = 2. It hits the sweet spot of stability, redundancy, and efficiency.
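The linear-in-K compute claim is easy to make concrete. A rough FLOPs sketch for a SwiGLU-style MoE FFN (dimensions taken from the Mixtral config discussed below; the 2-FLOPs-per-multiply-add convention is an assumption of the estimate):

```python
def moe_ffn_flops_per_token(hidden: int, inter: int, k: int) -> int:
    # A SwiGLU expert multiplies by three weight matrices (w1, w2, w3);
    # each matmul costs about 2 * hidden * inter FLOPs per token.
    per_expert = 2 * 3 * hidden * inter
    return k * per_expert

top2 = moe_ffn_flops_per_token(4096, 14336, k=2)   # Mixtral-like Top-2
dense = moe_ffn_flops_per_token(4096, 14336, k=8)  # K = N: dense-FFN cost
```

Doubling K doubles the FFN compute; at K = N the sparsity advantage vanishes entirely.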
30.2.6 Load Balancing
Left unconstrained, the router will collapse. It finds a small set of "safe" experts and routes almost everything there:
Collapsed routing (pathological case):
Expert 0: 85% of tokens ← bottleneck, gets all gradients
Expert 1: 9%
Expert 2: 3%
...
Expert 7: 0.1% ← nearly unused, parameters wasted
This creates two problems: Expert 0 becomes a computational bottleneck, and Experts 2–7 are barely trained.
Solution: auxiliary load-balancing loss
Add a penalty that fires when routing is uneven:
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_indices, num_experts):
    # router_probs: (batch, seq, num_experts); expert_indices: (batch, seq, top_k)
    # expert_fraction: how often each expert is selected (sum over the top_k slots)
    expert_mask = F.one_hot(expert_indices, num_experts).float()
    expert_fraction = expert_mask.sum(dim=2).mean(dim=(0, 1))  # per-expert selection rate
    # router_fraction: the average routing probability per expert
    router_fraction = router_probs.mean(dim=(0, 1))
    # penalty = num_experts × sum(fraction × probability)
    # minimal when the distribution is uniform
    aux_loss = num_experts * (expert_fraction * router_fraction).sum()
    return aux_loss
The intuition: if Expert 0 gets selected 85% of the time (high expert_fraction) AND the router assigns it high probability (high router_fraction), the product is large and the penalty is strong. This pushes the router toward uniform distribution.
The loss is scaled by a coefficient (typically aux_loss_coef = 0.01) and added to the main training loss.
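A quick numeric check of that intuition, with illustrative frequencies (for simplicity the mean router probabilities are taken equal to the selection frequencies): the penalty bottoms out at 1.0 for a uniform distribution and grows as routing concentrates.

```python
def aux_loss(selection_freq, mean_probs):
    # num_experts * sum(f_i * P_i); minimized when both are uniform
    n = len(selection_freq)
    return n * sum(f * p for f, p in zip(selection_freq, mean_probs))

uniform = [1 / 8] * 8
collapsed = [0.85, 0.09, 0.03, 0.01, 0.005, 0.005, 0.005, 0.005]

balanced = aux_loss(uniform, uniform)      # 1.0, the minimum
skewed = aux_loss(collapsed, collapsed)    # ≈ 5.85, strongly penalized
```

The collapsed distribution from the pathological example above pays nearly a 6× penalty, which the 0.01 coefficient then scales before adding to the main loss.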
30.3 Mixtral 8x7B
30.3.1 Configuration
| Parameter | Value | Note |
|---|---|---|
| Total parameters | 46.7B | All expert weights included |
| Active parameters | 12.9B | Only 2/8 experts active per token |
| Experts per layer | 8 | Each a full SwiGLU FFN |
| Active experts | 2 | Top-2 routing |
| Hidden dimension | 4096 | Same as LLaMA 2 |
| Layers | 32 | 32 Transformer blocks |
| Attention | GQA, 32 Q heads / 8 KV heads | Grouped-query for efficiency |
| Context length | 32K | With RoPE positional encoding |
30.3.2 Parameter Count Derivation
Where do the 46.7B parameters come from?
Embedding layer:
32000 × 4096 ≈ 131M
Per layer:
Self-Attention (GQA):
Q: 4096 × 4096 = 16.8M
K: 4096 × 1024 = 4.2M (8 KV heads × 128 head_dim)
V: 4096 × 1024 = 4.2M
O: 4096 × 4096 = 16.8M
Subtotal: ≈ 42M
MoE layer (8 experts, SwiGLU):
Each expert:
w1: 4096 × 14336 = 58.7M
w2: 14336 × 4096 = 58.7M
w3: 4096 × 14336 = 58.7M (gate)
Subtotal per expert: ≈ 176M
8 experts: ≈ 1,408M = 1.4B
Router: 4096 × 8 ≈ 33K (negligible)
Layer total: 42M + 1,408M ≈ 1.45B
32 layers: 1.45B × 32 ≈ 46.4B
With embedding and LM head: ≈ 46.7B total
Active parameter calculation:
Per token, only 2/8 experts run:
Non-MoE parts (attention × 32 layers): ≈ 1.3B
MoE parts (2/8 experts × 32 layers): ≈ 11.2B
Embedding: ≈ 0.4B
Total active: ≈ 12.9B
This is why Mixtral's per-token compute is comparable to that of a dense 13B model while it retains the knowledge capacity of a 47B model.
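The derivation above can be checked mechanically. This sketch reproduces the per-layer arithmetic (embedding and LM-head terms, roughly another 0.3B, are left out as in the layer-by-layer breakdown):

```python
HIDDEN, INTER, LAYERS = 4096, 14336, 32
N_EXPERTS, TOP_K, KV_DIM = 8, 2, 1024  # KV_DIM: 8 KV heads × 128 head_dim

attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # Q, O projections + K, V (GQA)
expert = 3 * HIDDEN * INTER                         # w1, w2, w3 (SwiGLU)

total = LAYERS * (attn + N_EXPERTS * expert)        # ≈ 46.4B before embeddings
active = LAYERS * (attn + TOP_K * expert)           # ≈ 12.6B before embeddings
```

Adding the embedding and LM-head parameters closes the gap to the quoted 46.7B total and 12.9B active.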
30.3.3 Mixtral vs LLaMA 2 70B
| Metric | Mixtral 8x7B | LLaMA 2 70B |
|---|---|---|
| Total parameters | 46.7B | 70B |
| Active parameters | 12.9B | 70B |
| Inference FLOPs | ~13B equivalent | 70B |
| Tokens/second | ~6× faster | baseline |
| VRAM (FP16) | ~90 GB | ~140 GB |
| MMLU | 70.6% | 68.9% |
| HumanEval (code) | 40.2% | 29.9% |
| Multilingual | Strong | Moderate |
The efficiency difference is significant. For a serving deployment that processes 10M tokens per day, Mixtral spends roughly 1/6th the compute of LLaMA 2 70B for comparable or better quality.
30.3.4 Observed Router Behavior
Mistral's analysis of the trained router found that specialization is real but subtler than one-expert-per-domain:
Experts do not map to topics. Routing distributions across ArXiv text, mathematics, and code look surprisingly similar; no labels were assigned, and no clean domain experts emerged. The structure that does emerge is syntactic.
Position matters: "The" at the start of a sentence may route differently than "the" in the middle. The router is sensitive to syntactic context, not just the token identity.
Adjacent tokens repeat: consecutive tokens are often assigned the same experts, especially in deeper layers, a form of temporal locality that routing-aware serving systems can exploit.
30.4 DeepSeek-V3: Pushing MoE Further
30.4.1 The Cost Story
DeepSeek-V3 (December 2024) trained a 671B-parameter model for $5.5M. GPT-4's estimated training cost exceeds $100M. Same ballpark of capability, 18× cheaper. The gap comes from architecture efficiency: Multi-head Latent Attention (MLA) and fine-grained MoE.
30.4.2 Configuration
| Parameter | DeepSeek-V3 | Note |
|---|---|---|
| Total parameters | 671B | Very large total capacity |
| Active parameters | 37B | Per-token compute stays manageable |
| Routed experts | 256 | Fine-grained specialization |
| Shared expert | 1 | Always active, universal backbone |
| Active routed experts | 8 | Top-8 from 256 |
| Layers | 61 | Deeper than Mixtral |
| Hidden dimension | 7168 | |
| Context length | 128K | Extended after pre-training (YaRN) |
30.4.3 Multi-head Latent Attention (MLA)
For a 128K context window, the KV cache becomes the binding constraint:
Standard MHA KV cache:
Size ∝ num_heads × head_dim × seq_len × num_layers
At 128K: enormous, fills GPU memory
MLA KV cache:
Compress K and V into a low-dimensional latent vector c_KV
Cache c_KV instead of the full K and V
Decompress at attention time
MLA applies a low-rank projection:
Standard MHA path:
x → W_K → K (caches full K)
x → W_V → V (caches full V)
KV Cache ∝ num_heads × head_dim
MLA path:
x → W_DKV → c_KV (compress)
c_KV cached (much smaller)
c_KV → W_UK → K (decompress at compute time)
c_KV → W_UV → V
KV Cache ∝ latent_dim (latent_dim << num_heads × head_dim)
If latent_dim = 0.25 × (num_heads × head_dim), KV cache shrinks by 75%. At 128K context, this is the difference between fitting and not fitting on a single node.
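The cache savings follow directly from the per-token dimensionality. A back-of-the-envelope sketch in FP16, using illustrative dimensions rather than DeepSeek-V3's exact ones (the 512-dim latent and per-layer caching scheme here are simplifying assumptions):

```python
def kv_cache_gib(seq_len, layers, floats_per_token, bytes_per_float=2):
    """Cache size in GiB for `floats_per_token` cached per layer, per token (FP16)."""
    return seq_len * layers * floats_per_token * bytes_per_float / 2**30

# Illustrative: 128K context, 61 layers
mha = kv_cache_gib(128_000, 61, 2 * 4096)  # full K and V vectors per layer
mla = kv_cache_gib(128_000, 61, 512)       # one compressed latent per layer
```

With these numbers the full-MHA cache runs to over a hundred GiB while the latent cache is 16× smaller, which is the difference between fitting and not fitting.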
30.4.4 Fine-Grained MoE
Mixtral uses 8 large experts. DeepSeek-V3 uses 256 small experts. The distinction matters:
Coarse-grained (8 large experts, Top-2):
- Each expert is a full-size FFN
- 2/8 = 25% activation rate
- Routing decisions are coarse
Fine-grained (256 small experts, Top-8):
- Each expert is a fraction of a full FFN
- 8/256 ≈ 3% activation rate
- Much more precise routing
- Better load balancing over more experts
The total compute per token stays similar (8 small experts can equal 2 large experts in FLOPs), but the routing granularity is 32× finer. This means the model can make much more precise decisions about which specialist to use.
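One way to see the granularity gain is to count the distinct expert subsets the router can choose from:

```python
import math

coarse = math.comb(8, 2)    # 28 distinct expert pairs (Mixtral-style Top-2 of 8)
fine = math.comb(256, 8)    # ~4 × 10^14 distinct subsets (DeepSeek-style Top-8 of 256)
```

The routing vocabulary grows from 28 options per token to hundreds of trillions, at similar per-token FLOPs.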
30.4.5 Shared Expert
DeepSeek-V3 adds one expert that is always active for every token:
DeepSeek-V3 MoE:
x
├── Router → selects 8 from 256 routed experts
│ ↓
│ routed_output
│
└── Shared Expert (always active)
↓
shared_output
final_output = routed_output + shared_output
The shared expert handles universal patterns — common grammar, standard reasoning steps, frequent subwords — that every token needs regardless of its domain. The routed experts handle the differentiated, domain-specific computation.
This prevents the routed experts from wasting capacity on common-case patterns.
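In code, the combination is just an extra always-on branch added to the routed sum. A minimal sketch with toy scalar "experts" (the function names and the fixed router here are hypothetical, purely for illustration):

```python
def moe_with_shared(x, shared, experts, router, k=2):
    # router returns (indices, weights) for the top-k routed experts
    idx, w = router(x, k)
    routed = sum(wi * experts[i](x) for i, wi in zip(idx, w))
    return shared(x) + routed   # shared expert is always active

# Toy setup: experts scale their input; the router picks experts 1 and 3 evenly
experts = [lambda x, s=s: s * x for s in range(4)]
router = lambda x, k: ([1, 3], [0.5, 0.5])
out = moe_with_shared(2.0, lambda x: x, experts, router)
```

The shared branch contributes unconditionally; only the routed branch depends on the token.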
30.4.6 What $5.5M Bought
The training cost breakdown:
- FP8 mixed precision: halves memory bandwidth per operation
- MLA: larger batch sizes due to smaller KV cache
- Compute-communication overlap: computation proceeds during gradient all-reduce
- Expert parallelism: 256 experts shard cleanly across the 2048-GPU cluster
- 14.8 trillion training tokens: high-quality data, multi-stage curriculum
The combination achieves GPT-4-class benchmark results with roughly 1/18th the estimated training cost. The architectural choices compound.
30.5 MoE Challenges
30.5.1 Training Instability
MoE models are harder to train than dense models of similar compute:
Router collapse: the router concentrates traffic on a few experts, those experts receive all gradients, other experts stop being trained, and the problem compounds. Defense: load-balancing loss, initialization noise, and expert dropout.
Loss spikes: routing decisions change sharply between batches, causing gradient discontinuities. Defense: gradient clipping, smaller learning rate, larger batch size.
Expert starvation: some experts never receive enough tokens to train properly. Defense: capacity factors that force re-routing to less-used experts.
30.5.2 Load Imbalance in Practice
Even with the auxiliary loss, perfect balance is not guaranteed:
Realistic routing distribution (after training):
Expert 0: 18% ← moderately popular
Expert 1: 14%
Expert 2: 13%
Expert 3: 12%
Expert 4: 11%
Expert 5: 11%
Expert 6: 10%
Expert 7: 11%
vs ideal uniform:
Each expert: 12.5%
This is tolerable. But in a distributed setting, if Expert 0 is on GPU 0 and Expert 7 is on GPU 7, the imbalance translates directly to compute latency.
Capacity factor: a hard cap on how many tokens each expert can handle per batch. Tokens that overflow are dropped or redirected. Common values: 1.0–1.5.
capacity = (total_tokens / num_experts) * capacity_factor
# If expert_queue > capacity: excess tokens are handled by next-best expert
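A first-come-first-served sketch of that cap (the overflow policy, drop versus re-route, varies by implementation; this hypothetical helper just separates the two groups):

```python
def apply_capacity(expert_ids, num_experts, capacity_factor=1.25):
    """Split token positions into (kept, overflowed) under the per-expert cap."""
    capacity = int(len(expert_ids) / num_experts * capacity_factor)
    counts = [0] * num_experts
    kept, overflow = [], []
    for pos, expert in enumerate(expert_ids):
        if counts[expert] < capacity:
            counts[expert] += 1
            kept.append(pos)
        else:
            overflow.append(pos)  # dropped, or re-routed to the next-best expert
    return kept, overflow

# 8 tokens, 4 experts, capacity = int(8/4 * 1.25) = 2 per expert
kept, overflow = apply_capacity([0, 0, 0, 1, 1, 2, 3, 0], num_experts=4)
```

Here expert 0 is over-subscribed (four tokens for a capacity of two), so two of its tokens overflow.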
30.5.3 All-to-All Communication
In distributed training, experts are sharded across GPUs. Token routing crosses GPU boundaries:
Setup: 4 GPUs, 2 experts each
GPU 0: Expert 0, 1
GPU 1: Expert 2, 3
GPU 2: Expert 4, 5
GPU 3: Expert 6, 7
A batch on GPU 0 may route tokens to Expert 5 (GPU 2) and Expert 7 (GPU 3):
GPU 0 → GPU 2: send token activations
GPU 2 → GPU 0: return computed results
(All GPUs do this simultaneously → all-to-all pattern)
All-to-all has real communication cost and occurs twice per MoE layer: once to dispatch token activations to their experts, once to gather the results back. This can dominate wall-clock time if not handled carefully.
Mitigations: compute-communication overlap, group expert placement to minimize inter-node traffic, and batching tokens before dispatch.
30.5.4 Serving Complexity
Dynamic batching is harder: in a dense model, all tokens in a batch follow the same compute path. In an MoE model, different tokens activate different experts. Batching strategies that work for dense models may fragment badly under MoE routing.
Memory profile: all expert weights must reside in memory even though only 2–8 experts are active per token. Mixtral requires ~90 GB VRAM for FP16 inference despite only 12.9B active parameters. The "light compute" benefit does not translate to proportionally reduced VRAM.
30.6 MoE Implementation
30.6.1 Core MoE Layer
import torch
import torch.nn as nn
import torch.nn.functional as F
class Expert(nn.Module):
def __init__(self, hidden_size: int, intermediate_size: int):
super().__init__()
self.w1 = nn.Linear(hidden_size, intermediate_size, bias=False)
self.w2 = nn.Linear(intermediate_size, hidden_size, bias=False)
self.w3 = nn.Linear(hidden_size, intermediate_size, bias=False)
self.act = nn.SiLU()
def forward(self, x):
return self.w2(self.act(self.w1(x)) * self.w3(x))
class MoELayer(nn.Module):
def __init__(
self,
hidden_size: int = 4096,
intermediate_size: int = 14336,
num_experts: int = 8,
top_k: int = 2,
aux_loss_coef: float = 0.01,
):
super().__init__()
self.num_experts = num_experts
self.top_k = top_k
self.aux_loss_coef = aux_loss_coef
self.router = nn.Linear(hidden_size, num_experts, bias=False)
self.experts = nn.ModuleList([
Expert(hidden_size, intermediate_size) for _ in range(num_experts)
])
def forward(self, x):
batch, seq_len, hidden_size = x.shape
# Router: score and select
router_logits = self.router(x)
router_probs = F.softmax(router_logits, dim=-1)
top_k_probs, top_k_indices = torch.topk(router_probs, self.top_k, dim=-1)
top_k_weights = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
# Dispatch: send tokens to selected experts
x_flat = x.view(-1, hidden_size)
output = torch.zeros_like(x_flat)
for expert_idx in range(self.num_experts):
# Which tokens route to this expert?
expert_mask = (top_k_indices == expert_idx).any(dim=-1).view(-1)
if not expert_mask.any():
continue
expert_input = x_flat[expert_mask]
expert_output = self.experts[expert_idx](expert_input)
# Weight and accumulate
weights = torch.where(
top_k_indices == expert_idx, top_k_weights,
torch.zeros_like(top_k_weights),
).sum(dim=-1).view(-1)[expert_mask]
output[expert_mask] += expert_output * weights.unsqueeze(-1)
output = output.view(batch, seq_len, hidden_size)
aux_loss = self._load_balance_loss(router_probs, top_k_indices)
return output, aux_loss
def _load_balance_loss(self, router_probs, expert_indices):
expert_mask = F.one_hot(expert_indices, self.num_experts).float()
expert_fraction = expert_mask.sum(dim=2).mean(dim=(0, 1))
router_fraction = router_probs.mean(dim=(0, 1))
aux_loss = self.num_experts * (expert_fraction * router_fraction).sum()
return aux_loss * self.aux_loss_coef
30.6.2 Noisy Router (Training Stability)
Adding noise during training encourages the router to explore all experts early in training:
class NoisyTopKRouter(nn.Module):
def __init__(self, hidden_size, num_experts, top_k, noise_std=0.1):
super().__init__()
self.top_k = top_k
self.noise_std = noise_std
self.gate = nn.Linear(hidden_size, num_experts, bias=False)
def forward(self, x, training=True):
logits = self.gate(x)
if training and self.noise_std > 0:
logits = logits + torch.randn_like(logits) * self.noise_std
probs = F.softmax(logits, dim=-1)
top_k_probs, top_k_indices = torch.topk(probs, self.top_k, dim=-1)
weights = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
return weights, top_k_indices, probs
30.6.3 Loading Mixtral with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto", # distributes across available GPUs
load_in_4bit=True, # reduces VRAM to ~25 GB for 4-bit
)
# Inspect the MoE structure
moe_layer = model.model.layers[0].block_sparse_moe
print(f"Router: {moe_layer.gate}")
# Router: Linear(in_features=4096, out_features=8, bias=False)
print(f"Expert count: {len(moe_layer.experts)}")
# Expert count: 8
print(f"Expert 0: {moe_layer.experts[0]}")
# MixtralBlockSparseTop2MLP(
# (w1): Linear(4096 → 14336, bias=False)
# (w2): Linear(14336 → 4096, bias=False)
# (w3): Linear(4096 → 14336, bias=False)
# )
30.7 MoE vs Dense: When to Use Which
30.7.1 Parameter and Activation Counts
Model Total params Active params Activation %
────────────────────────────────────────────────────────────
LLaMA 2 70B 70B 70B 100%
Mixtral 8x7B 46.7B 12.9B 27.6%
DeepSeek-V3 671B 37B 5.5%
GPT-4 (rumored) ~1.8T ~110B ~6%
As total parameter count grows, the efficient frontier increasingly favors MoE.
30.7.2 Training Cost
| Model | Estimated training cost | Tokens | Hardware |
|---|---|---|---|
| LLaMA 2 70B | ~$5M | 2T | A100 |
| Mixtral 8x7B | ~$2M (estimated) | undisclosed | undisclosed |
| DeepSeek-V3 | $5.5M | 14.8T | H800 (2048 GPUs) |
| GPT-4 | >$100M (rumored) | 13T+ | A100/H100 |
MoE achieves training efficiency through two mechanisms: fewer FLOPs per token (only active experts compute), and better use of the compute budget (more parameters = more capacity for the same compute).
30.7.3 Inference Efficiency
| Metric | Dense 70B | MoE 8x7B (12.9B active) |
|---|---|---|
| Time to first token | baseline | ~0.2× |
| Throughput | baseline | ~3–4× |
| VRAM (FP16) | ~140 GB | ~90 GB |
| Tokens/second | baseline | ~6× |
The throughput advantage is real. The latency advantage is real as well, though routing and dispatch overhead reduce it in practice. The VRAM advantage is also real but does not scale with active-parameter count: you must load all experts.
30.7.4 When to Choose Dense
- Sequence lengths under 4K tokens
- Memory-constrained deployment (inference VRAM budget is the binding constraint)
- Single-task fine-tuning (MoE's multi-domain knowledge is wasted)
- Simpler serving stack is worth more than the efficiency gain
30.7.5 When to Choose MoE
- High throughput requirements (API serving, search augmentation)
- Multilingual or multi-domain tasks
- Available VRAM exceeds what the dense model needs
- Training budget is constrained but you want more total capacity
30.8 Chapter Summary
30.8.1 Key Concepts
| Concept | Meaning |
|---|---|
| MoE | Mixture of Experts — sparse activation for efficient large models |
| Sparse activation | Only a subset of parameters compute for each token |
| Router | Linear layer that assigns tokens to experts via Top-K selection |
| Expert | An independent FFN network with its own parameters |
| Top-K | Select K highest-scoring experts per token (typically K=2) |
| Load balancing | Auxiliary loss that encourages uniform expert utilization |
| MLA | Multi-head Latent Attention — compresses KV cache via low-rank projection |
| Fine-grained MoE | Many small experts instead of few large ones; lower activation rate |
| Shared expert | One expert always active; handles universal token patterns |
30.8.2 Key Numbers
Mixtral 8x7B:
Total params: 46.7B | Active: 12.9B (27.6%)
Experts: 8 | Active per token: 2
Result: matches LLaMA 2 70B at ~6× faster inference
DeepSeek-V3:
Total params: 671B | Active: 37B (5.5%)
Experts: 256 + 1 | Active per token: 8 + 1
Training cost: $5.5M (GPT-4: >$100M)
30.8.3 Core Formulas
Router computation:
router_logits = Linear(x) # hidden_size → num_experts
router_probs = softmax(router_logits)
top_k_weights, top_k_indices = topk(router_probs, k)
MoE output:
y = Σ_{i ∈ TopK(x)} w_i · E_i(x)
where w_i are the renormalized Top-K router weights and E_i is expert i's FFN.
Load-balancing loss:
L_aux = N · Σ_{i=1}^{N} f_i · P_i
where f_i is expert i's selection frequency and P_i is its mean routing probability.
30.8.4 My Take
MoE is the clearest example of "frontier AI is systems engineering." The algorithm — route each token to K experts, train with a load-balancing penalty — is not complicated. What is hard is making it work at scale: routing tokens across hundreds of GPUs without all-to-all communication becoming the bottleneck, debugging router collapse in a 671B model, and building a serving stack that handles the dynamic batching pathology.
DeepSeek-V3 is important not because it is cheaper per inference, but because the $5.5M training figure proves that frontier capability is no longer exclusively a function of training budget. Architectural efficiency compounds.
Chapter Checklist
After this chapter, you should be able to:
- Explain sparse activation and why it decouples total capacity from per-token compute.
- Describe the MoE layer structure: router, Top-K gating, and expert FFNs.
- Explain why K=2 is the standard choice for Top-K selection.
- Explain load-balancing loss and what happens without it.
- Calculate active and total parameter counts for Mixtral 8x7B.
- Explain MLA and why it matters for long-context MoE models.
- Name at least two MoE failure modes and their mitigations.
See You in the Next Chapter
MoE is about spending compute wisely during training and inference. There is another dimension entirely: spending more compute at inference time to get better answers. Chapter 31 explains the reasoning-model revolution — from GPT-4o's 12% on AIME 2024 to o3's 96.7%, and the open-source story that DeepSeek-R1 made possible.