
Mixture of Experts

More parameters per token of compute, paid for in plumbing

01 · First principles: Why must every parameter touch every token?

In a dense transformer, parameter count and per-token compute are the same knob: each token's forward pass multiplies through every weight. Scaling laws say more parameters mean lower loss, but in a dense model more parameters also mean that every single token, including "the", pays the full price of the entire model.

That coupling is a design choice, not a law. Your brain does not light up entirely to parse a comma. The first-principles question: can a model hold a trillion parameters while each token uses only a small, relevant fraction of them? Mixture of Experts answers yes, by making the network conditional.

02 · The mechanism: A learned router over parallel MLPs

Take the transformer's MLP block (which holds roughly two-thirds of the parameters) and replace it with E parallel copies — the experts. A tiny linear layer, the router, scores each token against each expert and sends the token to the top-k (typically k = 1 or 2). The token's output is the weighted sum of its chosen experts; the other E − k experts do nothing for this token.

y = Σ_{i ∈ top-k} softmax(W_r x)_i · E_i(x)

where softmax(W_r x)_i is the router weight for expert i and E_i(x) is expert i's MLP output.

Figure: the router picks two of four experts. Parameters scale with E; per-token compute scales with k.
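The routing formula can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production layer: the function name `moe_layer`, the toy router weights, and the "experts" (simple scaling functions standing in for real MLPs) are all hypothetical. It follows the formula literally, taking the softmax over all E experts and then keeping the top-k weights without renormalising; some implementations renormalise over the chosen k instead.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_weights, experts, k=2):
    """x: token vector; router_weights: one row per expert; experts: callables."""
    # Router: one dot product per expert, i.e. the linear layer W_r x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(logits)
    # Indices of the k highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    # Weighted sum over the chosen experts only; the other E - k never run.
    y = [0.0] * len(x)
    for i in chosen:
        out = experts[i](x)
        y = [yj + probs[i] * oj for yj, oj in zip(y, out)]
    return y, chosen

# Toy setup (assumed): four "experts" that just scale the input by 1..4.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
x = [1.0, 0.0]
y, chosen = moe_layer(x, router_weights, experts, k=2)
```

With these toy weights the router logits are 0, 1, 2 and 3, so experts 3 and 2 are selected and the first two are never evaluated for this token.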

The arithmetic is the point. Mixtral 8×7B holds ~47B parameters but activates ~13B per token; Switch Transformer pushed past a trillion parameters at the compute of a few billion. Loss lands somewhere between what the active and the total parameter counts would predict for a dense model, so you buy real quality with parameters you do not pay for per token.
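The ~47B versus ~13B split can be reproduced with back-of-envelope counting. The sketch below is an estimate under assumptions: the configuration values (model width 4096, expert hidden width 14336, 32 layers, 8 experts, top-2 routing, 32000-token vocabulary, 32 query heads with 8 key/value heads, three weight matrices per gated expert MLP) are assumed here rather than taken from the text above, and normalisation parameters are ignored. The helper name `moe_param_counts` is hypothetical.

```python
def moe_param_counts(d_model, d_ff, n_layers, n_experts, top_k,
                     vocab_size, n_heads, n_kv_heads, mats_per_expert=3):
    """Rough total and per-token-active parameter counts for an MoE transformer."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    # Attention: query and output projections are d_model x d_model,
    # key and value projections are d_model x kv_dim (grouped-query attention).
    attn = n_layers * (2 * d_model * d_model + 2 * d_model * kv_dim)
    # Router: one small linear layer per MoE layer.
    router = n_layers * d_model * n_experts
    # Input embedding plus output projection.
    embed = 2 * vocab_size * d_model
    per_expert = mats_per_expert * d_model * d_ff
    shared = attn + router + embed
    total = shared + n_layers * n_experts * per_expert   # scales with E
    active = shared + n_layers * top_k * per_expert      # scales with k
    return total, active

# Assumed Mixtral-like configuration (see the caveats above).
total, active = moe_param_counts(
    d_model=4096, d_ff=14336, n_layers=32, n_experts=8, top_k=2,
    vocab_size=32000, n_heads=32, n_kv_heads=8,
)
print(f"total ~ {total / 1e9:.1f}B, active ~ {active / 1e9:.1f}B")
```

Under these assumptions the expert MLPs account for almost all of the total, which is why changing E moves the parameter count while changing k moves the per-token compute.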

03 · The failure mode: Routers collapse without help

Train the router naively and a rich-get-richer loop appears. An expert that happens to receive slightly more traffic early trains slightly faster, so the router prefers it more, so it trains faster still. Within a few thousand steps most experts are dead weight and one or two are doing everything — you have paid for E experts and obtained a dense model with extra steps.
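The feedback loop can be caricatured with a toy simulation. This is not a model of real gradient dynamics: the update rule below (a router logit grows in proportion to how far that expert's traffic share sits above the uniform share 1/E) is an assumption that stands in for "more traffic, faster training, stronger router preference", and the names `simulate_collapse`, `steps` and `rate` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simulate_collapse(initial_logits, steps=200, rate=0.5):
    """Toy positive-feedback loop over router logits (assumed update rule)."""
    logits = list(initial_logits)
    num_experts = len(logits)
    for _ in range(steps):
        # Traffic share each expert receives under the current router.
        shares = softmax(logits)
        # Experts above the uniform share gain preference; the rest lose it.
        logits = [z + rate * (s - 1.0 / num_experts)
                  for z, s in zip(logits, shares)]
    return softmax(logits)

# One expert starts with a tiny head start; the other run starts perfectly even.
lucky_start = simulate_collapse([0.01, 0.0, 0.0, 0.0])
even_start = simulate_collapse([0.0, 0.0, 0.0, 0.0])
```

In this toy, a head start of 0.01 in one logit is enough for that expert to end up with nearly all the traffic, while the perfectly even start stays even. The even case is an unstable equilibrium, which is the sense in which collapse is the default outcome rather than bad luck.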

The standard fix is a load-balancing auxiliary loss: penalise the product of each expert's routed-token fraction f_i and its mean router probability p_i, summed over experts.

L_aux = α · E · Σ_i f_i · p_i

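where f_i is the fraction of tokens routed to expert i, p_i is the mean router probability for expert i, E is the number of experts, and α is a small weighting coefficient.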

This is minimised when traffic is uniform, and it nudges the router toward using everyone (DeepSeek-V3 manages with a gentler bias-adjustment scheme instead, but the problem it solves is the same). The tradeoff: the auxiliary loss fights the main loss — perfectly balanced routing is generally not the loss-minimising routing.
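A small worked example makes the "minimised when traffic is uniform" claim concrete. The sketch assumes top-1 routing for simplicity and a coefficient α = 0.01; the function name `load_balancing_loss` and the toy batches are hypothetical.

```python
def load_balancing_loss(assignments, router_probs, num_experts, alpha=0.01):
    """L_aux = alpha * E * sum_i f_i * p_i, for top-1 routing.

    assignments: expert index chosen for each token.
    router_probs: per-token list of router probabilities over experts.
    """
    n = len(assignments)
    # f_i: fraction of tokens actually routed to expert i.
    f = [assignments.count(i) / n for i in range(num_experts)]
    # p_i: mean router probability assigned to expert i across tokens.
    p = [sum(tok[i] for tok in router_probs) / n for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced batch: four tokens, one per expert, uniform router probabilities.
balanced = load_balancing_loss(
    [0, 1, 2, 3], [[0.25, 0.25, 0.25, 0.25]] * 4, num_experts=4)

# Collapsed batch: every token goes to expert 0 with probability 1.
collapsed = load_balancing_loss(
    [0, 0, 0, 0], [[1.0, 0.0, 0.0, 0.0]] * 4, num_experts=4)
```

With uniform traffic every f_i and p_i equals 1/E, so the sum is 1/E and the loss is α. With full collapse the sum is 1 and the loss is α · E, so the penalty grows with the number of experts being wasted.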

04 · The bill: What the sparsity costs

Cost	Why it appears
Memory	All E experts must sit in accelerator memory even though only k are used per token. FLOPs are sparse; VRAM is dense. Serving MoE means serving the full parameter count.
All-to-all communication	Experts are sharded across devices, and each token must physically travel to its experts and back — two all-to-all exchanges per MoE layer, every layer, latency-bound and hard to overlap.
Training instability	Routing is a discrete decision trained with continuous gradients; small logit changes flip token assignments. Symptoms: loss spikes, dropped tokens when an expert's capacity buffer overflows. Mitigations: capacity factors, router z-loss, careful precision.
Fine-tuning friction	Sparse models overfit small datasets more readily, and balanced routing can degrade when the data distribution narrows.
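The capacity buffer mentioned under training instability can be sketched as follows. This is an illustration under assumptions: capacity is taken to be the even per-expert share of routed tokens scaled by a capacity factor (1.25 here is an assumed value), tokens are processed in order, and the helper names `expert_capacity` and `count_dropped` are hypothetical. In many implementations a dropped token skips the expert and continues through the residual path.

```python
import math

def expert_capacity(num_tokens, num_experts, k=1, capacity_factor=1.25):
    """Slots per expert: the even share of routed tokens, padded by a factor."""
    return math.ceil(capacity_factor * num_tokens * k / num_experts)

def count_dropped(assignments, num_experts, capacity):
    """Count tokens that arrive after their expert's buffer is already full."""
    load = [0] * num_experts
    dropped = 0
    for expert in assignments:
        if load[expert] < capacity:
            load[expert] += 1
        else:
            dropped += 1  # buffer overflow: this token is not processed
    return dropped

capacity = expert_capacity(num_tokens=8, num_experts=4)       # ceil(2.5) slots
skewed = count_dropped([0, 0, 0, 0, 0, 1, 2, 3], 4, capacity)  # expert 0 overloaded
even = count_dropped([0, 0, 1, 1, 2, 2, 3, 3], 4, capacity)    # balanced traffic
```

Raising the capacity factor drops fewer tokens but reserves more memory and compute for padding, which is one reason balanced routing matters for efficiency as well as for quality.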

The honest summary: MoE converts a compute problem into a memory-and-networking problem. That is a good trade exactly when you have many accelerators with fast interconnect and your bottleneck is FLOPs, which is the situation at frontier scale and why many frontier models are now MoE.

05 · Perspective: What the experts actually learn

Despite the name, experts rarely specialise by human topic. Inspection shows specialisation by token identity and shallow syntax (one expert receives punctuation, another numerals) more often than by semantics. The honest claim is not "a biology expert and a law expert"; it is that conditional computation gives the model more parameters of capacity wherever the router finds it useful. The router is doing dimensionality allocation, not curriculum design.

Mental Model

MoE decouples the two things dense models conflate: parameters held (capacity) and parameters used per token (compute).
A tiny learned router sends each token to the top-k of E expert MLPs; quality lands between what the active and total parameter counts would predict, at roughly the compute cost of the active ones.
Unregularised routers collapse onto a few experts; the load-balancing auxiliary loss exists to stop the rich-get-richer loop.
You pay in memory (all experts resident), all-to-all communication, and training stability.
Experts specialise by token statistics, not by subject matter; do not anthropomorphise the router.