Mixture of Experts

More parameters per token of compute, paid for in plumbing

01 · First principles: Why must every parameter touch every token?

In a dense transformer, parameter count and per-token compute are the same knob: each token's forward pass multiplies through every weight. Scaling laws say more parameters mean lower loss, but in a dense model more parameters also mean that every single token, including "the", pays the full price of the entire model.

That coupling is a design choice, not a law. Your brain does not light up entirely to parse a comma. The first-principles question: can a model hold a trillion parameters while each token uses only a small, relevant fraction of them? Mixture of Experts answers yes, by making the network conditional.

02 · The mechanism: A learned router over parallel MLPs

Take the transformer's MLP block (which holds roughly two-thirds of the parameters) and replace it with E parallel MLPs of the same shape, each with its own weights: the experts. A tiny linear layer, the router, scores each token against each expert and sends the token to the top-k (typically k = 1 or 2). The token's output is the router-weighted sum of its chosen experts' outputs; the other E − k experts do nothing for this token.

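$$y = \sum_{i \in \text{top-}k} \operatorname{softmax}(W_r x)_i \cdot E_i(x)$$

where $\operatorname{softmax}(W_r x)_i$ is the router weight for expert $i$ and $E_i(x)$ is expert $i$'s MLP output.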
[Figure: top-2 routing. The router sends token x to expert 2 (weight 0.7) and expert 4 (weight 0.3), and their weighted outputs are summed. The greyed-out experts 1 and 3 spend zero FLOPs on this token.]

The router picks two of four experts. Parameters scale with E; per-token compute scales with k.
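The routing rule above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not any particular model's implementation: the helper names (`make_expert`, `moe_forward`), the tiny dimensions, and the two-layer ReLU experts are all made up for the example. It follows the formula as written, weighting each chosen expert by its softmax probability over all E experts; some implementations instead renormalise the k weights so they sum to 1.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def make_expert(d_model, d_hidden, rng):
    """A toy two-layer ReLU MLP with its own weights (hypothetical stand-in for E_i)."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def expert(x):
        hidden = [max(0.0, h) for h in matvec(w_in, x)]
        return matvec(w_out, hidden)

    return expert


def moe_forward(x, router_weights, experts, k=2):
    """Return (y, chosen) where y = sum over top-k of softmax(W_r x)_i * E_i(x)."""
    probs = softmax(matvec(router_weights, x))
    top_k = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    y = [0.0] * len(x)
    for i in top_k:  # only k experts run; the other E - k are never called
        out = experts[i](x)
        y = [acc + probs[i] * o for acc, o in zip(y, out)]
    return y, top_k


rng = random.Random(0)
d_model, d_hidden, num_experts = 8, 16, 4
experts = [make_expert(d_model, d_hidden, rng) for _ in range(num_experts)]
router_weights = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(num_experts)]
x = [rng.gauss(0, 1) for _ in range(d_model)]
y, chosen = moe_forward(x, router_weights, experts, k=2)
print("experts used:", chosen)
```

The loop over `top_k` is where the saving comes from: the unchosen experts' weights exist in memory but are never multiplied for this token.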

The arithmetic is the point. Mixtral 8×7B holds ~47B parameters but activates ~13B per token; Switch Transformer pushed past a trillion parameters at roughly the per-token compute of a dense model with a few billion. An MoE model's loss typically lands between that of a dense model matching its active parameter count and one matching its total parameter count, so you buy real quality with parameters you do not pay for per token.
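The Mixtral figures can be roughly reproduced with back-of-the-envelope counting. The sketch below is an approximation under assumptions: the hyperparameters are assumed to match the published Mixtral 8×7B configuration (they are not given in this article), each expert is assumed to be a gated MLP with three projection matrices, the output head is assumed untied from the input embedding, and normalisation weights are ignored. The function name `moe_param_counts` is hypothetical.

```python
def moe_param_counts(d_model, d_ff, n_layers, n_experts, top_k,
                     n_heads, n_kv_heads, vocab_size):
    """Approximate (total, active-per-token) parameter counts for an MoE transformer."""
    head_dim = d_model // n_heads
    # Attention: Q and O are d_model x d_model; K and V are smaller under
    # grouped-query attention because they use fewer heads.
    attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)
    # One gated MLP expert: up, gate and down projections.
    expert = 3 * d_model * d_ff
    # The router is a single linear layer from d_model to n_experts.
    router = d_model * n_experts
    # Input embedding plus an untied output head (assumption).
    embeddings = 2 * vocab_size * d_model
    shared = n_layers * (attn + router) + embeddings
    total = shared + n_layers * n_experts * expert   # what must sit in memory
    active = shared + n_layers * top_k * expert      # what one token multiplies through
    return total, active


# Assumed Mixtral-8x7B-like configuration.
total, active = moe_param_counts(
    d_model=4096, d_ff=14336, n_layers=32, n_experts=8, top_k=2,
    n_heads=32, n_kv_heads=8, vocab_size=32000,
)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumptions the counts land close to the ~47B total and ~13B active quoted above. Almost all of the gap sits in the expert MLPs, which is why raising E grows the model while leaving per-token compute nearly unchanged.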

03 · The failure mode: Routers collapse without help

Train the router naively and a rich-get-richer loop appears. An expert that happens to receive slightly more traffic early trains slightly faster, so the router prefers it more, so it trains faster still. Within a few thousand steps most experts are dead weight and one or two are doing everything — you have paid for E experts and obtained a dense model with extra steps.

The standard fix is a load-balancing auxiliary loss: penalise the product of each expert's routed-token fraction fi and its mean router probability pi, summed over experts.

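$$L_{\text{aux}} = \alpha \cdot E \cdot \sum_{i=1}^{E} f_i \, p_i$$

where $f_i$ is the fraction of tokens routed to expert $i$ and $p_i$ is the mean router probability for expert $i$.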

The loss is minimised when traffic is uniform, so it nudges the router toward using every expert (DeepSeek-V3 relies mainly on a gentler scheme that adjusts a per-expert routing bias instead, but the problem it solves is the same). The tradeoff: the auxiliary loss fights the main loss, because perfectly balanced routing is generally not the loss-minimising routing.
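A minimal sketch of the auxiliary loss for top-1 routing, in plain Python, may make the formula concrete. The function name `load_balancing_loss`, the default `alpha=0.01`, and the two hand-written batches are assumptions for illustration. In real training the token fractions come from a non-differentiable argmax, so the gradient reaches the router only through the mean probabilities.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """L_aux = alpha * E * sum_i f_i * p_i for top-1 routing.

    router_probs: one list of E router probabilities per token.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for probs in router_probs:
        winner = max(range(num_experts), key=lambda i: probs[i])
        counts[winner] += 1                      # feeds f_i (no gradient in practice)
        for i, p in enumerate(probs):
            prob_sums[i] += p                    # feeds p_i (carries the gradient)
    f = [c / num_tokens for c in counts]         # fraction of tokens routed to i
    p = [s / num_tokens for s in prob_sums]      # mean router probability for i
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced batch: each of four tokens prefers a different expert.
balanced = [
    [0.4, 0.2, 0.2, 0.2],
    [0.2, 0.4, 0.2, 0.2],
    [0.2, 0.2, 0.4, 0.2],
    [0.2, 0.2, 0.2, 0.4],
]
# Collapsed batch: every token goes to expert 0.
collapsed = [[0.97, 0.01, 0.01, 0.01]] * 4

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
print(balanced_loss, collapsed_loss)
```

With uniform traffic every f_i and p_i equals 1/E, so the sum is 1/E and the loss bottoms out at α. Full collapse onto one expert pushes it toward α · E, which is the gap the router is being paid to close.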

04 · The bill: What the sparsity costs

| Cost | Why it appears |
| --- | --- |
| Memory | All E experts must sit in accelerator memory even though only k are used per token. FLOPs are sparse; VRAM is dense. Serving MoE means serving the full parameter count. |
| All-to-all communication | Experts are sharded across devices, and each token must physically travel to its experts and back: two all-to-all exchanges per MoE layer, every layer, latency-bound and hard to overlap. |
| Training instability | Routing is a discrete decision trained with continuous gradients; small logit changes flip token assignments. Symptoms: loss spikes, and dropped tokens when an expert's capacity buffer overflows. Mitigations: capacity factors, router z-loss, careful precision. |
| Fine-tuning friction | Sparse models overfit small datasets more readily, and balanced routing can degrade when the data distribution narrows. |

The honest summary: MoE converts a compute problem into a memory-and-networking problem. That is a good trade exactly when you have many accelerators with fast interconnect and your bottleneck is FLOPs, which is the situation at frontier scale and why so many frontier models are now MoE.
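Two of the mitigations named in the table, capacity factors and router z-loss, are simple enough to sketch. The code below is illustrative only: the function names, the capacity factor of 1.25, and the eight-token batch are assumptions. It shows one common formulation, in which each expert gets a fixed number of slots and overflow tokens skip the MoE layer (continuing through the residual connection), and in which z-loss is the mean squared log-sum-exp of the router logits.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor=1.25, top_k=1):
    """Slots per expert: the perfectly balanced share, padded by capacity_factor."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def dispatch_with_capacity(assignments, num_experts, capacity):
    """Fill expert buffers in batch order; overflow tokens are dropped.

    assignments: chosen expert index per token (top-1 routing).
    Returns (kept token indices per expert, dropped token indices).
    """
    kept = [[] for _ in range(num_experts)]
    dropped = []
    for token_idx, expert_idx in enumerate(assignments):
        if len(kept[expert_idx]) < capacity:
            kept[expert_idx].append(token_idx)
        else:
            dropped.append(token_idx)  # buffer full: token skips this MoE layer
    return kept, dropped


def router_z_loss(router_logits):
    """Mean squared log-sum-exp of the router logits; penalises large logits."""
    total = 0.0
    for logits in router_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(v - m) for v in logits))
        total += lse ** 2
    return total / len(router_logits)


# Hypothetical batch of 8 tokens over 4 experts, skewed toward expert 0.
assignments = [0, 0, 0, 0, 0, 1, 2, 3]
capacity = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.25)
kept, dropped = dispatch_with_capacity(assignments, 4, capacity)
print("capacity per expert:", capacity)
print("dropped tokens:", dropped)
```

The capacity factor is a direct trade between the two costs in the table: a larger factor drops fewer tokens but pads every expert's buffer, which costs memory and communication even when the slots go unused.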

05 · Perspective: What the experts actually learn

Despite the name, experts rarely specialise by human topic. Inspection of routing decisions tends to show specialisation by token identity and shallow syntax (one expert receives punctuation, another numerals) more often than by semantics. The honest claim is not "a biology expert and a law expert"; it is that conditional computation gives the model more parameters of capacity wherever the router finds it useful. The router is doing capacity allocation, not curriculum design.
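The kind of inspection described above amounts to tabulating which surface-level token classes each expert receives. The sketch below shows one way such a tally might be built. Everything in it is hypothetical: the `token_class` bucketing is deliberately crude, and the routing log is invented purely to exercise the function, so it says nothing about what any real model does.

```python
from collections import Counter, defaultdict


def token_class(token):
    """Crude surface-level classes; a hypothetical bucketing for illustration."""
    if token.isdigit():
        return "numeral"
    if all(not ch.isalnum() for ch in token):
        return "punctuation"
    return "word"


def specialisation_table(routing_log):
    """Count token classes per expert.

    routing_log: iterable of (token, expert index) pairs recorded from a router.
    """
    table = defaultdict(Counter)
    for token, expert_idx in routing_log:
        table[expert_idx][token_class(token)] += 1
    return {expert: dict(counts) for expert, counts in table.items()}


# Synthetic log, invented for this example; not measured from any model.
routing_log = [("the", 0), (",", 1), ("42", 2), (".", 1), ("cell", 0), ("7", 2)]
table = specialisation_table(routing_log)
print(table)
```

Run against real routing logs, a table like this would let you compare surface-class concentration with topic concentration, which is the comparison the claim above rests on.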
