More parameters per token of compute, paid for in plumbing
In a dense transformer, parameter count and per-token compute are the same knob: each token's forward pass multiplies through every weight. Scaling laws say more parameters mean lower loss, but the coupling also means every single token, including "the", pays the full price of the entire model.
That coupling is a design choice, not a law. Your brain does not light up entirely to parse a comma. The first-principles question: can a model hold a trillion parameters while each token uses only a small, relevant fraction of them? Mixture of Experts answers yes, by making the network conditional.
Take the transformer's MLP block (which holds roughly two-thirds of the parameters) and replace it with E parallel MLPs of the same shape, each with its own weights: the experts. A tiny linear layer, the router, scores each token against each expert and sends the token to the top-k (typically k = 1 or 2). The token's output is the sum of its chosen experts' outputs, weighted by the router's scores; the other E − k experts do nothing for this token.
Figure: the router picks two of four experts (E = 4, k = 2). Parameters scale with E; per-token compute scales with k.
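The routing step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the function name `moe_forward`, the two-matrix ReLU experts, and the choice to softmax over only the selected logits are all assumptions made for the example (some designs softmax over all E experts first and then pick the top-k).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, w_router, experts, k=2):
    """Hypothetical top-k MoE layer.

    x: (T, d) token activations; w_router: (d, E);
    experts: list of E (w_in, w_out) pairs, one small MLP each.
    """
    logits = x @ w_router                                    # (T, E) router scores
    top_idx = np.argsort(-logits, axis=-1)[:, :k]            # (T, k) chosen experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits, axis=-1)                     # weights over the chosen k only
    out = np.zeros_like(x)
    for e, (w_in, w_out) in enumerate(experts):
        token_ids, slot = np.nonzero(top_idx == e)           # tokens that picked expert e
        if token_ids.size == 0:
            continue                                         # this expert does no work
        h = np.maximum(x[token_ids] @ w_in, 0.0)             # expert MLP with ReLU
        out[token_ids] += gates[token_ids, slot][:, None] * (h @ w_out)
    return out, top_idx, gates

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
T, d, hidden, E = 6, 8, 16, 4
x = rng.normal(size=(T, d))
w_router = rng.normal(size=(d, E))
experts = [(rng.normal(size=(d, hidden)), rng.normal(size=(hidden, d))) for _ in range(E)]
out, top_idx, gates = moe_forward(x, w_router, experts, k=2)
```

The loop touches each expert once but only multiplies the tokens routed to it, which is where the compute saving comes from: the weights of all E experts exist, but each token passes through k of them.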
The arithmetic is the point. Mixtral 8×7B holds ~47B parameters but activates ~13B per token; Switch Transformer pushed past a trillion parameters at roughly the per-token compute of a dense model with a few billion. Loss tracks something between active and total parameters: you buy real quality with parameters you do not pay compute for on every token.
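To make the total-versus-active arithmetic concrete, here is a rough parameter count. The configuration values below are assumptions taken from the publicly released Mixtral 8x7B configuration as best recalled (hidden size 4096, expert hidden size 14336, 32 layers, 8 experts, top-2 routing, 32 query heads, 8 key/value heads, vocabulary 32000); the helper `moe_param_counts` is hypothetical and ignores small terms such as normalisation weights.

```python
def moe_param_counts(d_model, d_ff, n_layers, n_experts, top_k,
                     vocab, n_heads, n_kv_heads):
    """Approximate (total, active-per-token) parameter counts for a MoE transformer."""
    head_dim = d_model // n_heads
    # Attention: query and output projections are d x d; key and value are
    # narrower under grouped-query attention.
    attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)
    expert = 3 * d_model * d_ff          # gated MLP: gate, up, and down projections
    router = d_model * n_experts
    embed = 2 * vocab * d_model          # untied input and output embeddings
    total = n_layers * (attn + n_experts * expert + router) + embed
    active = n_layers * (attn + top_k * expert + router) + embed
    return total, active

total, active = moe_param_counts(
    d_model=4096, d_ff=14336, n_layers=32, n_experts=8, top_k=2,
    vocab=32000, n_heads=32, n_kv_heads=8,
)
print(f"total  = {total / 1e9:.1f}B")   # total  = 46.7B
print(f"active = {active / 1e9:.1f}B")  # active = 12.9B
```

Only the expert term changes between the two counts: attention, router, and embeddings are shared by every token, which is why the active count is well above total divided by four.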
Train the router naively and a rich-get-richer loop appears. An expert that happens to receive slightly more traffic early trains slightly faster, so the router prefers it more, so it trains faster still. Within a few thousand steps most experts are dead weight and one or two are doing everything — you have paid for E experts and obtained a dense model with extra steps.
The standard fix is a load-balancing auxiliary loss: penalise the product of each expert's routed-token fraction f_i and its mean router probability p_i, summed over experts.
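Written out, one common form of that loss looks like the following. The scaling by a small coefficient α and by the number of experts E follows the convention used in the Switch Transformer paper and is an assumption here, not something the text above specifies.

$$
\mathcal{L}_{\text{aux}} = \alpha \, E \sum_{i=1}^{E} f_i \, p_i
$$

Here f_i is the fraction of tokens in the batch routed to expert i and p_i is the router probability for expert i averaged over the batch. With the factor E included, perfectly uniform routing (f_i = p_i = 1/E) gives a loss of exactly α regardless of how many experts there are.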
This is minimised when traffic is uniform, and it nudges the router toward using everyone (DeepSeek-V3 manages with a gentler bias-adjustment scheme instead, but the problem it solves is the same). The tradeoff: the auxiliary loss fights the main loss — perfectly balanced routing is generally not the loss-minimising routing.
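A small numerical check of the "minimised when traffic is uniform" claim, as a sketch under assumptions: top-1 routing, the Switch-style scaling α · E · Σ f_i p_i, and a hypothetical helper named `load_balance_loss`. In a real training setup f_i comes from a non-differentiable argmax, so the gradient reaches the router through p_i only.

```python
import numpy as np

def load_balance_loss(probs, assignments, alpha=0.01):
    """probs: (T, E) router probabilities; assignments: (T,) chosen expert per token."""
    n_tokens, n_experts = probs.shape
    f = np.bincount(assignments, minlength=n_experts) / n_tokens  # routed-token fraction
    p = probs.mean(axis=0)                                        # mean router probability
    return alpha * n_experts * float(np.sum(f * p))

n_tokens, n_experts = 8, 4

# Balanced case: equal probabilities, tokens spread round-robin over the experts.
uniform_probs = np.full((n_tokens, n_experts), 1.0 / n_experts)
uniform_assign = np.arange(n_tokens) % n_experts
balanced = load_balance_loss(uniform_probs, uniform_assign)

# Collapsed case: the router sends nearly everything to expert 0.
collapsed_probs = np.tile(np.array([0.97, 0.01, 0.01, 0.01]), (n_tokens, 1))
collapsed_assign = np.zeros(n_tokens, dtype=int)
collapsed = load_balance_loss(collapsed_probs, collapsed_assign)

print(balanced, collapsed)
```

The balanced case evaluates to α and the collapsed case to nearly E times that, so the penalty grows as traffic concentrates on fewer experts.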
| Cost | Why it appears |
|---|---|
| Memory | All E experts must sit in accelerator memory even though only k are used per token. FLOPs are sparse; VRAM is dense. Serving MoE means serving the full parameter count. |
| All-to-all communication | Experts are sharded across devices, and each token must physically travel to its experts and back — two all-to-all exchanges per MoE layer, every layer, latency-bound and hard to overlap. |
| Training instability | Routing is a discrete decision trained with continuous gradients; small logit changes flip token assignments. Symptoms: loss spikes, dropped tokens when an expert's capacity buffer overflows. Mitigations: capacity factors, router z-loss, careful precision. |
| Fine-tuning friction | Sparse models overfit small datasets more readily, and balanced routing can degrade when the data distribution narrows. |
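Two of the mitigations named in the table, capacity factors and router z-loss, are simple enough to sketch. The helper names, the first-come-first-served drop policy, and the default capacity factor of 1.25 are assumptions made for illustration; real systems differ in how they prioritise tokens and in what happens to dropped ones (often they skip the expert and pass through the residual connection).

```python
import math
import numpy as np

def expert_capacity(n_tokens, n_experts, capacity_factor=1.25, top_k=1):
    """Maximum tokens each expert accepts per batch: the even share plus slack."""
    return math.ceil(capacity_factor * top_k * n_tokens / n_experts)

def apply_capacity(assignments, n_experts, capacity):
    """Boolean mask of tokens that fit; later tokens overflow and are dropped."""
    kept = np.zeros(assignments.shape[0], dtype=bool)
    load = np.zeros(n_experts, dtype=int)
    for t, e in enumerate(assignments):
        if load[e] < capacity:
            kept[t] = True
            load[e] += 1
    return kept

def router_z_loss(logits):
    """Mean squared log-sum-exp of router logits; discourages very large logits."""
    m = logits.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return float(np.mean(lse ** 2))

# Eight tokens, four experts, with expert 0 heavily over-subscribed.
assignments = np.array([0, 0, 0, 0, 0, 1, 2, 3])
capacity = expert_capacity(n_tokens=8, n_experts=4)   # ceil(1.25 * 8 / 4) = 3
kept = apply_capacity(assignments, n_experts=4, capacity=capacity)
n_dropped = int((~kept).sum())                        # 2 of expert 0's 5 tokens overflow
```

A higher capacity factor drops fewer tokens but reserves more buffer memory and communication bandwidth per expert, so it trades the instability row of the table against the memory and all-to-all rows.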
Despite the name, experts rarely specialise by human topic. Inspection shows specialisation by token identity and shallow syntax (one expert receives punctuation, another numerals) more often than by semantics. The honest claim is not "a biology expert and a law expert"; it is that conditional computation gives the model extra parameter capacity wherever the router finds it useful. The router is doing capacity allocation, not curriculum design.