The model gives a distribution; everything else is how you spend it
The model outputs a probability distribution over the next token; some rule must turn it into an actual token, repeatedly. The naive rule (greedy argmax, or its lookahead cousin beam search, which aims at the most likely sequence) produces text that is fluent for a sentence and then degenerates: it loops, repeats phrases, and flattens into generic mush. The failure has a clean cause. The most probable text is atypical text: humans do not emit maximum-probability sentences (a corpus of them would be unreadable), so chasing the argmax leaves the distribution of real language. Repetition then self-reinforces: a repeated phrase becomes more probable given that it just occurred, and the loop locks in.
So sampling is not a concession; for open-ended text it is the correct objective. The art is in which distribution to sample from, because the raw one has a defect of its own: a long tail of thousands of individually negligible tokens whose total mass is large, and one draw from that tail ("The capital of France is… Brussels") can derail everything after it. Every method below is a different way of reshaping or truncating that tail.
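
To make the tail problem concrete, here is a small numeric sketch. The vocabulary size, the Zipf-like shape, and the cutoff at 50 tokens are illustrative assumptions, not measurements from any model; real next-token distributions are usually more peaked, so this exaggerates the effect.

```python
def zipf_distribution(vocab_size, exponent=1.0):
    """Toy next-token distribution with p(rank r) proportional to 1 / r**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = zipf_distribution(vocab_size=50_000)  # hypothetical vocabulary size
head = probs[:50]                             # the 50 most likely tokens
tail = probs[50:]                             # everything else

tail_mass = sum(tail)
largest_tail_token = max(tail)

print(f"largest single tail token: {largest_tail_token:.4f}")
print(f"total tail mass:           {tail_mass:.4f}")
```

Under these assumptions no individual tail token reaches 1% probability, yet the tail as a whole holds more than half of the mass, which is the situation the paragraph above describes.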
Temperature rescales logits before the softmax; the truncation rules then zero out the tail and renormalise:
| Method | Keep… | Character | Weakness |
|---|---|---|---|
| Top-k | the k highest-probability tokens | simple, predictable cutoff | k is blind to shape: too narrow when the distribution is flat (creative), too wide when it is peaked (factual) |
| Top-p (nucleus) | the smallest set with cumulative mass ≥ p | adapts: few tokens when confident, many when uncertain | at high uncertainty still admits a wide tail of weak candidates |
| Min-p | tokens with p_i ≥ min_p × p_max | threshold scales with confidence; robust at high temperature | newer, fewer settled defaults |
| Repetition penalty | (all, but down-weights recent tokens) | patches the repetition loop directly | a hack; punishes legitimate reuse (code, names) |
Figure: the tail problem. No single grey bar is likely, but their sum is, and one draw from there can derail the continuation.
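
The temperature step and the four rules in the table can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the function names and the toy logits are made up for the example, and the repetition penalty follows the convention commonly seen in open-source samplers (divide positive logits by the penalty, multiply negative ones), which is an assumption about what the table row refers to.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalise(probs, keep):
    """Zero every index not in `keep`, then rescale the survivors to sum to 1."""
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


def top_k(probs, k):
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, set(ranked[:k]))


def top_p(probs, p):
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest prefix whose mass reaches p
            break
    return renormalise(probs, keep)


def min_p(probs, threshold):
    cutoff = threshold * max(probs)  # the cutoff scales with the top probability
    return renormalise(probs, {i for i, q in enumerate(probs) if q >= cutoff})


def repetition_penalty(logits, recent_ids, penalty=1.2):
    """Push down the logits of recently emitted tokens (applied before the softmax)."""
    out = list(logits)
    for i in set(recent_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def sample(probs, rng=random):
    """Draw one token index according to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [4.0, 3.0, 2.0, 0.5, 0.1, -1.0]  # hypothetical logits, 6-token vocabulary
probs = softmax(logits, temperature=0.7)
kept_top_k = [i for i, q in enumerate(top_k(probs, 2)) if q > 0]
kept_top_p = [i for i, q in enumerate(top_p(probs, 0.9)) if q > 0]
kept_min_p = [i for i, q in enumerate(min_p(probs, 0.05)) if q > 0]
token = sample(top_p(probs, 0.9), random.Random(0))
```

With these toy logits, top-k with k = 2 and top-p with p = 0.9 both keep the two strongest tokens, while min-p at 0.05 keeps three. The order used here (penalty, then temperature, then truncation, then the draw) is one reasonable choice; real samplers differ in how they sequence these steps.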
Whether you want likely or typical text depends on whether the task has a right answer. Translation, summarisation, structured extraction: the target is nearly deterministic given the input, mode-seeking is appropriate, and beam search (keep the B best partial sequences, extend each, re-prune) earns its cost. Stories, dialogue, brainstorming: the target is a distribution, mode-seeking collapses it into mush, and sampling is correct. Most production chat sits in between and runs temperature ≈ 0.7 with top-p or min-p — sampling, lightly tamed.
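
The beam search loop described above (keep the B best partial sequences, extend each, re-prune) can be sketched against a toy model. The bigram table below is entirely hypothetical and stands in for a real model's next-token distribution; it is built so that the greedy choice at the first step is not part of the most probable full sequence. The sketch omits length normalisation, which practical implementations often add.

```python
import math

# Hypothetical toy "model": next-token probabilities depend on the last token only.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def next_log_probs(prefix):
    """Log-probabilities of each possible next token given the prefix."""
    return {tok: math.log(p) for tok, p in BIGRAM[prefix[-1]].items()}


def beam_search(beam_width, max_len=5):
    beams = [(0.0, ["<s>"])]  # (cumulative log-prob, tokens so far)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":  # finished hypotheses carry over unchanged
                candidates.append((score, tokens))
                continue
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((score + lp, tokens + [tok]))
        # Re-prune: keep only the beam_width best partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == "</s>" for _, tokens in beams):
            break
    return beams[0]


greedy_score, greedy_seq = beam_search(beam_width=1)  # width 1 is greedy decoding
beam_score, beam_seq = beam_search(beam_width=2)
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

On this table, width 1 commits to "the" and ends with a sequence of probability 0.30, while width 2 keeps "a" alive long enough to find "a cat" at 0.36. That is the sense in which beam search is the lookahead cousin of greedy argmax.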
Decoding is sequential — one forward pass per token — and each pass is memory-bound: the weights are hauled across the memory bus to produce a single token (the same diagnosis as Flash Attention, one level up). The bus crossing costs the same whether you score one token or eight, which creates an exploitable asymmetry: verifying k proposed tokens takes one big-model pass, while generating them would take k.
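
A back-of-the-envelope calculation shows the size of that asymmetry. Every number here is hypothetical and chosen only to make the arithmetic easy, and the model is deliberately crude: it counts only the time to stream the weights once, ignoring compute, KV-cache traffic, and activations, so it holds only while scoring k positions is still memory-bound.

```python
def pass_latency_seconds(weight_bytes, bandwidth_bytes_per_s):
    """Lower bound on one memory-bound forward pass: time to stream the weights once."""
    return weight_bytes / bandwidth_bytes_per_s


# Hypothetical hardware and model, for illustration only.
weight_bytes = 7e9 * 2  # 7 billion parameters stored in 16-bit precision
bandwidth = 1e12        # 1 TB/s of memory bandwidth
k = 8                   # number of proposed tokens

one_pass = pass_latency_seconds(weight_bytes, bandwidth)
generate_k = k * one_pass  # k sequential passes, one token each
verify_k = one_pass        # a single pass scores all k proposed positions

print(f"one pass:            {one_pass * 1e3:.0f} ms")
print(f"generate {k} tokens:   {generate_k * 1e3:.0f} ms")
print(f"verify {k} proposals:  {verify_k * 1e3:.0f} ms")
```

Under these assumptions a pass costs about 14 ms, so generating eight tokens costs eight times what verifying eight proposals does.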
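Speculative decoding exploits that asymmetry: a small draft model proposes several tokens, the big model scores them all in one pass, and an acceptance rule decides how many of them to keep. The acceptance rule is built so that the output distribution is mathematically identical to decoding with the big model alone, so this is a free-lunch speedup (2–3× when the draft agrees often), not an approximation. The draft model only pre-guesses what the big model would have said anyway; easy tokens ("the", closing brackets) get accepted in bulk, while hard tokens still cost a full pass.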
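
One way to see why the output distribution is unchanged is to simulate the accept-or-resample rule from the speculative sampling literature for a single token position. This is a sketch under stated assumptions, and it may differ in detail from the rule the note has in mind: the two distributions are hypothetical, `speculative_step` is a name invented for the example, and a full implementation would apply the same test left to right over all proposed tokens and stop at the first rejection.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """Return (token, accepted) where token is distributed according to p_target."""
    vocab = range(len(p_target))
    proposal = rng.choices(vocab, weights=q_draft, k=1)[0]  # the draft model's guess
    accept_prob = min(1.0, p_target[proposal] / q_draft[proposal])
    if rng.random() < accept_prob:
        return proposal, True
    # Rejected: resample from the mass the draft under-covers, max(0, p - q).
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(vocab, weights=residual, k=1)[0], False


p_target = [0.6, 0.3, 0.1]  # hypothetical big-model distribution
q_draft = [0.3, 0.3, 0.4]   # hypothetical draft distribution, deliberately mismatched

rng = random.Random(0)
n = 100_000
counts = [0, 0, 0]
accepted = 0
for _ in range(n):
    token, ok = speculative_step(p_target, q_draft, rng)
    counts[token] += 1
    accepted += ok

empirical = [c / n for c in counts]
acceptance_rate = accepted / n
print("empirical output distribution:", [round(e, 3) for e in empirical])
print("acceptance rate:", round(acceptance_rate, 3))
```

The empirical frequencies should match `p_target` up to sampling noise even though every proposal came from `q_draft`, and the acceptance rate should sit near the overlap of the two distributions (the sum of min(p, q) over tokens, 0.7 here). The closer the draft is to the big model, the more proposals survive, which is where the speedup comes from.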
Everything in this note happens after training is finished, in the sampler, at near-zero cost — which makes decoding settings the cheapest quality lever that exists and the most commonly mishandled one. Before concluding a model cannot do a task, check what was done to its distribution on the way out (and, per tokenisation, what was done to its input on the way in).