LLMs

Decoding Techniques

The model gives a distribution; everything else is how you spend it

01 · First principles
Why not just take the most likely token?

The model outputs a probability distribution over the next token; some rule must turn it into an actual token, repeatedly. The naive rule — greedy argmax, or its lookahead cousin beam search aiming at the most likely sequence — produces text that is fluent for a sentence and then degenerates: it loops, repeats phrases, and flattens into generic mush. The failure has a clean cause. Likely-on-average text is atypical text: humans do not emit maximum-probability sentences (a corpus of them would be unreadable), so chasing the argmax leaves the distribution of real language. Repetition then self-reinforces — a repeated phrase becomes more probable given that it just occurred, and the loop locks in.

So sampling is not a concession; for open-ended text it is the correct objective. The art is in which distribution to sample from, because the raw one has a defect of its own: a long tail of thousands of individually negligible tokens whose total mass is large, and one draw from that tail ("The capital of France is… Brussels") can derail everything after it. Every method below is a different way of reshaping or truncating that tail.
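To put rough numbers on the tail, here is a small sketch. The figures are illustrative assumptions, not measurements: a tail of 20,000 tokens at probability 5e-6 each, and draws treated as independent with the same tail mass at every step.

```python
def tail_hit_probability(tail_mass: float, n_tokens: int) -> float:
    """Chance that at least one of n_tokens independent draws lands in the tail."""
    return 1.0 - (1.0 - tail_mass) ** n_tokens

# Hypothetical numbers: 20,000 tail tokens at 5e-6 each.
# No single token exceeds 0.0005%, yet together they hold about 10% of the mass.
tail_mass = 20_000 * 5e-6

for n in (1, 10, 50):
    print(n, round(tail_hit_probability(tail_mass, n), 3))
```

Under these assumptions a 50-token continuation almost certainly contains at least one tail token, which is why truncating the tail matters more than any single token's probability suggests.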

02 · The toolkit
Reshaping the distribution

Temperature rescales logits before the softmax; the truncation rules then zero out the tail and renormalise:

\[ p_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}} \]
T < 1 sharpens, T > 1 flattens, T → 0 is greedy
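A minimal sketch of temperature scaling, assuming logits arrive as a plain list of floats. The function name and the example logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for T -> 0")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtracting the max avoids overflow and leaves p unchanged
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for a three-token vocabulary
for T in (0.5, 1.0, 2.0):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
```

The ranking of tokens never changes with T; only the gaps between them do.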
| Method | Keep… | Character | Weakness |
| --- | --- | --- | --- |
| Top-k | the k highest-probability tokens | simple, predictable cutoff | k is blind to shape: too narrow when the distribution is flat (creative), too wide when it is peaked (factual) |
| Top-p (nucleus) | the smallest set with cumulative mass ≥ p | adapts: few tokens when confident, many when uncertain | at high uncertainty still admits a wide tail of weak candidates |
| Min-p | tokens with p_i ≥ min_p × p_max | threshold scales with confidence; robust at high temperature | newer, fewer settled defaults |
| Repetition penalty | (all, but down-weights recent tokens) | patches the repetition loop directly | a hack; punishes legitimate reuse (code, names) |
[Figure: next-token distribution, sorted (x-axis: tokens, by rank). Top-p keeps the nucleus, about 90% of the mass, and cuts the tail of individually tiny, jointly dangerous tokens.]

The tail problem. No single grey bar is likely, but their sum is — and one draw from there can derail the continuation.
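The four rules in this section can be sketched in a few lines each. This is an illustration on plain Python lists, not any library's implementation. The function names are hypothetical, and the repetition penalty uses one common formulation (divide positive logits by the penalty, multiply negative ones), which is an assumption since the note does not specify one.

```python
def renormalise(probs):
    total = sum(probs)
    return [p / total for p in probs]

def top_k(probs, k):
    """Keep the k highest-probability tokens, zero the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    return renormalise([p if i in keep else 0.0 for i, p in enumerate(probs)])

def top_p(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return renormalise([q if i in keep else 0.0 for i, q in enumerate(probs)])

def min_p(probs, min_p_value):
    """Keep tokens with p_i >= min_p * p_max."""
    threshold = min_p_value * max(probs)
    return renormalise([q if q >= threshold else 0.0 for q in probs])

def apply_repetition_penalty(logits, recent_token_ids, penalty=1.2):
    """Down-weight the logits of recently generated tokens (penalty > 1)."""
    out = list(logits)
    for i in set(recent_token_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # hypothetical sorted distribution
print(top_k(probs, 2))
print(top_p(probs, 0.8))
print(min_p(probs, 0.2))
```

On this example top-p and min-p keep the same three tokens, while top-k with k = 2 cuts one more. They diverge as the distribution flattens, which is the trade-off the table describes.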

03 · The split
Beam for closed tasks, sampling for open ones

Whether you want likely or typical text depends on whether the task has a right answer. Translation, summarisation, structured extraction: the target is nearly deterministic given the input, mode-seeking is appropriate, and beam search (keep the B best partial sequences, extend each, re-prune) earns its cost. Stories, dialogue, brainstorming: the target is a distribution, mode-seeking collapses it into mush, and sampling is correct. Most production chat sits in between and runs temperature ≈ 0.7 with top-p or min-p — sampling, lightly tamed.
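The beam procedure described above (keep the B best partial sequences, extend each, re-prune) can be sketched as follows. The `toy_model` table is invented for illustration: it is built so that the greedy first choice leads to a less likely sequence overall.

```python
import math

def beam_search(next_logprobs, beam_width, max_len, eos="</s>"):
    """Keep the beam_width best partial sequences, extend each, re-prune."""
    beams = [((), 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished: carry over as-is
                continue
            for tok, lp in next_logprobs(tokens).items():
                candidates.append((tokens + (tok,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t and t[-1] == eos for t, _ in beams):
            break
    return beams

def toy_model(prefix):
    """Hypothetical next-token table; any unlisted prefix ends the sequence."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    probs = table.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

for width in (1, 2):
    tokens, score = beam_search(toy_model, beam_width=width, max_len=5)[0]
    print(width, tokens, round(math.exp(score), 2))
```

With a beam of 1 (greedy) the search commits to "a" and ends at probability 0.30; with a beam of 2 it keeps "b" alive long enough to find the 0.36 sequence. The cost is one model call per live beam per step.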

Worth internalising: low temperature does not make a model more truthful; it makes the model more certain about whatever it already believed, errors included. Hallucinations survive T = 0 comfortably.
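A toy illustration of the point, with invented logits for a model that slightly prefers a wrong answer. Lowering the temperature only concentrates mass on whatever already ranks first.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: the wrong token narrowly outranks the right one.
tokens = ["Brussels", "Paris", "Lyon"]
logits = [2.2, 2.0, 0.5]

for T in (1.0, 0.3, 0.05):
    probs = softmax_with_temperature(logits, T)
    print(T, {t: round(p, 3) for t, p in zip(tokens, probs)})
```

At T = 1 the correct token is still sampled a fair share of the time; as T falls, the wrong one takes nearly all the mass.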

04 · The speed trick
Speculative decoding

Decoding is sequential — one forward pass per token — and each pass is memory-bound: the weights are hauled across the memory bus to produce a single token (the same diagnosis as Flash Attention, one level up). The bus crossing costs the same whether you score one token or eight, which creates an exploitable asymmetry: verifying k proposed tokens takes one big-model pass, while generating them would take k.

  1. A small, fast draft model proposes the next k tokens (k ≈ 4–8).
  2. The big model scores all k positions in a single parallel pass.
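  3. Accept each drafted token in order with probability min(1, p/q), where p and q are the big and draft models' probabilities for that token (a rejection-sampling rule). At the first rejection, resample that position from the big model's residual distribution, max(0, p − q) renormalised, and discard the rest of the draft; if all k tokens are accepted, the same pass also yields one extra token from the big model. Then continue.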
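The verify step can be sketched as below. This assumes the standard acceptance rule from the speculative sampling literature, since the note does not spell it out: accept a drafted token with probability min(1, p/q), and on rejection resample from the renormalised residual max(0, p − q). Distributions are plain lists over a small vocabulary, and all names are hypothetical.

```python
import random

def sample(probs, rng):
    """Draw one token id; weights need not sum to 1."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def speculative_step(draft_tokens, draft_dists, target_dists, rng):
    """One verify step of speculative decoding.

    draft_tokens: k token ids proposed by the draft model.
    draft_dists:  the k distributions q_i those tokens were sampled from.
    target_dists: k + 1 distributions p_i from a single big-model pass.
    Returns the token ids emitted this step (between 1 and k + 1 of them).
    """
    out = []
    for tok, q, p in zip(draft_tokens, draft_dists, target_dists):
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # accepted: the big model agrees often enough
            continue
        # Rejected: resample from the residual max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: the same pass yields one bonus token.
    out.append(sample(target_dists[len(draft_tokens)], rng))
    return out

rng = random.Random(0)
q = [0.7, 0.2, 0.1]  # hypothetical draft distribution
p = [0.3, 0.5, 0.2]  # hypothetical big-model distribution
draft = sample(q, rng)
print(speculative_step([draft], [q], [p, p], rng))
```

Summing the accept and resample paths for any token x gives min(p(x), q(x)) + max(0, p(x) − q(x)) = p(x), which is why the emitted token follows the big model's distribution exactly.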

The acceptance rule is built so the output distribution is mathematically identical to decoding with the big model alone — this is a free-lunch speedup (2–3× when the draft agrees often), not an approximation. The draft model only pre-guesses what the big model would have said anyway; easy tokens ("the", closing brackets) get accepted in bulk, hard tokens still cost a full pass.
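A back-of-envelope model of the speedup, under a simplifying assumption not made in the note: each drafted token is accepted independently with the same probability alpha. A pass then emits 1 + alpha + alpha² + … + alpha^k tokens on average.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per big-model pass: sum of alpha**i for i = 0..k."""
    return sum(alpha ** i for i in range(k + 1))

for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

With k = 4 this gives roughly 1.9, 2.8 and 4.1 tokens per pass. Wall-clock speedup is lower than these figures because the draft model's own passes are not free.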

05 · Perspective
The knob is not part of the model

Everything in this note happens after training is finished, in the sampler, at near-zero cost — which makes decoding settings the cheapest quality lever that exists and the most commonly mishandled one. Before concluding a model cannot do a task, check what was done to its distribution on the way out (and, per tokenisation, what was done to its input on the way in).

Mental Model