LLMs

Tokenisation

The compromise between characters and words, and the strange failures it causes

01 · First principles · Text must become integers: at what granularity?

A language model consumes a sequence of integers from a fixed vocabulary. Something must chop text into those integers, and the chop size is a genuine tradeoff, because the model pays for sequence length (attention is quadratic, context windows are finite) while the embedding table pays for vocabulary size. The two obvious extremes both fail:

Characters · vocab ~100
Nothing is ever out of vocabulary, but a typical word costs 5–10 positions. Sequence length roughly quadruples relative to subword tokens, effective context shrinks to match, and the model must spend layers re-learning that t-h-e is one thing.
Words · vocab ~10⁶+
Compact sequences, but language has a long tail that no list covers: names, typos, code identifiers, morphology ("untokenisable"), other languages. Every miss becomes <UNK> — information destroyed at the front door.
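A minimal sketch of both extremes on one sentence. The six-entry word list is a made-up stand-in for a real word vocabulary; it only illustrates how the two failure modes show up in the integer sequence.

```python
text = "the untokenisable cat sat on the mat"

# Character level: tiny vocabulary, total coverage, long sequence.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in text]

# Word level: short sequence, but anything missing from the list becomes <UNK>.
# This word list is hypothetical and deliberately small.
word_vocab = {"<UNK>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
word_ids = [word_vocab.get(w, word_vocab["<UNK>"]) for w in text.split()]

print(len(char_ids), "character positions")   # one position per character
print(len(word_ids), "word positions")        # one position per word
print(word_ids)                               # the 0 marks where "untokenisable" was lost
```

The character sequence is several times longer than the word sequence, while the word sequence has silently replaced "untokenisable" with the same ID any other unknown word would get.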

The question becomes: what unit is small enough to cover everything, yet large enough to keep sequences short? The answer is statistical, not linguistic — let the corpus decide.

02 · The mechanism · Byte-Pair Encoding

BPE builds a subword vocabulary by greedy compression. Start from bytes (so coverage is total by construction) and repeat one step until the vocabulary reaches its target size:

  1. Count every adjacent pair of tokens in the corpus.
  2. Merge the most frequent pair into a single new token; add it to the vocabulary.
  3. Repeat. Early merges learn "th", "in"; later ones learn " the", "tion", "ization".
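The steps above can be sketched in a few lines. This is a toy version under stated assumptions: it counts pairs over the raw byte stream with no pre-tokenisation into words, recounts from scratch on every iteration, and breaks ties by first occurrence. The name `train_bpe` and the early-stop rule are illustrative choices, not part of any particular library.

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Learn up to num_merges merges, starting from single bytes."""
    # Start from bytes so every possible input is representable.
    tokens = [bytes([b]) for b in corpus.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Step 1: count every adjacent pair of tokens.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Step 2: take the most frequent pair (ties go to the pair seen first).
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so another merge would compress nothing
        merges.append((left, right))
        # Rewrite the corpus with that pair fused into one new token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged  # Step 3: repeat on the rewritten corpus.
    return merges, tokens

corpus = "the cat sat on the mat, then the cat ate"
merges, tokens = train_bpe(corpus, num_merges=6)
print(merges)
print(tokens)
```

On this tiny corpus the first merge is "a" + "t", simply because "at" is the most frequent adjacent pair; nothing about the procedure knows what a word or a morpheme is.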

Frequent strings end up as single tokens, rare strings decompose into pieces, and nothing is ever unrepresentable. To tokenise new text, replay the merges in learned order. The main alternative, the unigram LM tokeniser (one of the algorithms SentencePiece implements, alongside BPE), works top-down instead: start with a huge candidate vocabulary, repeatedly prune the pieces whose removal least hurts corpus likelihood until the target size is reached, then segment new text by maximum likelihood. It tends to produce slightly more linguistically natural pieces; both land in the same place: common-things-short, rare-things-decomposed.
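For the unigram approach, "segment by maximum likelihood" amounts to a shortest-path search over all ways of cutting the string. A minimal sketch, assuming a piece vocabulary with log-probabilities is already available; the numbers below are invented for illustration and `viterbi_segment` is a hypothetical helper, not a library function.

```python
import math

def viterbi_segment(word: str, logp: dict[str, float]) -> list[str]:
    """Return the segmentation of word with the highest total log-probability."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score for word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that best split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("word cannot be built from the given pieces")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, i = [], n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Made-up log-probabilities; single characters act as an expensive fallback.
logp = {"un": -4.0, "believ": -7.0, "able": -5.0, "unbeliev": -14.0, "believable": -13.0}
logp.update({ch: -10.0 for ch in set("unbelievable")})

print(viterbi_segment("unbelievable", logp))
```

With these scores the three-piece split wins because its pieces are individually common; change the numbers and the split changes with them.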

"unbelievable" → un · believ · able     "the" → the (one token)
frequency, not morphology — it just often agrees
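Replaying merges in learned order, as described above, can be sketched as follows. The merge table is hypothetical and hand-written; a real one would come out of training and hold tens of thousands of entries.

```python
def bpe_encode(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Tokenise text by applying each learned merge in order, earliest first."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Hypothetical merge table, listed in the order it was "learned".
toy_merges = [(b"t", b"h"), (b"th", b"e"), (b"a", b"t"), (b"c", b"at")]

print(bpe_encode("the", toy_merges))
print(bpe_encode("the cat bathes", toy_merges))
```

Under this table "the" comes out as one token, and "bathes" splits as b · a · the · s: the earlier "th" merge claims the "t" before "at" gets a chance, which is frequency deciding the boundary rather than morphology.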

03 · The knob · Vocabulary size

Vocabulary size sets where on the character–word axis you sit. Bigger vocabularies compress better (fewer tokens per text, so more effective context and fewer decode steps) but cost a larger embedding/output matrix, leave rare tokens undertrained, and hit diminishing returns once common words are single tokens anyway. GPT-2 used ~50k; recent models drift toward 100k–250k, partly to compress non-English text and code better, partly because at large model sizes the embedding table stops being a meaningful fraction of parameters.

Vocab size ↑ | Effect
Shorter sequences | more text per context window, fewer autoregressive steps at inference
Better multilingual / code coverage | frequent foreign words and idioms become single tokens
Bigger embedding + softmax | matters for small models, negligible at frontier scale
Rare-token undertraining | tail tokens get few gradient updates; "glitch tokens" are this failure in the extreme
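The "matters for small models, negligible at frontier scale" point can be checked with back-of-envelope arithmetic. This sketch assumes the common rule of thumb of roughly 12 · d_model² parameters per transformer block and tied input/output embeddings by default; both model shapes are hypothetical, chosen only to contrast a small and a large model.

```python
def embedding_fraction(vocab_size: int, d_model: int, n_layers: int, tied: bool = True) -> float:
    """Approximate share of parameters that sit in the embedding/output matrices."""
    emb = vocab_size * d_model * (1 if tied else 2)
    # Rule-of-thumb estimate: about 12 * d_model^2 parameters per block
    # (attention projections plus a 4x-wide MLP). An approximation, not an exact count.
    body = 12 * n_layers * d_model ** 2
    return emb / (emb + body)

# Hypothetical shapes: one small model, one large model.
for vocab in (50_000, 250_000):
    small = embedding_fraction(vocab, d_model=768, n_layers=12)
    large = embedding_fraction(vocab, d_model=8192, n_layers=80)
    print(f"vocab={vocab:>7}: small {small:.1%}, large {large:.1%}")
```

Under these assumptions, growing the vocabulary fivefold takes the embedding share of the small model from about a third to well over half, while for the large model it stays in the low single digits.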

04 · The pathology · Why models are bad at things tokenisation hides

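A surprising share of famous LLM failures are tokeniser artifacts, and they share one cause: the model never sees characters; it sees opaque IDs whose internal structure must be inferred statistically. Counting the letters in a word, reversing a string, and adding multi-digit numbers all depend on character-level or digit-level structure that the IDs do not expose.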

Diagnostic habit: when an LLM fails at something a child finds easy, check the token boundaries before theorising about reasoning. The failure is often in the input representation, not the network. (Sampling oddities blamed on decoding sometimes trace here too.)
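A sketch of that habit: print the pieces the model actually receives before blaming the network. The greedy longest-match splitter and the four-entry vocabulary here are stand-ins invented for illustration; with a real tokeniser, the equivalent view comes from decoding each ID on its own.

```python
def show_boundaries(text: str, vocab: dict[str, int]) -> list[str]:
    """Greedy longest-match split; a stand-in for a real tokeniser's encode step."""
    pieces, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                pieces.append(text[i:j])
                i = j
                break
    return pieces

# Hypothetical vocabulary and IDs.
toy_vocab = {"straw": 101, "berry": 102, "123": 201, "45": 202}

pieces = show_boundaries("strawberry", toy_vocab)
ids = [toy_vocab[p] for p in pieces]
print(pieces, ids)                         # what the model is given
print("strawberry".count("r"))             # what the question asks about
print(show_boundaries("12345", toy_vocab)) # digits chunked by frequency, not place value
```

The question "how many r's?" is about ten characters, but the input is two integers; nothing in the pair of IDs says that one piece holds a single "r" and the other holds two.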

05 · Perspective · A lossy contract signed before training

The tokeniser is trained once, separately, before pretraining, and is frozen forever after — every later capability is built on its choices. It is also a compressor, which makes it kin to the language model itself (cross-entropy training is compression); BPE is simply the cheap, frozen first stage of that compression. Byte-level models that delete the tokeniser keep being proposed, and keep paying the sequence-length bill that subwords were invented to avoid; until that changes, tokenisation remains the contract everything else is built on.
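The sequence-length bill is easy to estimate. This sketch assumes an average of about 4 bytes per subword token, a rough figure for English text that varies with language and tokeniser, and counts only the quadratic attention term.

```python
def byte_level_cost(n_subword_tokens: int, bytes_per_token: float = 4.0):
    """Compare a byte-level sequence against the same text as subword tokens."""
    n_bytes = n_subword_tokens * bytes_per_token      # positions a byte-level model needs
    length_ratio = n_bytes / n_subword_tokens         # how much longer the sequence gets
    attention_ratio = length_ratio ** 2               # pairwise attention scales with length squared
    return n_bytes, length_ratio, attention_ratio

n_bytes, length_ratio, attention_ratio = byte_level_cost(8192)
print(n_bytes, length_ratio, attention_ratio)
```

Under that assumption, text that fits in 8,192 subword positions needs about 32,768 byte positions, and the attention cost grows by a factor of about 16.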
