LLMs

Tokenisation

The compromise between characters and words, and the strange failures it causes

01 · First principles

Text must become integers: at what granularity?

A language model consumes a sequence of integers from a fixed vocabulary. Something must chop text into those integers, and the chop size is a genuine tradeoff: the model pays for sequence length (attention cost grows quadratically with it, and context windows are finite), while the embedding table and output layer pay for vocabulary size. The two obvious extremes both fail:

Characters · vocab ~100

Nothing is ever out of vocabulary, but a typical English word costs five or more positions. Sequences run several times longer than with larger units, effective context shrinks, and the model must spend layers re-learning that t-h-e is one thing.

Words · vocab ~10⁶+

Compact sequences, but language has a long tail that no list covers: names, typos, code identifiers, morphology ("untokenisable"), other languages. Every miss becomes <UNK> — information destroyed at the front door.
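The two extremes can be compared on a toy example. This is only a sketch: the training and test sentences are invented, the character vocabulary is assumed to be Python's `string.printable` (100 symbols), and the word vocabulary is simply every whitespace-separated word in the training text.

```python
import string

# Toy data, invented for illustration only.
train_text = "the cat sat on the mat and the dog sat on the log"
test_text = "the cat sat on the untokenisable mat"

# Character level: a fixed alphabet of 100 symbols covers everything here.
char_vocab = set(string.printable)
char_tokens = list(test_text)

# Word level: the vocabulary is whatever words the training text contained.
word_vocab = set(train_text.split())
word_tokens = [w if w in word_vocab else "<UNK>" for w in test_text.split()]

# Attention cost grows with the square of sequence length.
length_ratio = len(char_tokens) / len(word_tokens)

print(f"char: vocab={len(char_vocab)}, length={len(char_tokens)}")
print(f"word: vocab={len(word_vocab)}, length={len(word_tokens)}, tokens={word_tokens}")
print(f"char sequence is {length_ratio:.1f}x longer, "
      f"roughly {length_ratio ** 2:.0f}x the attention cost")
```

The character version never misses but pays in length; the word version is short but replaces the one unseen word with `<UNK>`, so the original string can no longer be recovered.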

The question becomes: what unit is small enough to cover everything, yet large enough to keep sequences short? The answer is statistical, not linguistic — let the corpus decide.

02 · The mechanism

Byte-Pair Encoding

BPE builds a subword vocabulary by greedy compression. Start from bytes (so coverage is total by construction) and repeat one step until the vocabulary reaches its target size:

1. Count every adjacent pair of tokens in the corpus.
2. Merge the most frequent pair into a single new token; add it to the vocabulary.
3. Repeat. Early merges learn "th", "in"; later ones learn " the", "tion", "ization".
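The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokeniser: the function names (`train_bpe`, `merge_pair`, `encode`, `decode`) are hypothetical, the corpus is a toy string, and real implementations usually pre-split text with a regex and use much faster data structures.

```python
from collections import Counter

def merge_pair(tokens, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to `num_merges` merges, starting from raw bytes (IDs 0-255)."""
    tokens = list(corpus)
    merges = {}  # (left_id, right_id) -> new_id, kept in learned order
    while len(merges) < num_merges:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges compress nothing
        new_id = 256 + len(merges)
        merges[pair] = new_id
        tokens = merge_pair(tokens, pair, new_id)
    return merges

def encode(text: str, merges):
    """Tokenise new text by replaying the merges in learned order."""
    tokens = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        tokens = merge_pair(tokens, pair, new_id)
    return tokens

def decode(tokens, merges):
    """Expand every token back to its bytes and decode as UTF-8."""
    table = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        table[new_id] = table[left] + table[right]
    return b"".join(table[t] for t in tokens).decode("utf-8")

merges = train_bpe(b"the cat and the hat and the bat " * 20, num_merges=10)
ids = encode("the cat", merges)
print(ids, "->", decode(ids, merges))
```

Because the base vocabulary is all 256 byte values, text the corpus never contained still encodes and round-trips; it just costs more tokens.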

Frequent strings end up as single tokens, rare strings decompose into pieces, and nothing is ever unrepresentable. To tokenise new text, start from its bytes and replay the merges in learned order.

A common alternative, the unigram LM tokeniser (one of the algorithms implemented in SentencePiece, which also implements BPE), works top-down instead: start with a huge candidate vocabulary, repeatedly prune the pieces whose removal least hurts corpus likelihood, then segment new text into its most likely sequence of pieces. It tends to produce slightly more linguistically natural pieces; both land in the same place: common things short, rare things decomposed.

"unbelievable" → un · believ · able
"the" → the (one token)

The splits follow frequency, not morphology; the two just often agree.
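The maximum-likelihood segmentation step of a unigram tokeniser can be sketched as a small dynamic programme (Viterbi search). This covers segmentation only, not the pruning loop that learns the vocabulary, and the piece log-probabilities below are made-up numbers chosen for illustration; `viterbi_segment` is a hypothetical name, not a SentencePiece API.

```python
import math

def viterbi_segment(text, logp):
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    max_len = max(len(piece) for piece in logp)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Hypothetical log-probabilities: single letters are expensive fallbacks,
# frequent pieces are cheap.
logp = {ch: -8.0 for ch in "abcdefghijklmnopqrstuvwxyz"}
logp.update({"un": -3.0, "believ": -6.0, "able": -4.0, "unbeliev": -12.0, "the": -1.5})

print(viterbi_segment("unbelievable", logp))
print(viterbi_segment("the", logp))
```

With these numbers the cheapest path for "unbelievable" is the three-piece split, because any path through single letters or through the rarer "unbeliev" scores lower.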

03 · The knob

Vocabulary size

Vocabulary size sets where on the character–word axis you sit. Bigger vocabularies compress better (fewer tokens per text, so more effective context and fewer decode steps) but cost a larger embedding/output matrix, leave rare tokens undertrained, and hit diminishing returns once common words are single tokens anyway. GPT-2 used ~50k; recent models drift toward 100k–250k, partly to compress non-English text and code better, partly because at large model sizes the embedding table stops being a meaningful fraction of parameters.

Consequence of a larger vocabulary	What it means
Shorter sequences	more text per context window, fewer autoregressive steps at inference
Better multilingual / code coverage	frequent non-English words and common code idioms become single tokens
Bigger embedding + softmax	matters for small models, negligible at frontier scale
Rare-token undertraining	tail tokens get few gradient updates; "glitch tokens" are this failure in the extreme
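The embedding-cost row can be checked with back-of-envelope arithmetic. The sketch below assumes the table costs vocab_size × d_model parameters, doubled when the output matrix is not tied to the input embedding. The GPT-2 small figures (50,257 tokens, width 768, about 124M parameters, tied embeddings) are the published ones; the large configuration uses illustrative round numbers, not any specific model.

```python
def embedding_fraction(vocab_size, d_model, total_params, tied=True):
    """Share of parameters in the embedding table (plus output matrix if untied)."""
    table = vocab_size * d_model
    embed_params = table if tied else 2 * table
    return embed_params / total_params

# GPT-2 small: tied input and output embeddings.
small = embedding_fraction(50_257, 768, 124_000_000, tied=True)

# Hypothetical large model: 128k vocabulary, width 8192, 70B parameters, untied.
large = embedding_fraction(128_000, 8_192, 70_000_000_000, tied=False)

print(f"small model: {small:.1%} of parameters")
print(f"large model: {large:.1%} of parameters")
```

Under these assumptions the table is close to a third of the small model but only a few percent of the large one, which is why larger vocabularies become affordable as models grow.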

04 · The pathology

Why models are bad at things tokenisation hides

A surprising share of famous LLM failures are tokeniser artifacts, and they share one cause: the model never sees characters; it sees opaque IDs whose internal structure must be inferred statistically.

Spelling and counting letters. "strawberry" is typically one to three tokens, depending on the tokeniser and whether a space precedes it; knowing it contains three r's requires having memorised the spelling of symbols the model cannot look inside. Asking how many r's is like asking about the pixels of an image it was never shown.
Arithmetic. Numbers are chunked inconsistently (1234 might be "12"+"34", 1235 might be "123"+"5"), so digit-place alignment, the heart of column arithmetic, is scrambled at the input. (Several modern tokenisers split numbers into single digits or into groups of at most three digits, in part for this reason.)
Multilingual inequity. Merges follow corpus frequency, and corpora are mostly English. The same sentence can cost 2–4× more tokens in Burmese or Tamil than in English — meaning less effective context, more inference cost per word, and worse quality, all before the model proper does anything.
Whitespace and code quirks. " hello" and "hello" are different tokens; trailing spaces flip completions. Indentation-heavy code lives or dies by whitespace merge choices.
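The arithmetic problem is easy to reproduce with a toy tokeniser. The sketch below uses greedy longest-match as a simple stand-in for a learned tokeniser (it is not BPE), and the vocabulary of multi-digit chunks is hypothetical. It shows the same digits landing in different chunks depending on their neighbours, and how a one-token-per-digit rule removes the inconsistency.

```python
def greedy_tokenise(text, vocab):
    """Longest-match-first segmentation; a stand-in for a learned tokeniser."""
    max_len = max(len(piece) for piece in vocab)
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in vocab:
                out.append(text[i:i + size])
                i += size
                break
        else:
            raise ValueError(f"no piece covers {text[i]!r}")
    return out

def digit_tokenise(text):
    """One token per character, so every digit keeps its own position."""
    return list(text)

# Hypothetical vocabulary: all single digits plus a few "frequent" chunks.
vocab = set("0123456789") | {"12", "34", "123"}

for number in ["34", "234", "1234"]:
    print(number, greedy_tokenise(number, vocab), digit_tokenise(number))
```

In the greedy output the trailing "34" is one token, then part of "2"+"34", then split across "123"+"4", so the tens and units digits never sit in a stable place. The per-digit version is longer but aligned.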

Diagnostic habit: when an LLM fails at something a child finds easy, check the token boundaries before theorising about reasoning. The failure is often in the input representation, not the network. (Sampling oddities blamed on decoding sometimes trace here too.)
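One way to practise that habit is to print the token boundaries with whitespace made visible. The sketch below is tokeniser-agnostic in spirit but, to stay self-contained, uses a toy greedy longest-match tokeniser and a hypothetical vocabulary; with a real model you would substitute its own tokeniser's output for `tokenise`.

```python
def tokenise(text, vocab):
    """Greedy longest-match over a {piece: id} dict; a stand-in, not real BPE."""
    max_len = max(len(piece) for piece in vocab)
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                out.append((piece, vocab[piece]))
                i += size
                break
        else:
            raise ValueError(f"no piece covers {text[i]!r}")
    return out

def show_boundaries(pairs):
    """Render pieces with visible spaces and '|' at every token boundary."""
    return "|".join(piece.replace(" ", "␣") for piece, _ in pairs)

# Hypothetical vocabulary: note " hello" and "hello" are separate entries.
vocab = {" hello": 0, "hello": 1, " world": 2, " ": 3}
vocab.update({ch: 4 + i for i, ch in enumerate("dehlorw")})

for prompt in ["hello world", " hello world", "hello world "]:
    pairs = tokenise(prompt, vocab)
    print(repr(prompt), show_boundaries(pairs), [tok_id for _, tok_id in pairs])
```

The three prompts differ only in whitespace, yet each produces a different ID sequence, which is the kind of difference worth ruling out before blaming the network.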

05 · Perspective

A lossy contract signed before training

The tokeniser is trained once, separately, before pretraining, and is frozen forever after — every later capability is built on its choices. It is also a compressor, which makes it kin to the language model itself (cross-entropy training is compression); BPE is simply the cheap, frozen first stage of that compression. Byte-level models that delete the tokeniser keep being proposed, and keep paying the sequence-length bill that subwords were invented to avoid; until that changes, tokenisation remains the contract everything else is built on.

Mental Model

Characters waste sequence length; words cannot cover the long tail. Subwords are the negotiated middle.
BPE = greedy compression: repeatedly merge the most frequent pair, starting from bytes, so coverage is total and frequent strings get short codes.
Vocab size trades sequence length against embedding size and tail-token training.
The model sees opaque IDs, not letters: spelling, arithmetic, and multilingual failures are often input artifacts rather than reasoning failures.
The tokeniser is frozen before training and constrains everything after; check token boundaries before blaming the model.