The compromise between characters and words, and the strange failures it causes
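A language model consumes a sequence of integers from a fixed vocabulary. Something must chop text into those integers, and the chop size is a genuine tradeoff: the model pays for sequence length (attention cost grows quadratically, and context windows are finite), while the embedding table pays for vocabulary size. The two obvious extremes both fail: single characters cover everything but make sequences very long, while whole words keep sequences short but need an enormous vocabulary and still cannot represent a word they have never seen.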
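The two extremes, characters and whole words, can be made concrete with a small sketch. The sample sentences, the `<unk>` placeholder, and both function names are invented for illustration; the point is only the shape of the tradeoff, not any real tokeniser.

```python
def char_tokenize(text):
    """Character-level: every string is representable, but sequences are long."""
    return list(text)


def word_tokenize(text, vocab):
    """Word-level: short sequences, but any word outside `vocab` is lost."""
    return [w if w in vocab else "<unk>" for w in text.split()]


# Hypothetical "training" text that defines the word vocabulary.
train_text = "the cat sat on the mat"
word_vocab = set(train_text.split())

# New text containing words the vocabulary has never seen.
new_text = "the cat sat on the mat with an axolotl"

chars = char_tokenize(new_text)
words = word_tokenize(new_text, word_vocab)

print(len(chars), "character tokens")
print(len(words), "word tokens,", words.count("<unk>"), "of them unknown")

# Attention cost grows with the square of sequence length, so the
# character-level sequence is far more expensive than the length ratio suggests.
print("relative attention cost:", (len(chars) / len(words)) ** 2)
```

Characters never fail to cover the input but inflate the sequence; words keep it short but collapse every unseen word into the same placeholder.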
The question becomes: what unit is small enough to cover everything, yet large enough to keep sequences short? The answer is statistical, not linguistic — let the corpus decide.
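BPE (byte-pair encoding) builds a subword vocabulary by greedy compression. Start from bytes (so coverage is total by construction) and repeat one step until the vocabulary reaches its target size: count every adjacent pair of tokens in the corpus, then merge the most frequent pair into a single new token.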
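A minimal sketch of that loop, assuming the usual BPE step (count adjacent pairs, fuse the most frequent one into a new ID). It is deliberately simplified: production tokenisers typically split text into word-like chunks before merging and use faster data structures, and the function names here are illustrative rather than any library's API.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # base vocabulary: the 256 byte values
    merges = []                       # ((left, right), new_id), in learned order
    for new_id in range(256, 256 + num_merges):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so another merge would compress nothing
        merges.append((pair, new_id))
        ids = merge_pair(ids, pair, new_id)
    return merges


def encode(text, merges):
    """Tokenise new text by replaying the merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand token IDs back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges:
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


merges = train_bpe("aaabdaaabac", num_merges=10)
print(merges)
print(encode("aaabdaaabac", merges))  # repeated substrings collapse into few IDs
print(encode("xyz", merges))          # unseen text falls back to raw bytes
```

Because the base vocabulary is all 256 byte values, `encode` cannot fail on any input; text that matches no merge simply stays as bytes.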
Frequent strings end up as single tokens, rare strings decompose into pieces, and nothing is ever unrepresentable. To tokenise new text, replay the merges in learned order. The alternative, a unigram LM tokeniser (the default model in SentencePiece), works top-down instead: start with a huge candidate vocabulary, repeatedly prune the pieces whose removal least hurts corpus likelihood, then segment by maximum likelihood. It tends to produce slightly more linguistically natural pieces, but both land in the same place: common things short, rare things decomposed.
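The maximum-likelihood segmentation step of a unigram tokeniser can be sketched as a Viterbi search over piece boundaries. The piece probabilities below are made up for illustration and do not come from any trained model; the pruning stage that chooses the vocabulary is not shown.

```python
import math


def segment(text, logp):
    """Return the most likely split of `text` into pieces from `logp`.

    `logp` maps each vocabulary piece to its log-probability. Pieces are
    treated as independent, so a segmentation's score is the sum of its
    pieces' log-probabilities.
    """
    n = len(text)
    max_len = max(len(piece) for piece in logp)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Hypothetical vocabulary: a few frequent pieces plus single letters as fallback.
LOGP = {ch: math.log(0.001) for ch in "abcdefghijklmnopqrstuvwxyz"}
LOGP.update({"un": math.log(0.05), "break": math.log(0.02), "able": math.log(0.03)})

print(segment("unbreakable", LOGP))
print(segment("xyz", LOGP))
```

With these invented numbers, a split into a few frequent pieces outscores any split that falls back to single letters, which is the "common things short, rare things decomposed" behaviour described above.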
Vocabulary size sets where on the character–word axis you sit. Bigger vocabularies compress better (fewer tokens per text, so more effective context and fewer decode steps) but cost a larger embedding/output matrix, leave rare tokens undertrained, and hit diminishing returns once common words are single tokens anyway. GPT-2 used ~50k; recent models drift toward 100k–250k, partly to compress non-English text and code better, partly because at large model sizes the embedding table stops being a meaningful fraction of parameters.
| Vocab size ↑ | Effect |
|---|---|
| Shorter sequences | more text per context window, fewer autoregressive steps at inference |
| Better multilingual / code coverage | frequent non-English words and code idioms become single tokens |
| Bigger embedding + softmax | matters for small models, negligible at frontier scale |
| Rare-token undertraining | tail tokens get few gradient updates; "glitch tokens" are this failure in the extreme |
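The embedding-cost row is simple arithmetic: the table has `vocab_size * d_model` parameters, doubled if the output matrix is not tied to the input embedding. The sketch below uses approximate public figures for GPT-2 small (about 124M parameters, width 768, tied embeddings) and a purely hypothetical large configuration whose numbers are assumptions chosen only to show the trend.

```python
def embedding_share(vocab_size, d_model, total_params, tied=True):
    """Fraction of a model's parameters spent on token embedding matrices.

    With tied weights the input embedding and output projection share one
    table; otherwise there are two tables of the same shape.
    """
    table = vocab_size * d_model
    return (table if tied else 2 * table) / total_params


# GPT-2 small, approximate figures.
small = embedding_share(vocab_size=50_257, d_model=768, total_params=124e6)

# Hypothetical large model: every number here is an illustrative assumption.
large = embedding_share(vocab_size=128_000, d_model=8_192,
                        total_params=70e9, tied=False)

print(f"small model: {small:.1%} of parameters in embeddings")
print(f"large model: {large:.1%} of parameters in embeddings")
```

The small model spends a large fraction of its parameters on the table, while the hypothetical large one spends only a few percent even with a much bigger vocabulary and untied matrices.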
A surprising share of famous LLM failures are tokeniser artifacts, and they share one cause: the model never sees characters; it sees opaque IDs whose internal structure must be inferred statistically.
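A toy illustration of what "opaque IDs" means in practice. The vocabulary, the ID numbers, and the greedy longest-match rule are all invented for the example (a stand-in for a real BPE tokeniser, not an implementation of one).

```python
# Hypothetical vocabulary: multi-character pieces get arbitrary IDs.
VOCAB = {"straw": 1001, "berry": 1002, " straw": 1003, "Straw": 1004}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokeniser; unknown characters fall back to ord()."""
    max_len = max(len(piece) for piece in vocab)
    ids, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            ids.append(ord(text[i]))  # single-character fallback
            i += 1
    return ids


print(tokenize("strawberry"))   # two IDs; the three letter r's are not visible
print(tokenize(" strawberry"))  # a leading space changes the first ID
print(tokenize("Strawberry"))   # so does capitalisation
print(tokenize("s t r"))        # spaced-out letters look nothing like the word
```

Nothing in the ID sequence for the word says how many times a letter occurs, or that the spaced and capitalised variants are the same word; the model has to learn those relationships from data.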
The tokeniser is trained once, separately, before pretraining, and is frozen forever after — every later capability is built on its choices. It is also a compressor, which makes it kin to the language model itself (cross-entropy training is compression); BPE is simply the cheap, frozen first stage of that compression. Byte-level models that delete the tokeniser keep being proposed, and keep paying the sequence-length bill that subwords were invented to avoid; until that changes, tokenisation remains the contract everything else is built on.