Narrowing a generalist, without breaking what made it one
Pretraining spends trillions of tokens building a distribution over everything text can do. Supervised finetuning (SFT) then shows the model a few thousand to a few million curated examples. The arithmetic forbids the obvious interpretation: you cannot teach much in 10⁻⁶ of the training signal. What you can do is select. The base model already contains a helpful-assistant mode, a JSON-emitter mode, a radiology-report mode, among its multitudes; SFT moves probability mass onto the mode you want.
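The scale gap can be checked on the back of an envelope. The sketch below is only an illustration: the example count, the tokens per example and the pretraining budget are assumed round numbers, not figures from any particular model.

```python
def sft_token_fraction(num_examples, tokens_per_example, pretraining_tokens):
    """Ratio of SFT tokens to pretraining tokens."""
    return (num_examples * tokens_per_example) / pretraining_tokens


# Assumed, illustrative numbers: 10,000 curated examples of about
# 1,000 tokens each, against a 10-trillion-token pretraining run.
fraction = sft_token_fraction(
    num_examples=10_000,
    tokens_per_example=1_000,
    pretraining_tokens=10**13,
)
print(f"{fraction:.0e}")  # 1e-06
```

Changing the assumed counts moves the ratio by a few orders of magnitude, but it stays a tiny slice of what the model has already seen.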
Mechanically it is just continued training — same cross-entropy loss, new data, usually with the loss masked to assistant responses so the model learns to produce answers rather than to imitate questions.
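A minimal sketch of the masking, assuming we already have the log-probability the model assigned to each target token and a role label per token. The names `build_loss_mask` and `masked_cross_entropy` are hypothetical helpers for illustration, not any library's API.

```python
import math


def build_loss_mask(roles):
    """1 for tokens the assistant wrote, 0 for everything else."""
    return [1 if role == "assistant" else 0 for role in roles]


def masked_cross_entropy(token_log_probs, loss_mask):
    """Mean negative log-likelihood over the positions kept by the mask."""
    if len(token_log_probs) != len(loss_mask):
        raise ValueError("log-probs and mask must have the same length")
    kept = [-lp for lp, keep in zip(token_log_probs, loss_mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    return sum(kept) / len(kept)


# Hypothetical five-token conversation: three prompt tokens, two answer tokens.
roles = ["user", "user", "user", "assistant", "assistant"]
log_probs = [math.log(p) for p in (0.1, 0.2, 0.1, 0.5, 0.25)]

loss = masked_cross_entropy(log_probs, build_loss_mask(roles))
# Only the last two tokens count: -(ln 0.5 + ln 0.25) / 2 = 1.5 * ln 2
```

The poorly predicted prompt tokens contribute nothing to the loss, so no gradient pushes the model toward reproducing user text.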
The naive approach — full-parameter training at pretraining-scale learning rates on the narrow dataset — fails in a precise way. Gradient descent has no loyalty to old skills: it moves weights to fit the data in front of it, and the new dataset is tiny and unrepresentative. Capabilities that took 10²⁴ FLOPs to build (multilinguality, coding, rare knowledge) degrade after a few GPU-hours of enthusiastic updates, because nothing in the loss says to keep them. The narrower and more repetitive the finetuning data, the faster the erosion.
The standard mitigations all amount to one principle: stay close to where you started.
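LoRA reparameterises each adapted weight matrix W as the frozen original plus a product of two thin matrices, W′ = W + BA, where B is d×r, A is r×k, and the rank r is far smaller than d or k. Only A and B are trained.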
The bet is that the change needed for a narrow task is low-rank even though W is not. That is consistent with the selection view, since steering between existing modes should need few directions. The practical wins follow directly: a few hundred MB of trainable state instead of a full optimiser copy (with a quantised base, as in QLoRA, a 70B model can be finetuned on a single high-memory GPU), adapters that can be swapped per task over one shared base, and forgetting that is limited and reversible by construction, because W never moves and removing the adapter restores the base model exactly. The tradeoff is a ceiling: genuinely new, high-rank behaviour, such as a new language or a new modality, exceeds what a rank-16 correction can express, and full finetuning retakes the lead.
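A toy sketch of the idea in plain Python, assuming the usual LoRA form W + BA with B initialised to zeros and A to small Gaussian noise (a common convention). The helper names and the 4096×4096 layer size are illustrative assumptions, and a real implementation would use a tensor library rather than nested lists.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_init(d_out, d_in, rank, seed=0):
    """Return (B, A): B is d_out x rank of zeros, A is rank x d_in of small noise."""
    rng = random.Random(seed)
    B = [[0.0] * rank for _ in range(d_out)]
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    return B, A


def effective_weight(W, B, A):
    """Compute W + BA without modifying W."""
    delta = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


def trainable_fraction(d_out, d_in, rank):
    """Adapter parameters as a share of the full matrix's parameters."""
    return rank * (d_out + d_in) / (d_out * d_in)


# Tiny demo: because B starts at zero, the adapted layer equals the base layer.
rng = random.Random(1)
W = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(4)]
B, A = lora_init(d_out=4, d_in=6, rank=2)
assert effective_weight(W, B, A) == W

# Assumed layer size: a 4096 x 4096 projection adapted at rank 16.
print(trainable_fraction(4096, 4096, 16))  # 0.0078125, under 1% of the matrix
```

Starting from a zero update means training begins exactly at the base model, and dropping B and A at any point returns to it.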
Finetuning is the heaviest of three tools, and the right one least often. The decision factors cleanly: where does the missing piece live?
| The model fails because… | Reach for | Why |
|---|---|---|
| it does not understand what you want (format, tone, role, edge-case policy) | Prompting / few-shot | Zero cost, instant iteration, no infra. Exhaust this first; a surprising fraction of "needs finetuning" is a mediocre prompt. |
| it lacks facts — private, fresh, or long-tail (your docs, today's prices) | RAG | Knowledge belongs in a retrievable store: updateable in seconds, auditable, citable. Finetuning bakes facts into weights where they go stale and cannot be traced. |
| it cannot reliably behave right even with a good prompt: a style, schema or skill needed on every call | Finetune (LoRA first) | Amortises a long prompt into the weights: cheaper per call, more reliable than instructions, and the only fix for behavioural gaps that a prompt cannot close but demonstrations can. |
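The table can be read as an ordered checklist. The sketch below is one possible encoding: the function name, its three boolean inputs and the order of the checks (cheapest tool first) are assumptions made for illustration.

```python
def choose_tool(understands_request, has_needed_facts, behaves_reliably):
    """Return the lightest tool that addresses the first failing condition."""
    if not understands_request:
        # Format, tone, role or policy is unclear: fix the prompt first.
        return "prompting / few-shot"
    if not has_needed_facts:
        # Private, fresh or long-tail knowledge belongs in a retrievable store.
        return "RAG"
    if not behaves_reliably:
        # A good prompt and the right facts still do not yield the behaviour.
        return "finetune (LoRA first)"
    return "no change needed"


# Hypothetical case: the request is clear and the facts are available,
# but the output schema still breaks on some calls.
print(choose_tool(True, True, False))  # finetune (LoRA first)
```

Checking in this order means finetuning is considered only after the two cheaper options have been ruled out.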
SFT optimises the likelihood of demonstrations, so it can only make the model imitate. It cannot express "of these two valid answers, humans prefer this one", and an imitator at best matches its demonstrators, flaws included. Optimising preference rather than likelihood is a different objective with different machinery. That is RLHF, which begins exactly where SFT stops.