LLMs

Finetuning

Narrowing a generalist, without breaking what made it one

01 · First principles
What can a few thousand examples actually change?

Pretraining spends trillions of tokens building a distribution over everything text can do. Supervised finetuning (SFT) then shows the model a few thousand to a few million curated examples. The arithmetic forbids the obvious interpretation: you cannot teach much in 10⁻⁶ of the training signal. What you can do is select. The base model already contains a helpful-assistant mode, a JSON-emitter mode, a radiology-report mode, among its multitudes; SFT moves probability mass onto the mode you want.
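The arithmetic can be checked on the back of an envelope. The sketch below assumes a 10-trillion-token pretraining run and about 1,000 tokens per SFT example; both figures and the helper name `sft_fraction` are illustrative assumptions, not numbers taken from any particular model.

```python
# Back-of-envelope token budget. Every figure here is an illustrative assumption.
PRETRAIN_TOKENS = 10e12      # assumed: 10 trillion pretraining tokens
TOKENS_PER_EXAMPLE = 1_000   # assumed: average length of one SFT example


def sft_fraction(num_examples: int) -> float:
    """Share of the pretraining token count that an SFT set represents."""
    return num_examples * TOKENS_PER_EXAMPLE / PRETRAIN_TOKENS


for n in (10_000, 1_000_000):
    print(f"{n:>9,} examples -> {sft_fraction(n):.0e} of pretraining tokens")
```

Under these assumptions, ten thousand examples come to about 10⁻⁶ of the pretraining signal, and even a million examples only reach about 10⁻⁴.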

The one-sentence theory: SFT narrows behaviour; it does not add knowledge. If the base model cannot do the task given a perfect prompt and a few examples, finetuning on the same kind of data will mostly teach it to confidently format answers it does not have — the classic recipe for fluent hallucination.

Mechanically it is just continued training — same cross-entropy loss, new data, usually with the loss masked to assistant responses so the model learns to produce answers rather than to imitate questions.
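A minimal sketch of the masking step, in plain Python so that it runs without a training framework. The function name `masked_cross_entropy`, the toy probabilities, and the 0/1 mask layout are assumptions made for illustration; real training code would apply the same idea to tensors of logits.

```python
import math


def masked_cross_entropy(token_logprobs, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1.

    token_logprobs: log-probability the model assigned to each target token.
    loss_mask: 1 for assistant-response tokens, 0 for prompt tokens.
    """
    if len(token_logprobs) != len(loss_mask):
        raise ValueError("token_logprobs and loss_mask must have equal length")
    kept = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    if not kept:
        raise ValueError("loss_mask selects no tokens")
    return sum(kept) / len(kept)


# Hypothetical 6-token sequence: 3 prompt tokens, then 3 response tokens.
logprobs = [math.log(0.1), math.log(0.2), math.log(0.1),    # prompt (ignored)
            math.log(0.5), math.log(0.25), math.log(0.5)]   # response (trained on)
mask = [0, 0, 0, 1, 1, 1]

loss = masked_cross_entropy(logprobs, mask)
print(round(loss, 4))  # 0.9242, i.e. (ln 2 + ln 4 + ln 2) / 3
```

The poorly predicted prompt tokens contribute nothing to the loss, so no gradient pushes the model towards imitating the question.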

02 · The failure
Catastrophic forgetting

The naive approach — full-parameter training at pretraining-scale learning rates on the narrow dataset — fails in a precise way. Gradient descent has no loyalty to old skills: it moves weights to fit the data in front of it, and the new dataset is tiny and unrepresentative. Capabilities that took 10²⁴ FLOPs to build (multilinguality, coding, rare knowledge) degrade after a few GPU-hours of enthusiastic updates, because nothing in the loss says to keep them. The narrower and more repetitive the finetuning data, the faster the erosion.

The standard mitigations all amount to one principle — stay close to where you started:

Low learning rates (typically 10–100× below the pretraining LR) and few epochs: limit total weight movement by brute restraint.
Replay: mix a slice of general pretraining-like data into the finetuning batch, so the old distribution keeps voting.
LoRA: freeze the weights entirely and learn a low-rank correction beside them.
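Of the three, replay is the easiest to sketch in a few lines. The helper below is a hypothetical illustration: the name `mix_with_replay`, the 25% replay fraction, and the toy example strings are all assumptions, and a real pipeline would stream from datasets rather than lists.

```python
import random


def mix_with_replay(task_examples, replay_examples, replay_fraction,
                    batch_size, seed=0):
    """Sample one batch in which about replay_fraction of the items are general data."""
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_replay = round(batch_size * replay_fraction)
    n_task = batch_size - n_replay
    # Tag each item with its source so the mixing ratio can be inspected.
    batch = [("task", x) for x in rng.choices(task_examples, k=n_task)]
    batch += [("replay", x) for x in rng.choices(replay_examples, k=n_replay)]
    rng.shuffle(batch)  # interleave so the general data is spread through the batch
    return batch


batch = mix_with_replay(
    task_examples=["radiology report 1", "radiology report 2"],
    replay_examples=["general web text", "code snippet", "multilingual text"],
    replay_fraction=0.25,
    batch_size=8,
)
print(sum(source == "replay" for source, _ in batch), "of", len(batch), "items are replay")
```

The right fraction is an empirical choice; the point of the sketch is only that every gradient step sees some of the old distribution.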

03 · The workhorse
LoRA, in one equation

LoRA reparameterises each adapted weight matrix as the frozen original plus a product of two thin matrices:

W' = W + (α/r) · B A,   A ∈ ℝ^(r×d),   B ∈ ℝ^(d×r),   r ≪ d

Here W is the frozen pretrained weight matrix and (α/r) · B A is the trainable low-rank update, ~0.1–1% of parameters.
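The equation can be traced with toy matrices in plain Python. This is a sketch under stated assumptions: the helper names `matmul` and `lora_effective_weight` are made up for illustration, the sizes d = 4 and r = 2 are far smaller than anything real, and the zero initialisation of B follows the usual LoRA convention rather than anything stated here.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_effective_weight(W, A, B, alpha):
    """Return W' = W + (alpha / r) * B A, where r is the number of rows of A."""
    r = len(A)
    scale = alpha / r
    BA = matmul(B, A)  # (d x r) times (r x d) gives d x d
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]


d, r, alpha = 4, 2, 4.0  # toy sizes, assumed for illustration
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen, d x d
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # trainable, r x d
B = [[0.0] * r for _ in range(d)]                               # trainable, d x r, zero init

W_start = lora_effective_weight(W, A, B, alpha)
print(W_start == W)  # True: with B = 0 the update vanishes
```

Because B starts at zero, the adapted model begins as an exact copy of the base model, and training moves only A and B.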

The bet is that the change needed for a narrow task is low-rank even though W is not, which is consistent with the selection view, since steering between existing modes should need few directions. The practical wins follow directly: a few hundred MB of trainable state instead of a full optimiser copy (with a quantised base, as in QLoRA, a 70B model can finetune on a single high-memory GPU), adapters that can be swapped per task over one shared base, and forgetting that is bounded by construction, because W never moves and removing the adapter restores the base model exactly. The tradeoff is a ceiling: genuinely new, high-rank behaviour (a new language, a new modality) exceeds what a rank-16 correction can express, and full finetuning retakes the lead.
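The size of the trainable state follows from counting entries. The sketch below assumes a square d × d matrix with d = 4096, a width chosen only for illustration; the function name `lora_param_fraction` is likewise hypothetical.

```python
def lora_param_fraction(d: int, r: int) -> float:
    """Trainable share for one d x d matrix: A (r x d) plus B (d x r) over W (d x d)."""
    return (2 * r * d) / (d * d)


for r in (4, 16, 64):
    print(f"d=4096, r={r:>2}: {lora_param_fraction(4096, r):.2%} of the matrix")
```

The ratio simplifies to 2r/d, so rank 16 at this width trains under 1% of each adapted matrix. The share of the whole model is smaller still when only some matrices carry adapters.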

04 · The decision
Prompt, RAG, or finetune

Finetuning is the heaviest of three tools, and the right one least often. The decision factors cleanly: where does the missing piece live?

The model fails because…	Reach for	Why
it does not understand what you want (format, tone, role, edge-case policy)	Prompting / few-shot	Zero cost, instant iteration, no infra. Exhaust this first; a surprising fraction of "needs finetuning" is a mediocre prompt.
it lacks facts — private, fresh, or long-tail (your docs, today's prices)	RAG	Knowledge belongs in a retrievable store: updateable in seconds, auditable, citable. Finetuning bakes facts into weights where they go stale and cannot be traced.
it cannot reliably behave right even with a good prompt (a style, schema or skill needed on every call)	Finetune (LoRA first)	Amortises a long prompt into the weights: cheaper per call, more reliable than instructions, and the fix for behavioural gaps that demonstrations can close but instructions cannot.

They compose: the common production stack is a finetuned model (behaviour) over a RAG pipeline (knowledge) under a short prompt (instructions). The methods answer different failure modes, so the choice is rarely either-or.
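The table reduces to a lookup applied in a fixed order. The sketch below is one hypothetical encoding of it: the `Gap` enum, the `REMEDY` mapping, and the `plan` function are names invented for illustration, and diagnosing which gap applies remains a human judgement.

```python
from enum import Enum


class Gap(Enum):
    INSTRUCTIONS = "does not understand what you want"
    FACTS = "lacks private, fresh, or long-tail facts"
    BEHAVIOUR = "cannot behave reliably even with a good prompt"


REMEDY = {
    Gap.INSTRUCTIONS: "prompting / few-shot",
    Gap.FACTS: "RAG",
    Gap.BEHAVIOUR: "finetune (LoRA first)",
}

# Cheapest tool first; the remedies stack rather than exclude one another.
ORDER = [Gap.INSTRUCTIONS, Gap.FACTS, Gap.BEHAVIOUR]


def plan(gaps):
    """Return the remedies for the observed gaps, in the order to try them."""
    return [REMEDY[g] for g in ORDER if g in gaps]


print(plan({Gap.BEHAVIOUR, Gap.FACTS}))  # ['RAG', 'finetune (LoRA first)']
```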

05 · The boundary
What SFT cannot reach

SFT optimises likelihood of demonstrations, so it can only make the model imitate. It cannot express "of these two valid answers, humans prefer this one", and an imitator inherits the demonstrator's flaws at best. Optimising preference rather than likelihood is a different objective with different machinery — that is RLHF, which begins exactly where SFT stops.

Mental Model

SFT selects among behaviours the base model already has; the token budget is too small to add knowledge.
Catastrophic forgetting is gradient descent's default; every mitigation (low LR, replay, LoRA) is a way of staying near the pretrained weights.
LoRA: freeze W, learn a low-rank delta, on the bet that behavioural steering is a low-rank change.
Missing instructions → prompt. Missing facts → RAG. Missing reliable behaviour → finetune. In that order, and they stack.
Finetuning on facts the model does not know teaches the format of knowledge without the substance — hallucination with better grammar.