General ML

Transfer Learning

Do not re-learn what is already learned

01 · First principles: Why training from scratch is wasteful

Train an image classifier from scratch on 2,000 medical scans and watch it fail. The reason is not the classifier head; it is that the network spends its tiny data budget re-discovering edges, textures, and shapes — features that every visual task on Earth shares. The same holds for text: syntax and word meaning do not need to be re-learned per task.

The observation that powers the whole field: early representations are general; only late ones are task-specific. So learn the general part once, on the biggest dataset available, and spend your scarce labels only on the specific part.

02 · The mechanism: Pretrain, then adapt

Take a network pretrained on a large source task (ImageNet, the open web — see pretraining). Two recipes for the target task:

Freeze and probe
Keep the pretrained weights fixed; train only a new head (often a linear layer) on the frozen features. Few parameters → very little data needed, no risk of destroying the representation. Ceiling: the features were tuned for someone else's task.
Full finetune
Continue training everything at a small learning rate. Higher ceiling, but each step can overwrite general features (catastrophic forgetting), and with little data it overfits — see finetuning.
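The two recipes can be contrasted in a framework-free sketch. Everything here is an illustrative assumption rather than a reference implementation: the toy model (a two-unit tanh "backbone" W plus a linear head v), the stand-in "pretrained" weights W_pre, the synthetic target, and the helper names forward, mse, and adapt. The only difference between the two recipes is whether the backbone receives gradient updates.

```python
import math
import random

def forward(W, v, x):
    """Backbone h = tanh(W x), followed by a linear head y = v . h."""
    h = [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in W]
    return h, sum(vi * hi for vi, hi in zip(v, h))

def mse(W, v, data):
    """Mean squared error of the model over (input, target) pairs."""
    return sum((forward(W, v, x)[1] - t) ** 2 for x, t in data) / len(data)

def adapt(W, v, data, lr, epochs, train_backbone):
    """Plain SGD on squared error. Works on copies, so the inputs stay intact."""
    W, v = [row[:] for row in W], v[:]
    for _ in range(epochs):
        for x, t in data:
            h, y = forward(W, v, x)
            err = y - t  # dL/dy for L = 0.5 * (y - t)^2
            grad_v = [err * hi for hi in h]
            if train_backbone:  # full finetune: the backbone moves too
                for i, row in enumerate(W):
                    g = err * v[i] * (1.0 - h[i] ** 2)  # chain rule through tanh
                    for j in range(len(row)):
                        row[j] -= lr * g * x[j]
            v = [vi - lr * gi for vi, gi in zip(v, grad_v)]  # the head always trains
    return W, v

# Assumption: a toy "pretrained" backbone and a target task built on the same
# features, i.e. the "domain close to source" case.
W_pre = [[1.0, 0.5], [-0.5, 1.0]]

def target(x):
    return 0.8 * math.tanh(x[0] + 0.5 * x[1]) - 0.3 * math.tanh(-0.5 * x[0] + x[1])

random.seed(0)
inputs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(20)]
data = [(x, target(x)) for x in inputs]
new_head = [0.0, 0.0]

start_loss = mse(W_pre, new_head, data)
W_probe, v_probe = adapt(W_pre, new_head, data, lr=0.05, epochs=200, train_backbone=False)
W_ft, v_ft = adapt(W_pre, new_head, data, lr=0.05, epochs=200, train_backbone=True)
probe_loss = mse(W_probe, v_probe, data)
ft_loss = mse(W_ft, v_ft, data)

print("backbone unchanged after probe:   ", W_probe == W_pre)
print("backbone unchanged after finetune:", W_ft == W_pre)
print("losses (start, probe, finetune):  ", start_loss, probe_loss, ft_loss)
```

In this toy setting the frozen features already fit the target, so the probe has little to lose. The point of the sketch is only the mechanics: the probe leaves W_pre untouched, while the full finetune moves it.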

The small learning rate is not a detail. Pretrained weights are an initialization that already sits near a good solution; large steps throw that inheritance away. Middle paths exist: unfreeze only the top blocks, or use parameter-efficient methods like LoRA, which train a small set of added parameters while the pretrained weights stay frozen.
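As one illustration of a parameter-efficient middle path, here is a minimal sketch of the LoRA idea for a single weight matrix: the frozen weight W0 is augmented with a low-rank product B A, scaled by alpha / r, and B starts at zero so the adapted layer initially reproduces the pretrained one exactly. The dimensions, the alpha value, and the helper names matvec, lora_forward, and trainable_params are assumptions chosen for the example.

```python
import random

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha):
    """y = W0 x + (alpha / r) * B (A x). W0 is frozen; only A and B would train."""
    r = len(A)  # rank of the update = number of rows in A
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """Trainable weights for one d_out x d_in matrix under each recipe."""
    return {"full_finetune": d_out * d_in, "lora": r * (d_in + d_out)}

random.seed(0)
d_out, d_in, r = 3, 4, 2  # toy sizes, assumed for illustration
W0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d_out)]                                 # zero init
x = [1.0, -2.0, 0.5, 3.0]

# With B = 0 the low-rank update vanishes, so training starts at the pretrained function.
print(lora_forward(W0, A, B, x, alpha=4) == matvec(W0, x))

counts = trainable_params(d_out=4096, d_in=4096, r=8)
print(counts, counts["lora"] / counts["full_finetune"])
```

For a 4096 × 4096 matrix at rank 8, the count is 65,536 trainable weights against 16,777,216 for a full finetune, or 1/256 of the total.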

03 · Choosing: Data size × domain gap

Target data | Domain close to source | Domain far from source
Small | Freeze + linear probe | Freeze most; finetune top blocks carefully (highest-risk cell)
Large | Full finetune, small LR | Full finetune; pretraining still helps as an init, just less

The two axes capture nearly every practical decision: how much can you afford to move the weights (data size), and how much do you need to (domain gap).
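The grid can be written down as a small lookup, which makes the two inputs explicit. This is only a sketch: the function name choose_recipe and, in particular, the small_data_cutoff threshold are hypothetical. The text does not say where "small" ends, and in practice that boundary depends on the model and the task.

```python
def choose_recipe(n_labeled, domain_far, small_data_cutoff=10_000):
    """Map (target data size, domain gap) to a recipe from the 2 x 2 grid.

    small_data_cutoff is a hypothetical placeholder, not a recommended value.
    """
    small = n_labeled < small_data_cutoff
    if small and not domain_far:
        return "Freeze + linear probe"
    if small and domain_far:
        return "Freeze most; finetune top blocks carefully"
    if not domain_far:
        return "Full finetune, small LR"
    return "Full finetune; pretraining is still a useful init, just less so"

# Example: 2,000 labeled scans whose domain is far from the pretraining data.
print(choose_recipe(n_labeled=2_000, domain_far=True))
```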

04 · The tradeoff: Negative transfer

Transfer is a prior, and priors can be wrong. If the source task's structure does not match the target's (natural images → spectrograms, English → protein sequences), the inherited features can steer learning into a worse basin than a random init would. This is negative transfer. It is uncommon with large general-purpose pretraining but real. The diagnostic is simple: train a from-scratch baseline on a subset of the target data, train the pretrained model on the same subset, and compare. If scratch competes, the inheritance is not helping.
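The comparison behind that diagnostic can be sketched as a small helper. It assumes you have already trained both variants on the same data subset, ideally over a few seeds, and collected one validation score per run where higher is better. The name diagnose_transfer, the tolerance parameter, and the scores in the usage lines are made up for illustration and are not measurements.

```python
from statistics import mean

def diagnose_transfer(pretrained_scores, scratch_scores, tolerance=0.01):
    """Compare mean validation scores (higher is better) from matched runs.

    tolerance is a hypothetical noise margin; pick it from your seed-to-seed variance.
    """
    gain = mean(pretrained_scores) - mean(scratch_scores)
    if gain < -tolerance:
        return "negative transfer"      # the pretrained init is hurting
    if gain <= tolerance:
        return "no measurable benefit"  # scratch competes
    return "transfer helps"

# Placeholder scores, invented purely to show the call:
print(diagnose_transfer([0.80, 0.82], [0.60, 0.62]))
print(diagnose_transfer([0.60], [0.70]))
```

Running several seeds per arm matters here, because a single run on a small subset can differ by more than the gap being measured.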

The modern default. Pretrain-then-adapt stopped being a technique and became the organising pattern of the field; "from scratch" is now the exception that needs justifying. Related: few-shot / zero-shot learning, where adaptation shrinks to a prompt.