Transfer Learning

Do not re-learn what is already learned

01 · First principles · Why training from scratch is wasteful

Train an image classifier from scratch on 2,000 medical scans and watch it fail. The reason is not the classifier head; it is that the network spends its tiny data budget re-discovering edges, textures, and shapes: features that nearly every visual task shares. The same holds for text: syntax and word meaning do not need to be re-learned per task.

The observation that powers the whole field: representations in early layers are general; only those in late layers are task-specific. So learn the general part once, on the biggest dataset available, and spend your scarce labels only on the specific part.

02 · The mechanism · Pretrain, then adapt

Take a network pretrained on a large source task (ImageNet classification, language modelling on the open web; see pretraining). Two recipes for the target task:

Freeze and probe

Keep the pretrained weights fixed; train only a new head (often a linear layer) on the frozen features. Few parameters → very little data needed, no risk of destroying the representation. Ceiling: the features were tuned for someone else's task.
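A minimal PyTorch sketch of freeze-and-probe follows. The `backbone` here is a hypothetical, randomly initialised stand-in for a pretrained feature extractor (in practice it would be loaded from a checkpoint), and the data, layer sizes, and learning rate are illustrative assumptions. The point is only the mechanics: gradients are switched off for the backbone, and the optimizer sees the head's parameters alone.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained feature extractor (assumption:
# a real one would be loaded from a checkpoint, not randomly initialised).
backbone = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 16),
    nn.ReLU(),
)

# Freeze: the backbone parameters receive no gradients and cannot change.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()  # inference mode for dropout / batch-norm layers (none here)

# Probe: a new linear head is the only trainable component.
num_classes = 3
head = nn.Linear(16, num_classes)
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Toy target data (random, for illustration only).
x = torch.randn(40, 32)
y = torch.randint(0, num_classes, (40,))

backbone_before = [p.clone() for p in backbone.parameters()]
head_before = head.weight.detach().clone()

for _ in range(20):
    with torch.no_grad():  # features are treated as constants
        feats = backbone(x)
    loss = loss_fn(head(feats), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Check what moved: the backbone must be bit-identical, the head must not be.
backbone_unchanged = all(
    torch.equal(a, b) for a, b in zip(backbone_before, backbone.parameters())
)
head_changed = not torch.equal(head_before, head.weight)
trainable = sum(p.numel() for p in head.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
```

Only the head's parameters are trainable, a small fraction of `total`, which is why this recipe tolerates very small target datasets.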

Full finetune

Continue training everything at a small learning rate. Higher ceiling, but each step can overwrite general features (catastrophic forgetting), and with little data it overfits — see finetuning.

The small learning rate is not a detail. Pretrained weights are an initialisation that already sits near a good solution; large steps throw that inheritance away. Middle paths exist: unfreeze only the top blocks, or use parameter-efficient methods like LoRA.
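One of the middle paths can be sketched in PyTorch as follows: early blocks frozen, the top block trained at a small learning rate, and the new head at a larger one. `TinyNet` is a hypothetical stand-in for a pretrained network, and the specific learning rates (`1e-5`, `1e-3`) are illustrative assumptions, not recommendations from the text.

```python
import torch
from torch import nn

torch.manual_seed(0)


class TinyNet(nn.Module):
    """Hypothetical stand-in for a pretrained network with three blocks."""

    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(32, 32), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(32, 32), nn.ReLU())
        self.block3 = nn.Sequential(nn.Linear(32, 32), nn.ReLU())
        self.head = nn.Linear(32, 3)  # new, task-specific layer

    def forward(self, x):
        return self.head(self.block3(self.block2(self.block1(x))))


model = TinyNet()

# Freeze the early (general) blocks; leave the top block and head trainable.
for block in (model.block1, model.block2):
    for p in block.parameters():
        p.requires_grad = False

# Per-group learning rates: small steps for inherited weights,
# larger steps for the freshly initialised head. Values are illustrative.
optimizer = torch.optim.AdamW([
    {"params": model.block3.parameters(), "lr": 1e-5},
    {"params": model.head.parameters(), "lr": 1e-3},
])

# One training step on toy data (random, for illustration only).
x = torch.randn(16, 32)
y = torch.randint(0, 3, (16,))
frozen_before = model.block1[0].weight.clone()
top_before = model.block3[0].weight.detach().clone()

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

frozen_same = torch.equal(frozen_before, model.block1[0].weight)
frozen_has_no_grad = model.block1[0].weight.grad is None
top_step = (model.block3[0].weight.detach() - top_before).abs().max().item()
```

After one step, the frozen block is untouched and the top block has moved by roughly its learning rate at most, which is the sense in which a small learning rate keeps the weights close to the pretrained solution.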

03 · Choosing · Data size × domain gap

Target data	Domain close to source	Domain far from source
Small	Freeze + linear probe	Freeze most; finetune top blocks carefully (highest-risk cell)
Large	Full finetune, small LR	Full finetune; pretraining still helps as an init, just less

The two axes capture nearly every practical decision: how much can you afford to move the weights (data size), and how much do you need to (domain gap).
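The table can be read as a lookup on the two axes. The sketch below encodes it directly; the function name `choose_recipe` and the category labels are hypothetical, and the thresholds for "small" versus "large" or "close" versus "far" are deliberately left to the caller, since the text does not fix them.

```python
def choose_recipe(data: str, gap: str) -> str:
    """Return the adaptation recipe for a (data size, domain gap) cell.

    `data` is "small" or "large"; `gap` is "close" or "far".
    Both labels are judgment calls made by the caller.
    """
    table = {
        ("small", "close"): "freeze + linear probe",
        ("small", "far"): "freeze most; finetune top blocks carefully",
        ("large", "close"): "full finetune, small LR",
        ("large", "far"): "full finetune; pretraining as init",
    }
    try:
        return table[(data, gap)]
    except KeyError:
        raise ValueError(f"unknown cell: data={data!r}, gap={gap!r}") from None


recipe = choose_recipe("small", "far")  # the highest-risk cell
```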

04 · The tradeoff · Negative transfer

Transfer is a prior, and priors can be wrong. If the source task's structure does not match the target's (natural images → spectrograms, English → protein sequences), the inherited features can steer learning into a worse basin than a random init would. This is negative transfer. It is uncommon with large general-purpose pretraining but real, and the diagnostic is simple: train the same architecture from scratch on a subset of the target data and compare it with the pretrained model on the same held-out set. If scratch competes, the inheritance is not helping.
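The comparison behind the scratch-baseline diagnostic might look like the sketch below. The function name `transfer_verdict`, the `margin` tolerance, and the accuracy values are all hypothetical: the numbers are made up for illustration, not measurements, and the margin should reflect the seed-to-seed noise of the actual setup.

```python
from statistics import mean


def transfer_verdict(pretrained_acc, scratch_acc, margin=0.01):
    """Compare held-out accuracy of pretrained vs from-scratch runs.

    Each argument is a list of accuracies, one per random seed, measured
    on the same data subset and the same held-out set. `margin` is an
    assumed tolerance for seed-to-seed noise.
    """
    gap = mean(pretrained_acc) - mean(scratch_acc)
    if gap > margin:
        return "transfer helps"
    if gap < -margin:
        return "negative transfer"
    return "no clear benefit"  # scratch competes: inheritance is not helping


# Illustrative, made-up accuracies (three seeds each).
verdict_a = transfer_verdict([0.81, 0.79, 0.80], [0.62, 0.60, 0.61])
verdict_b = transfer_verdict([0.70, 0.71, 0.69], [0.74, 0.75, 0.76])
```

Averaging over several seeds matters here because, on a small subset, a single run of either model can differ from another by more than the effect being measured.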

The modern default. Pretrain-then-adapt stopped being a technique and became the organising pattern of the field; "from scratch" is now the exception that needs justifying. Related: few-shot / zero-shot learning, where adaptation shrinks to a prompt.

Mental Model

Early layers learn the world; late layers learn the task. Buy the world pretrained, pay only for the task.
Two dials: how much data you have (can you afford to move weights) and how far your domain is (must you move them).
Small learning rates protect the inheritance; freezing protects it absolutely at the cost of a ceiling.
Transfer is a prior. Validate it against a scratch baseline; priors can be wrong.