Do not re-learn what is already learned
Train an image classifier from scratch on 2,000 medical scans and watch it fail. The reason is not the classifier head; it is that the network spends its tiny data budget rediscovering edges, textures, and shapes, features that nearly every visual task shares. The same holds for text: syntax and word meaning do not need to be relearned per task.
The observation that powers the whole field: representations in early layers are general; only those in late layers are task-specific. So learn the general part once, on the biggest dataset available, and spend your scarce labels only on the specific part.
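Take a network pretrained on a large source task (ImageNet, the open web; see pretraining). Two recipes for the target task: feature extraction, where the pretrained weights stay frozen and only a new head (a linear probe) is trained on top; and finetuning, where all weights are updated with a small learning rate.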
The small learning rate is not a detail. Pretrained weights are an initialization that already sits near a good solution; large steps throw that inheritance away. Middle paths exist: unfreeze only the top blocks, or use parameter-efficient methods like LoRA, which trains small low-rank adapter matrices while the pretrained weights stay frozen.
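To make the difference between freezing and finetuning concrete, here is a minimal sketch in plain Python. Everything in it is a toy assumption made for illustration: the three-unit `tanh` "backbone", the hand-picked `pretrained_backbone` weights, the synthetic `target_data`, and the learning rates. A frozen backbone is modeled simply as a backbone learning rate of zero; finetuning gives the backbone a learning rate ten times smaller than the head's.

```python
import math

def forward(backbone, head, x):
    """Tiny network: hidden_j = tanh(w_j * x), prediction = sum_j v_j * hidden_j."""
    hidden = [math.tanh(w * x) for w in backbone]
    pred = sum(v * h for v, h in zip(head, hidden))
    return hidden, pred

def train(backbone, head, data, head_lr, backbone_lr, epochs=200):
    """Per-sample SGD on squared error. backbone_lr=0.0 freezes the backbone."""
    backbone, head = list(backbone), list(head)  # copy; never mutate the inputs
    for _ in range(epochs):
        for x, y in data:
            hidden, pred = forward(backbone, head, x)
            err = 2.0 * (pred - y)  # derivative of (pred - y)^2 w.r.t. pred
            # Compute both gradients before applying either update.
            grad_head = [err * h for h in hidden]
            grad_backbone = [err * v * (1.0 - h * h) * x
                             for v, h in zip(head, hidden)]
            head = [v - head_lr * g for v, g in zip(head, grad_head)]
            backbone = [w - backbone_lr * g
                        for w, g in zip(backbone, grad_backbone)]
    return backbone, head

def mse(backbone, head, data):
    """Mean squared error of the network on the dataset."""
    return sum((forward(backbone, head, x)[1] - y) ** 2 for x, y in data) / len(data)

# Hypothetical stand-ins: "pretrained" backbone weights and a small target dataset.
pretrained_backbone = [0.5, 1.0, 2.0]
fresh_head = [0.0, 0.0, 0.0]
target_data = [(x / 4.0, math.tanh(1.2 * x / 4.0)) for x in range(-4, 5)]

# Recipe 1: freeze the backbone, train only the new head (linear probe).
probe_backbone, probe_head = train(
    pretrained_backbone, fresh_head, target_data, head_lr=0.05, backbone_lr=0.0)

# Recipe 2: finetune everything, with a much smaller step for inherited weights.
ft_backbone, ft_head = train(
    pretrained_backbone, fresh_head, target_data, head_lr=0.05, backbone_lr=0.005)

print("frozen backbone unchanged:", probe_backbone == pretrained_backbone)
print("probe MSE:   ", mse(probe_backbone, probe_head, target_data))
print("finetune MSE:", mse(ft_backbone, ft_head, target_data))
```

In a real framework the same idea usually appears as disabling gradients for the backbone parameters, or as giving the backbone and the head separate learning rates in the optimizer.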
| Target data | Domain close to source | Domain far from source |
|---|---|---|
| Small | Freeze + linear probe | Freeze most; finetune top blocks carefully (highest-risk cell) |
| Large | Full finetune, small LR | Full finetune; pretraining still helps as an init, just less |
The two axes capture nearly every practical decision: how much can you afford to move the weights (data size), and how much do you need to (domain gap).
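The table can also be written down as a small lookup, which is sometimes handy as a checklist in a training script. This is only a sketch: the function name `choose_recipe` and the string labels `"small"`/`"large"` and `"close"`/`"far"` are assumptions made here, and where the boundary between small and large (or close and far) lies is left to the reader's judgment.

```python
def choose_recipe(target_data: str, domain_gap: str) -> str:
    """Return the recipe from the table for a (data size, domain gap) pair.

    target_data: "small" or "large"; domain_gap: "close" or "far".
    Both label sets are hypothetical conventions for this sketch.
    """
    table = {
        ("small", "close"): "Freeze + linear probe",
        ("small", "far"): "Freeze most; finetune top blocks carefully",
        ("large", "close"): "Full finetune, small LR",
        ("large", "far"): "Full finetune; pretraining still helps as an init, just less",
    }
    key = (target_data, domain_gap)
    if key not in table:
        raise ValueError(f"unknown combination: {key!r}")
    return table[key]

print(choose_recipe("small", "close"))
```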
Transfer is a prior, and priors can be wrong. If the source task's structure does not match the target's (natural images → spectrograms, English → protein sequences), the inherited features can steer learning into a worse basin than a random init would — negative transfer. It is uncommon with large general-purpose pretraining but real; the diagnostic is simple: a from-scratch baseline on a data subset. If scratch competes, the inheritance is not helping.
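The diagnostic can be sketched as a comparison of validation scores from the two runs on the same data subset. The function name `transfer_verdict`, the `margin` threshold, and the scores in the usage lines are all made up for illustration; in practice the margin should reflect the run-to-run noise you actually observe across seeds.

```python
from statistics import mean

def transfer_verdict(scratch_scores, transfer_scores, margin=0.01):
    """Compare validation scores (higher is better) from repeated runs.

    scratch_scores: scores of the from-scratch baseline, one per seed.
    transfer_scores: scores of the pretrained-init runs, one per seed.
    margin: hypothetical threshold below which the gap counts as a tie.
    """
    gap = mean(transfer_scores) - mean(scratch_scores)
    if gap > margin:
        return "transfer helps"
    if gap < -margin:
        return "negative transfer"
    return "scratch competes"

# Illustrative numbers only, not measured results.
print(transfer_verdict([0.70, 0.72, 0.71], [0.80, 0.82, 0.81]))
print(transfer_verdict([0.80, 0.82], [0.70, 0.72]))
```

Averaging over several seeds matters here because a single small-subset run can be noisy enough to flip the verdict.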