The model is fine; the world moved
Every supervised guarantee rests on one premise: training and deployment data come from the same distribution. Nothing enforces it. Cameras change, users drift, hospitals differ, slang evolves. The model keeps emitting confident predictions on inputs it was never representative of — and unlike a crash, this failure produces no error message. Accuracy decays quietly until someone looks.
Domain adaptation is the study of what to do when the source distribution (where labels are) and the target distribution (where the model runs) differ.
Write the joint as p(x, y) = p(y|x)·p(x). Each factor can move independently, and the remedy depends on which one did:
| Shift | What changes | Example | Diagnostic | Remedy |
|---|---|---|---|---|
| Covariate | p(x); p(y|x) intact | New camera, new demographic | Train a classifier to tell source from target inputs — if it can, p(x) moved | Importance weighting; finetune on target inputs |
| Label | p(y); p(x|y) intact | Disease prevalence rises | Compare predicted label frequencies to target reality | Reweight / recalibrate the prior |
| Concept | p(y|x) itself | "Spam" changes meaning; fraud adapts | Labelled target data only — nothing cheaper detects it | Retrain; no reweighting can save a changed truth |
The table rewards memorising: the first two shifts can in principle be corrected without new labels; concept shift cannot, because the function being learned is no longer the same function.
When the target has inputs but no labels (unsupervised DA), the classic idea is to make the learned features domain-invariant: if a discriminator cannot tell which domain a feature vector came from, a classifier trained on source features should transfer.
DANN implements this adversarially with a gradient-reversal layer: a domain classifier tries to separate domains, while the feature extractor receives its gradient negated — it learns to mix the domains together while still serving the label classifier.
In production, the boring pipeline usually beats the clever algorithm. The hierarchy that works: monitor (track input statistics and prediction distributions, alert on drift), collect (label a small stream of recent target data), retrain (regularly, on a window that reflects today's distribution). Pretrained foundations made this cheaper still — adapting a robust general representation with a little target data (see transfer learning) routinely outperforms purpose-built adaptation machinery.
The durable skill is not any single method. It is asking, whenever a deployed model degrades: which factor of p(x, y) moved?