General ML

No Free Lunch Theorem

Why no learner wins everywhere, and why learning works regardless

01 · The claimAveraged over all worlds, everyone ties

The first-principles question: is there a universally best learning algorithm — one that beats the others no matter what function generates the data? Wolpert's answer is a flat no, and stronger than intuition expects:

No Free Lunch: averaged uniformly over all possible target functions, every learning algorithm has exactly the same off-training-set generalisation performance. Gradient boosting, a deep network, and "predict by hashing the input" tie. So does predicting the opposite of your model.

Note the precise scope: performance on points outside the training set, averaged over every function from inputs to labels, each weighted equally. Both italicised phrases carry the whole theorem.

02 · Why it must be trueA random world has no pattern to exploit

The proof idea fits in a paragraph. Fix any training set you like. For every target function that agrees with your training data and continues one way off it, there is another function agreeing with the same training data and continuing every other way — and uniform averaging weights them identically. Whatever your algorithm predicts at an unseen point, exactly half the consistent worlds say it is right and half say it is wrong (in the binary case). The training data constrains nothing beyond itself.

Ef∼uniform[ generalisation(A, f) ]  is the same for every algorithm A

In a world drawn uniformly at random, the past is literally uninformative about the future at new inputs. Learning is impossible there — not hard, impossible — and all algorithms are equal the way all lottery strategies are equal.

03 · The resolutionThe real world is not uniform over functions

Yet learning plainly works — models trained on Tuesday predict Wednesday. The theorem and the practice are reconciled by its premise: nature does not draw target functions uniformly. The functions we actually meet are an astronomically thin, heavily structured slice:

Every algorithm's inductive bias — its architecture, its regulariser, its preference among hypotheses that fit the data equally — is a bet that the world has one of these structures. NFL says the bet cannot be free: any bias that helps on structured worlds must hurt on the (unstructured) rest. Learning works because we keep playing in the casino where our bets happen to match the house.

One sentence to keep: generalisation is never paid for by data alone; it is paid for by data plus a correct assumption about the world, and NFL is the theorem that forbids skipping the second payment.

04 · ConsequencesMatch the bias to the structure

"No universal best model" is not a counsel of despair; it is a design instruction. The practical craft of ML is choosing the algorithm whose bet matches your data's structure:

Data structureMatching inductive biasThe bet it encodes
ImagesCNNsLocality + translation invariance: nearby pixels relate; position does not change identity.
Sequences / languageTransformers, RNNsOrder matters; long-range dependencies exist and are sparse enough to attend to.
Graphs / moleculesGNNsInformation flows along edges; node identity is permutation-invariant.
Tabular, heterogeneous columnsGradient-boosted treesAxis-aligned splits and feature interactions; no smoothness across unrelated columns.
Small data, strong theoryBayesian models with informative priorsThe structure is known well enough to write down — see MLE vs MAP.

This is also why transformers eating every domain does not refute NFL: attention plus massive data is a broad bias (weak assumptions, high capacity) purchased with enormous sample cost — exactly the trade the bias–variance note prices.

05 · HumilityWhat the theorem does to leaderboards

A benchmark is a sample from one structured corner of function space. "SOTA on the benchmark" therefore means "this bias matches this corner", and licenses no claim about other corners — a fact rediscovered every time a leaderboard champion is deployed against slightly shifted data. NFL is the formal reason that:

Interview-grade summary: NFL says all algorithms tie on average over all possible worlds; practice works because our world is a very particular one, and good ML is the art of encoding which particular one into the model.
Mental Model