General ML

K-Nearest Neighbours

No training, all inference — the model is the data

01 · First principlesThe laziest possible learner

Every learner exploits some assumed structure in the data. k-NN exploits the weakest one imaginable: smoothness — nearby inputs tend to share labels. If you believe only that and nothing else, the honest algorithm writes itself: store every training point, and to predict at x, find the k stored points nearest to x and let them vote (or average, for regression).

There is no training phase at all. No parameters are fit, no loss is minimised; the "model" is the training set itself plus a distance function. All the cost moves to inference, where a naive query scans all n points. This inversion (zero work up front, full work per query) is the opposite of a parametric model like an SVM, which spends effort once to compress the data into a few parameters.

One sentence: k-NN is the claim that memory plus a good notion of similarity is already a classifier.

02 · The knobk is the bias–variance dial

k-NN puts the bias–variance tradeoff on a single integer, which makes it the cleanest teaching example in the field.

k = 1 · memorisation
Every training point is its own island; the decision boundary wraps around individual points, including mislabelled ones. Zero training error, maximal variance: redraw the dataset and the boundary changes everywhere.
k = n · the majority oracle
Every query gets the same answer: the global majority class. The boundary disappears entirely. Maximal bias, zero variance — the model has stopped looking at x.

Sliding k from 1 to n smooths the boundary continuously between these extremes. Larger k averages over more neighbours, suppressing noise (variance down) while blurring genuine fine structure (bias up). Choose k by cross-validation; odd k avoids ties in binary problems.

k = 1 · JAGGED, WRAPS NOISE k = 15 · SMOOTH

Same data, two values of k. At k = 1 a single mislabelled red point earns its own enclave; at k = 15 its neighbours outvote it.

03 · The real modelDistance is everything

People say k-NN "has no model." It does — the model is the distance function, and it is supplied by you, untrained. Two consequences follow immediately:

d(x, x′) = ( Σj=1..d wj (xj − x′j)² )1/2    — every choice of wj is a different model

04 · The nemesisHigh dimensions

The smoothness bet requires that "nearest" neighbours are actually near. In high dimensions they are not: volume concentrates so that all pairwise distances become nearly equal, and the ratio of nearest to farthest neighbour distance drifts toward 1. To capture even a small fixed fraction of the data within a neighbourhood, the neighbourhood's radius must grow until "local" is meaningless. The voting neighbours are then no closer to the query than random points, and k-NN degrades to guessing the prior.

This is the curse of dimensionality in its purest form, and k-NN is its most direct victim because k-NN has nothing else to fall back on — no learned weights, no margins, no feature selection.

05 · RedemptionEmbeddings made k-NN modern

The fix was not to repair k-NN but to repair the space it operates in. Deep networks learn embeddings in which semantic similarity is geometric proximity — and in such a space, nearest-neighbour lookup is exactly the right algorithm again. Add approximate nearest-neighbour indexes (HNSW, IVF, product quantisation) to cut query cost from O(n) to roughly O(log n), and the lazy learner becomes infrastructure:

The arc: as a classifier on raw tabular features, k-NN is a teaching tool. As the lookup step on top of learned representations, it quietly runs a large fraction of production ML.
Mental Model