General ML

K-Nearest Neighbours

No training, all inference — the model is the data

01 · First principlesThe laziest possible learner

Every learner exploits some assumed structure in the data. k-NN exploits the weakest one imaginable: smoothness — nearby inputs tend to share labels. If you believe only that and nothing else, the honest algorithm writes itself: store every training point, and to predict at x, find the k stored points nearest to x and let them vote (or average, for regression).

There is no training phase at all. No parameters are fit, no loss is minimised; the "model" is the training set itself plus a distance function. All the cost moves to inference, where a naive query scans all n points. This inversion (zero work up front, full work per query) is the opposite of a parametric model like an SVM, which spends effort once to compress the data into a few parameters.

One sentence: k-NN is the claim that memory plus a good notion of similarity is already a classifier.

02 · The knobk is the bias–variance dial

k-NN puts the bias–variance tradeoff on a single integer, which makes it the cleanest teaching example in the field.

k = 1 · memorisation

Every training point is its own island; the decision boundary wraps around individual points, including mislabelled ones. Zero training error, maximal variance: redraw the dataset and the boundary changes everywhere.

k = n · the majority oracle

Every query gets the same answer: the global majority class. The boundary disappears entirely. Maximal bias, zero variance — the model has stopped looking at x.

Sliding k from 1 to n smooths the boundary continuously between these extremes. Larger k averages over more neighbours, suppressing noise (variance down) while blurring genuine fine structure (bias up). Choose k by cross-validation; odd k avoids ties in binary problems.

Same data, two values of k. At k = 1 a single mislabelled red point earns its own enclave; at k = 15 its neighbours outvote it.

03 · The real modelDistance is everything

People say k-NN "has no model." It does — the model is the distance function, and it is supplied by you, untrained. Two consequences follow immediately:

Scaling matters absolutely. Euclidean distance adds raw squared coordinate differences, so a feature measured in grams swamps one measured in kilometres regardless of which is informative. Standardise every feature (zero mean, unit variance) before anything else; with k-NN this is not hygiene, it is the model specification.
Irrelevant features poison it. Each useless dimension adds noise to every distance. Add enough of them and the nearest neighbours become effectively random, while a decision tree would simply never split on those features. k-NN has no mechanism for ignoring anything.

d(x, x′) = ( Σ_j=1..d w_j (x_j − x′_j)² )^1/2 — every choice of w_j is a different model

04 · The nemesisHigh dimensions

The smoothness bet requires that "nearest" neighbours are actually near. In high dimensions they are not: volume concentrates so that all pairwise distances become nearly equal, and the ratio of nearest to farthest neighbour distance drifts toward 1. To capture even a small fixed fraction of the data within a neighbourhood, the neighbourhood's radius must grow until "local" is meaningless. The voting neighbours are then no closer to the query than random points, and k-NN degrades to guessing the prior.

This is the curse of dimensionality in its purest form, and k-NN is its most direct victim because k-NN has nothing else to fall back on — no learned weights, no margins, no feature selection.

05 · RedemptionEmbeddings made k-NN modern

The fix was not to repair k-NN but to repair the space it operates in. Deep networks learn embeddings in which semantic similarity is geometric proximity — and in such a space, nearest-neighbour lookup is exactly the right algorithm again. Add approximate nearest-neighbour indexes (HNSW, IVF, product quantisation) to cut query cost from O(n) to roughly O(log n), and the lazy learner becomes infrastructure:

Retrieval and RAG: embed documents, embed the query, take the k nearest — modern search and retrieval-augmented generation are k-NN over learned vectors.
Recommendation, deduplication, face recognition: all nearest-neighbour queries in an embedding space.
Zero-shot classification: nearest label embedding (the CLIP pattern — see few-shot / zero-shot).

The arc: as a classifier on raw tabular features, k-NN is a teaching tool. As the lookup step on top of learned representations, it quietly runs a large fraction of production ML.

Mental Model

k-NN assumes only smoothness: nearby points share labels. Store everything, vote among the k nearest.
k is the bias–variance dial: k = 1 memorises (variance), k = n predicts the majority (bias).
The distance function is the model. Scale your features; irrelevant dimensions poison every distance.
High dimensions kill it — all points become equidistant and "nearest" stops meaning anything.
Learned embeddings + approximate NN resurrect it as retrieval: RAG is k-NN wearing a transformer.