General ML

Clustering (k-means and friends)

Finding groups when nobody tells you the groups

01 · First principlesWhat structure are we exploiting?

Clustering exists because real data is not uniform: points arrive in lumps, because they were generated by a small number of distinct processes (customer types, cell types, topics). With no labels available, the only usable evidence of those processes is geometry — points from the same process tend to sit near each other. Clustering is the bet that proximity in feature space is a proxy for shared origin (a bet that fails exactly when the features are scaled badly or irrelevant — the same disease that afflicts k-NN).

This is the canonical unsupervised task: there is no answer key, so every algorithm must first decide what "a group" even means, and each decides differently.

02 · k-meansThe objective, then the algorithm

k-means defines a group as a blob around a centre. Choose k centroids μ1…μk and an assignment of each point to one centroid, to minimise the within-cluster variance:

J  =  Σj=1..k Σx ∈ Cj ‖x − μj‖²
over clusters points in cluster j

Minimising J jointly over assignments and centroids is NP-hard, so we do coordinate descent: hold one block fixed, optimise the other, alternate.

  1. Assign: with centroids fixed, each point's optimal move is to its nearest centroid (each term of J is minimised independently).
  2. Update: with assignments fixed, the optimal centroid is the mean of its points (the mean minimises summed squared distance).
  3. Repeat. Each step can only lower J, and J is bounded below, so the loop converges — but only to a local minimum.
The point worth remembering: k-means is not a heuristic that happens to work. It is exact coordinate descent on a stated objective. Knowing the objective tells you precisely when it must fail.

03 · Failure modesWhat the objective cannot see

Every blind spot of k-means is readable straight off J:

WHAT K-MEANS SEES WHAT THE DATA IS centroids split both stripes true clusters are the stripes

Isotropic squared distance prefers round blobs; two parallel elongated clusters get cut across, not along.

04 · The friendsGMM, DBSCAN, hierarchical

Gaussian mixture models are k-means with the hard edges softened. Each cluster becomes a Gaussian with its own covariance, each point gets a responsibility (a probability of membership), and EM alternates exactly as k-means does:

E-step: rij = p(cluster j | xi)   ·   M-step: refit each (πj, μj, Σj) weighted by rij

k-means is the limit of EM on a GMM as the covariances shrink to εI: responsibilities harden into nearest-centroid assignments. The covariances are what buy you elliptical clusters.

DBSCAN changes the definition of a group: a cluster is a connected region of high density, grown outward from "core" points that have at least minPts neighbours within radius ε. It finds arbitrary shapes (rings, spirals), discovers k by itself, and labels sparse points as noise — at the cost of two new hyperparameters and trouble when clusters have very different densities.

Hierarchical (agglomerative) clustering refuses to pick k at all: start with every point as its own cluster, repeatedly merge the closest pair, and record the merge tree. You cut the dendrogram at whatever level the application needs. The natural choice when the data genuinely has a taxonomy (species, organisational structure).

05 · Choosing kHonestly

Two standard tools, neither of which is an oracle:

Honesty clause: if neither plot shows a clear winner, the data likely does not contain k crisp clusters, and the most defensible answer is "the clustering is a useful quantisation, not a discovered truth." Validate against a downstream task whenever one exists.
Mental Model