General ML

SVMs

Of all the boundaries that work, pick the one with room to spare

01 · First principlesThe problem of too many answers

Take a linearly separable dataset. Infinitely many hyperplanes separate it, and every one of them achieves zero training error. The perceptron stops at whichever it stumbles into first; nothing in the training data prefers one over another. So the question that forces SVMs to exist is: which separator should we trust on points we have not seen?

The SVM's answer is the hyperplane that maximises the margin — the distance to the closest training point on either side. A boundary that skims past training points by a hair will misclassify their slightly-perturbed cousins; a boundary with a wide buffer tolerates noise. Maximising the margin makes the choice unique, and it makes it the most robust one available.

wᵀx + b = 0 MARGIN = 2 / ‖w‖ ○ = support vectors

Only the circled points touch the margin. Delete every other point and retrain: the boundary does not move.

02 · The geometryMargin, and the few points that matter

Scale w, b so that the closest points satisfy |wTx + b| = 1. The margin width is then 2/‖w‖, so maximising the margin is minimising ‖w‖²:

minw,b  ½‖w‖²   s.t.   yi(wTxi + b) ≥ 1  for all i

At the optimum, only the points whose constraints are tight — the ones sitting exactly on the margin — carry nonzero Lagrange multipliers αi. These are the support vectors, and the solution is literally built from them alone: w = Σ αi yi xi. Most of the dataset contributes nothing. The model is a sparse summary of the boundary region, which is why SVMs were the favourite when datasets were small and every point was scrutinised.

03 · Failure → fixSoft margins and hinge loss

Real data is not separable, and the hard constraint above then has no feasible solution — one mislabelled point destroys the whole program. The fix is to let points buy their way across the margin with slack ξi ≥ 0, at a price:

minw,b  ½‖w‖² + C Σ ξi   s.t.   yi(wTxi + b) ≥ 1 − ξi

C is a bias–variance knob: large C punishes every violation, hugging the data (low bias, high variance); small C tolerates violations for a wider, calmer margin. Eliminating ξ gives the unconstrained form

minw,b  Σ max(0, 1 − yi(wTxi + b)) + λ‖w‖²

which reveals the SVM as ordinary regularised loss minimisation with the hinge loss: zero penalty once a point is correctly classified with room to spare, linear penalty otherwise. (Logistic regression differs only in swapping the hinge for log-loss; the kink at 1 is what creates exact sparsity in the support vectors.)

04 · The magicThe kernel trick, stated precisely

Linear boundaries are weak. The classical remedy — map x to a high-dimensional feature vector φ(x) and separate there — seems to cost the dimension of φ in both compute and memory. The escape comes from the dual program:

maxα  Σ αi − ½ Σi,j αiαj yiyj ⟨xi, xj
data enters only here, as dot products

The data appears only through pairwise dot products — never as raw coordinates. So replace every ⟨xi, xj⟩ with a kernel k(xi, xj) = ⟨φ(xi), φ(xj)⟩ that computes the inner product in feature space without ever constructing φ. The RBF kernel exp(−γ‖x − x′‖²) corresponds to an infinite-dimensional φ, yet costs one subtraction and one exponential. Prediction likewise needs only kernels against the support vectors. You work in an implicit space of arbitrary dimension and pay only n² kernel evaluations.

Why this is legal: Mercer's condition — any symmetric positive semi-definite k is the inner product of some feature map. You choose the similarity function; the feature space is implied, sight unseen.

05 · The verdictWhy deep learning won, where SVMs still do

A kernel is a fixed similarity function chosen before seeing the task. Deep networks learn the feature map φ itself from data, end to end — and given enough data, learned features beat any hand-picked kernel, especially on images and text where raw-input similarity is nearly meaningless. The n² kernel matrix also makes classical SVMs awkward past a few hundred thousand points. Kernels lost on representation, then lost again on scale.

RegimeVerdictWhy
Small, clean tabular data (10²–10⁴ rows)SVM strongMargin maximisation squeezes the most out of few points; convex, reproducible, few hyperparameters.
Medium tabularTrees usually winGradient boosting handles mixed types and interactions with less tuning.
Images, text, audio at scaleSVM losesFixed kernels cannot compete with learned representations (CNNs, transformers); n² does not scale.
Mental Model