SVMs

Of all the boundaries that work, pick the one with room to spare

01 · First principles: The problem of too many answers

Take a linearly separable dataset. Infinitely many hyperplanes separate it, and every one of them achieves zero training error. The perceptron stops at whichever it stumbles into first; nothing in the training data prefers one over another. So the question that forces SVMs to exist is: which separator should we trust on points we have not seen?
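To make the ambiguity concrete, here is a minimal sketch of a perceptron run twice on a four-point toy dataset. The points and the two visiting orders are invented for illustration; the only thing that changes between the runs is the order in which the points are presented.

```python
# Toy data, invented for illustration: two positives, two negatives.
POINTS = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
LABELS = [1, 1, -1, -1]


def score(w, b, x):
    """Signed score w.x + b of point x."""
    return w[0] * x[0] + w[1] * x[1] + b


def perceptron(order, max_epochs=100):
    """Classic perceptron: update on each mistake, stop after a clean pass."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in order:
            x, y = POINTS[i], LABELS[i]
            if y * score(w, b, x) <= 0:  # wrong side, or exactly on the boundary
                w = [w[0] + y * x[0], w[1] + y * x[1]]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def training_errors(w, b):
    """Number of training points that are not strictly on their own side."""
    return sum(1 for x, y in zip(POINTS, LABELS) if y * score(w, b, x) <= 0)


w_a, b_a = perceptron(order=[0, 1, 2, 3])
w_b, b_b = perceptron(order=[3, 2, 1, 0])

print("order A:", w_a, b_a, "errors:", training_errors(w_a, b_a))
print("order B:", w_b, b_b, "errors:", training_errors(w_b, b_b))
```

On this data the first order stops at w = (2, 2), b = −1 and the reversed order at w = (1, 0), b = −1. Both have zero training error, and nothing in the perceptron's stopping rule ranks one above the other.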

The SVM's answer is the hyperplane that maximises the margin: the distance from the boundary to the closest training point on either side. A boundary that skims past training points by a hair will misclassify their slightly perturbed cousins; a boundary with a wide buffer tolerates noise. Maximising the margin makes the choice unique, and among all separating hyperplanes it is the one that tolerates the largest perturbation of a training point before misclassifying it.

Only the circled points touch the margin. Delete every other point and retrain: the boundary does not move.

02 · The geometry: Margin, and the few points that matter

Scale w, b so that the closest points satisfy |w^Tx + b| = 1. The margin width is then 2/‖w‖, so maximising the margin is minimising ‖w‖²:

min_w,b ½‖w‖² s.t. y_i(w^Tx_i + b) ≥ 1 for all i
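As a numeric check of the 2/‖w‖ claim, the sketch below takes a four-point toy dataset and a hand-derived maximum-margin solution for it (both are assumptions of the example, not solver output) and compares the margin width with the measured distances to the boundary.

```python
import math

# Toy separable data, invented for illustration.
POINTS = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
LABELS = [1, 1, -1, -1]

# Hand-derived maximum-margin solution for this toy set (an assumption of the
# example, not the output of a solver), scaled so the closest points score +/-1.
W, B = (0.5, 0.5), -1.0


def functional_margin(x, y):
    """y * (w.x + b): at least 1 for every point when the constraints hold."""
    return y * (W[0] * x[0] + W[1] * x[1] + B)


norm_w = math.hypot(W[0], W[1])
margins = [functional_margin(x, y) for x, y in zip(POINTS, LABELS)]

feasible = all(m >= 1.0 for m in margins)   # the hard-margin constraints
distances = [m / norm_w for m in margins]   # geometric distance to the boundary
margin_width = 2.0 / norm_w                 # gap between the two margin lines
objective = 0.5 * norm_w ** 2               # the quantity being minimised

print("functional margins:", margins)
print("feasible:", feasible)
print("margin width 2/||w||:", margin_width)
print("twice the smallest distance:", 2 * min(distances))
print("objective:", objective)
```

The width comes out as 2√2, which is exactly the distance between the closest opposite-class pair (0, 0) and (2, 2). No separator can do better than that, which is why this w and b are optimal for the toy set.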

At the optimum, only the points whose constraints are tight — the ones sitting exactly on the margin — carry nonzero Lagrange multipliers α_i. These are the support vectors, and the solution is literally built from them alone: w = Σ α_i y_i x_i. Most of the dataset contributes nothing. The model is a sparse summary of the boundary region, which is why SVMs were the favourite when datasets were small and every point was scrutinised.
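The sparsity can be checked by hand on a small case. The sketch below uses a four-point toy dataset with hand-derived multipliers α (assumptions of the example, not solver output), rebuilds w once from all points and once from the support vectors only, and verifies that each point either sits on the margin or has α_i = 0.

```python
# Toy separable data, invented for illustration.
POINTS = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
LABELS = [1, 1, -1, -1]

# Hand-derived Lagrange multipliers for this toy set (an assumption of the
# example, not the output of a solver).
ALPHAS = [0.25, 0.0, 0.25, 0.0]


def weights_from(indices):
    """w = sum_i alpha_i * y_i * x_i over the given point indices."""
    w = [0.0, 0.0]
    for i in indices:
        for d in range(2):
            w[d] += ALPHAS[i] * LABELS[i] * POINTS[i][d]
    return w


support = [i for i, a in enumerate(ALPHAS) if a > 0]
w_all = weights_from(range(len(POINTS)))
w_sv = weights_from(support)

# A support vector's constraint is tight, y_s (w.x_s + b) = 1, and y_s is +/-1,
# so b = y_s - w.x_s.
s = support[0]
b = LABELS[s] - (w_sv[0] * POINTS[s][0] + w_sv[1] * POINTS[s][1])

# Complementary slackness: alpha_i * (y_i (w.x_i + b) - 1) = 0 for every i.
slackness = [
    ALPHAS[i] * (LABELS[i] * (w_sv[0] * x[0] + w_sv[1] * x[1] + b) - 1.0)
    for i, x in enumerate(POINTS)
]

print("support vector indices:", support)
print("w from all points:", w_all)
print("w from support vectors only:", w_sv)
print("b:", b)
print("complementary slackness terms:", slackness)
```

Dropping the two points with α_i = 0 leaves w and b unchanged, which is the "delete every other point" claim in miniature.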

03 · Failure → fix: Soft margins and hinge loss

Real data is rarely separable, and the hard constraint above then has no feasible solution: a single mislabelled point in the wrong place makes the whole program infeasible. The fix is to let points buy their way across the margin with slack ξ_i ≥ 0, at a price:

min_w,b,ξ ½‖w‖² + C Σ ξ_i s.t. y_i(w^Tx_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0 for all i
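A small worked example of the slack bookkeeping, under stated assumptions: the toy data below adds one deliberately mislabelled point, and the boundary (W, B) is a fixed candidate chosen for illustration, not the optimum of the soft-margin program. For a fixed boundary, the smallest slack that satisfies each constraint is ξ_i = max(0, 1 − y_i(w^Tx_i + b)).

```python
# Toy data, invented for illustration. The last point sits midway between the
# two positives but carries a negative label, so no hyperplane separates the set.
POINTS = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0), (2.5, 2.5)]
LABELS = [1, 1, -1, -1, -1]

# Candidate boundary chosen for illustration; not claimed to be optimal.
W, B = (0.5, 0.5), -1.0


def slack(x, y):
    """Smallest xi >= 0 that satisfies y (w.x + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (W[0] * x[0] + W[1] * x[1] + B))


slacks = [slack(x, y) for x, y in zip(POINTS, LABELS)]


def soft_margin_objective(C):
    """0.5 * ||w||^2 + C * sum of slacks, evaluated at the candidate boundary."""
    return 0.5 * (W[0] ** 2 + W[1] ** 2) + C * sum(slacks)


print("slacks:", slacks)
for C in (0.1, 1.0, 10.0):
    print("C =", C, "objective =", soft_margin_objective(C))
```

Only the mislabelled point pays (ξ = 2.5), and the price of keeping this boundary scales linearly with C: cheap to ignore the outlier when C is small, expensive when C is large.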

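C is a bias–variance knob: large C punishes every violation, hugging the data (low bias, high variance); small C tolerates violations for a wider, calmer margin (higher bias, lower variance). Eliminating ξ, which at the optimum equals max(0, 1 − y_i(w^Tx_i + b)), and dividing through by C with λ = 1/(2C) gives the unconstrained form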

min_w,b Σ max(0, 1 − y_i(w^Tx_i + b)) + λ‖w‖²

which reveals the SVM as ordinary regularised loss minimisation with the hinge loss: zero penalty once a point is correctly classified with room to spare, linear penalty otherwise. (L2-regularised logistic regression differs only in swapping the hinge for log-loss; the hinge is exactly zero for margins beyond 1, and that flat region is what creates exact sparsity in the support vectors.)
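Because the unconstrained form is just a loss plus a penalty, it can be minimised with generic tools. The sketch below uses plain batch subgradient descent, chosen for brevity and not the method any particular SVM solver uses; the toy data, λ, step size and iteration count are all illustrative assumptions. It also prints the hinge next to the log-loss to show where the two differ.

```python
import math

# Toy separable data, invented for illustration.
POINTS = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
LABELS = [1, 1, -1, -1]


def hinge(margin):
    """Zero once the margin reaches 1, linear below that."""
    return max(0.0, 1.0 - margin)


def log_loss(margin):
    """Logistic loss on the same margin scale; never exactly zero in exact arithmetic."""
    return math.log1p(math.exp(-margin))


def margin(w, b, x, y):
    return y * (w[0] * x[0] + w[1] * x[1] + b)


def objective(w, b, lam):
    """Sum of hinge losses plus lam * ||w||^2."""
    data_term = sum(hinge(margin(w, b, x, y)) for x, y in zip(POINTS, LABELS))
    return data_term + lam * (w[0] ** 2 + w[1] ** 2)


def fit(lam=0.01, step=0.05, iters=500):
    """Batch subgradient descent; returns the best (objective, w, b) seen."""
    w, b = [0.0, 0.0], 0.0
    best = (objective(w, b, lam), list(w), b)
    for _ in range(iters):
        grad_w = [2 * lam * w[0], 2 * lam * w[1]]  # gradient of the L2 term
        grad_b = 0.0
        for x, y in zip(POINTS, LABELS):
            if margin(w, b, x, y) < 1.0:           # only margin violators contribute
                grad_w[0] -= y * x[0]
                grad_w[1] -= y * x[1]
                grad_b -= y
        w = [w[0] - step * grad_w[0], w[1] - step * grad_w[1]]
        b -= step * grad_b
        value = objective(w, b, lam)
        if value < best[0]:                        # subgradient steps can go uphill
            best = (value, list(w), b)
    return best


best_value, w, b = fit()
print("best objective:", best_value, "w:", w, "b:", b)

for m in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print("margin", m, "hinge", hinge(m), "log-loss", round(log_loss(m), 4))
```

Points with margin at least 1 add nothing to the subgradient, so they have no say in where the boundary ends up; under the log-loss every point keeps a small pull, which is why logistic regression has no equivalent of support vectors.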

04 · The magic: The kernel trick, stated precisely

Linear boundaries are weak. The classical remedy — map x to a high-dimensional feature vector φ(x) and separate there — seems to cost the dimension of φ in both compute and memory. The escape comes from the dual program:

max_α Σ α_i − ½ Σ_i,j α_iα_j y_iy_j ⟨x_i, x_j⟩ s.t. 0 ≤ α_i ≤ C, Σ α_i y_i = 0

data enters only here, as dot products
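The dual objective is easy to evaluate by hand on a small case. The sketch below does so for a separable four-point toy set, where the upper bound C never binds and the constraints reduce to α_i ≥ 0 and Σ α_i y_i = 0. The data, the hand-derived maximiser and the comparison candidates are assumptions of the example. Note that the points are touched only inside `inner_product`.

```python
# Toy separable data, invented for illustration.
POINTS = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
LABELS = [1, 1, -1, -1]


def inner(u, v):
    """Ordinary dot product."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alphas, inner_product=inner):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    n = len(POINTS)
    quad = sum(
        alphas[i] * alphas[j] * LABELS[i] * LABELS[j]
        * inner_product(POINTS[i], POINTS[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alphas) - 0.5 * quad


def is_feasible(alphas):
    """Separable-case dual constraints: alpha_i >= 0 and sum_i alpha_i y_i = 0."""
    balance = sum(a * y for a, y in zip(alphas, LABELS))
    return all(a >= 0 for a in alphas) and abs(balance) < 1e-12


# Hand-derived maximiser for this toy set (an assumption of the example).
best = [0.25, 0.0, 0.25, 0.0]
# Other feasible multipliers, picked arbitrarily for comparison.
candidates = [
    [0.2, 0.0, 0.2, 0.0],
    [0.3, 0.0, 0.3, 0.0],
    [0.25, 0.1, 0.25, 0.1],
]

print("dual value at the maximiser:", dual_objective(best))
for c in candidates:
    print(c, "feasible:", is_feasible(c), "dual value:", dual_objective(c))
```

The maximiser gives a dual value of 0.25, matching the primal value ½‖w‖² for w = (0.5, 0.5), as strong duality requires; every other feasible candidate scores lower.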

The data appears only through pairwise dot products, never as raw coordinates. So replace every ⟨x_i, x_j⟩ with a kernel k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ that computes the inner product in feature space without ever constructing φ. The RBF kernel exp(−γ‖x − x′‖²) corresponds to an infinite-dimensional φ, yet each evaluation costs one squared distance in the input space and one exponential. Prediction likewise needs only kernels against the support vectors: f(x) = Σ α_i y_i k(x_i, x) + b. You work in an implicit space of arbitrary dimension and pay at most n² kernel evaluations to set up the training problem.
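The identity k(x, z) = ⟨φ(x), φ(z)⟩ can be verified directly for a kernel whose feature map is small enough to write down. The sketch below uses the homogeneous degree-2 polynomial kernel in two dimensions, whose explicit map is (x₁², √2·x₁x₂, x₂²), and adds an RBF kernel and a kernelised decision function. The function names, the test points and the hand-derived support-vector coefficients are assumptions of the example.

```python
import math


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: (x.z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def poly2_features(x):
    """Explicit feature map matching poly2_kernel in two dimensions."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2); its feature map is never constructed."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def decision_function(x, support_vectors, coefficients, b, kernel):
    """f(x) = sum_i (alpha_i y_i) k(x_i, x) + b, summed over support vectors only."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coefficients)) + b


x, z = (1.0, 2.0), (3.0, 4.0)
implicit = poly2_kernel(x, z)
explicit = sum(a * b for a, b in zip(poly2_features(x), poly2_features(z)))
print("kernel value:", implicit, "explicit inner product:", explicit)
print("rbf(x, z):", rbf_kernel(x, z), "rbf(x, x):", rbf_kernel(x, x))

# Hand-derived linear solution for a toy set: support vectors (2, 2) and (0, 0)
# with alpha_i * y_i = +0.25 and -0.25, and b = -1 (assumptions of the example).
f_value = decision_function(
    (3.0, 3.0), [(2.0, 2.0), (0.0, 0.0)], [0.25, -0.25], -1.0, linear_kernel
)
print("f((3, 3)) with the linear kernel:", f_value)
```

The two inner products agree up to floating-point rounding. Swapping `linear_kernel` for another kernel changes the implied feature space without changing the shape of `decision_function`, although the coefficients would then have to come from training with that kernel.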

Why this is legal: Mercer's condition — any symmetric positive semi-definite k is the inner product of some feature map. You choose the similarity function; the feature space is implied, sight unseen.
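Positive semi-definiteness means that for any finite set of points, the Gram matrix K with K_ij = k(x_i, x_j) satisfies cᵀKc ≥ 0 for every coefficient vector c. The sketch below probes that condition numerically for the RBF kernel and for a symmetric function that is not a valid kernel. The points, the random probes and the counterexample are assumptions of the example, and sampling of this kind can refute positive semi-definiteness but cannot prove it.

```python
import math
import random

# Toy points, invented for illustration.
POINTS = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]


def sq_dist(x, z):
    return sum((a - b) ** 2 for a, b in zip(x, z))


def rbf(x, z, gamma=0.5):
    return math.exp(-gamma * sq_dist(x, z))


def neg_sq_dist(x, z):
    """Symmetric 'similarity' that is NOT positive semi-definite."""
    return -sq_dist(x, z)


def gram(kernel):
    """Matrix of pairwise kernel values over POINTS."""
    return [[kernel(x, z) for z in POINTS] for x in POINTS]


def quad_form(K, c):
    """c^T K c."""
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))


rng = random.Random(0)  # fixed seed so the probes are reproducible
probes = [[rng.uniform(-1.0, 1.0) for _ in POINTS] for _ in range(200)]

K_rbf = gram(rbf)
rbf_min = min(quad_form(K_rbf, c) for c in probes)

# For the invalid kernel a single probe is enough: the all-ones vector sums
# every entry of the Gram matrix, and every off-diagonal entry is negative.
bad_value = quad_form(gram(neg_sq_dist), [1.0] * len(POINTS))

print("smallest c^T K c over the probes, RBF:", rbf_min)
print("c^T K c for negative squared distance, c = ones:", bad_value)
```

The RBF probes never go negative, while the negative squared distance fails on the first try, so no feature map can have it as an inner product.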

05 · The verdict: Why deep learning won, where SVMs still do

A kernel is a fixed similarity function chosen before seeing the task. Deep networks learn the feature map φ itself from data, end to end — and given enough data, learned features beat any hand-picked kernel, especially on images and text where raw-input similarity is nearly meaningless. The n² kernel matrix also makes classical SVMs awkward past a few hundred thousand points. Kernels lost on representation, then lost again on scale.
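The scaling problem is plain arithmetic. As a rough sketch, assuming a dense kernel matrix stored as 64-bit floats (the function name and the sample sizes are illustrative):

```python
def gram_matrix_gib(n_points, bytes_per_entry=8):
    """Memory for a dense n x n kernel matrix, in GiB."""
    return n_points ** 2 * bytes_per_entry / 2 ** 30


for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}: {gram_matrix_gib(n):10.1f} GiB")
```

That is roughly 0.7 GiB at 10⁴ points and about 75 GiB at 10⁵. Practical solvers typically cache parts of the matrix instead of storing all of it, but the number of kernel evaluations still grows quadratically with n.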

Regime	Verdict	Why
Small, clean tabular data (10²–10⁴ rows)	SVM strong	Margin maximisation squeezes the most out of few points; convex, reproducible, few hyperparameters.
Medium tabular	Trees usually win	Gradient boosting handles mixed types and interactions with less tuning.
Images, text, audio at scale	SVM loses	Fixed kernels cannot compete with learned representations (CNNs, transformers); n² does not scale.

Mental Model

Many hyperplanes fit; the max-margin one is the unique choice that buys robustness to perturbation.
The solution is built entirely from the support vectors — the few points touching the margin.
C prices margin violations: the soft-margin SVM is hinge loss + L2 regularisation in disguise.
The dual sees data only through dot products, so a kernel swaps in an implicit high-dimensional space at no dimensional cost.
Fixed kernels lost to learned features at scale; on small clean datasets the SVM remains a first-rate default.