Of all the boundaries that work, pick the one with room to spare
Take a linearly separable dataset. Infinitely many hyperplanes separate it, and every one of them achieves zero training error. The perceptron stops at whichever it stumbles into first; nothing in the training data prefers one over another. So the question that forces SVMs to exist is: which separator should we trust on points we have not seen?
The SVM's answer is the hyperplane that maximises the margin — the distance to the closest training point on either side. A boundary that skims past training points by a hair will misclassify their slightly-perturbed cousins; a boundary with a wide buffer tolerates noise. Maximising the margin makes the choice unique, and it makes it the most robust one available.
Only the circled points touch the margin. Delete every other point and retrain: the boundary does not move.
Scale w, b so that the closest points satisfy |wTx + b| = 1. The margin width is then 2/‖w‖, so maximising the margin is minimising ‖w‖²:
At the optimum, only the points whose constraints are tight — the ones sitting exactly on the margin — carry nonzero Lagrange multipliers αi. These are the support vectors, and the solution is literally built from them alone: w = Σ αi yi xi. Most of the dataset contributes nothing. The model is a sparse summary of the boundary region, which is why SVMs were the favourite when datasets were small and every point was scrutinised.
Real data is not separable, and the hard constraint above then has no feasible solution — one mislabelled point destroys the whole program. The fix is to let points buy their way across the margin with slack ξi ≥ 0, at a price:
C is a bias–variance knob: large C punishes every violation, hugging the data (low bias, high variance); small C tolerates violations for a wider, calmer margin. Eliminating ξ gives the unconstrained form
which reveals the SVM as ordinary regularised loss minimisation with the hinge loss: zero penalty once a point is correctly classified with room to spare, linear penalty otherwise. (Logistic regression differs only in swapping the hinge for log-loss; the kink at 1 is what creates exact sparsity in the support vectors.)
Linear boundaries are weak. The classical remedy — map x to a high-dimensional feature vector φ(x) and separate there — seems to cost the dimension of φ in both compute and memory. The escape comes from the dual program:
The data appears only through pairwise dot products — never as raw coordinates. So replace every ⟨xi, xj⟩ with a kernel k(xi, xj) = ⟨φ(xi), φ(xj)⟩ that computes the inner product in feature space without ever constructing φ. The RBF kernel exp(−γ‖x − x′‖²) corresponds to an infinite-dimensional φ, yet costs one subtraction and one exponential. Prediction likewise needs only kernels against the support vectors. You work in an implicit space of arbitrary dimension and pay only n² kernel evaluations.
A kernel is a fixed similarity function chosen before seeing the task. Deep networks learn the feature map φ itself from data, end to end — and given enough data, learned features beat any hand-picked kernel, especially on images and text where raw-input similarity is nearly meaningless. The n² kernel matrix also makes classical SVMs awkward past a few hundred thousand points. Kernels lost on representation, then lost again on scale.
| Regime | Verdict | Why |
|---|---|---|
| Small, clean tabular data (10²–10⁴ rows) | SVM strong | Margin maximisation squeezes the most out of few points; convex, reproducible, few hyperparameters. |
| Medium tabular | Trees usually win | Gradient boosting handles mixed types and interactions with less tuning. |
| Images, text, audio at scale | SVM loses | Fixed kernels cannot compete with learned representations (CNNs, transformers); n² does not scale. |