The matrix that knows the shape of the bowl
The gradient of a loss tells you which way is down, and nothing more. It cannot distinguish a narrow ravine from a wide bowl, a minimum from a saddle ahead, or how far the descent remains trustworthy. All of that is curvature — second-derivative information — and it lives in the Hessian, the symmetric matrix Hij = ∂²L/∂θᵢ∂θⱼ. Locally:
Equivalently, H is the Jacobian of the gradient. Because H is symmetric, the spectral theorem applies: orthonormal eigendirections with real eigenvalues, and each λᵢ is the curvature of the loss along its eigendirection — the local landscape, fully described by one spectrum.
At a critical point (∇L = 0), the eigenvalue signs are the whole verdict:
| Spectrum of H | Geometry | Verdict |
|---|---|---|
| all λ > 0 | bowl in every direction (positive definite) | local minimum |
| all λ < 0 | dome in every direction | local maximum |
| mixed signs | bowl some ways, dome others | saddle point |
| some λ = 0 | flat valley directions | degenerate — the test is silent there |
Now the high-dimensional punchline. A minimum requires every one of n eigenvalues to be positive; if signs were even roughly coin-flips, that costs 2⁻ⁿ. In million-dimensional loss surfaces, almost every critical point is a saddle, and the classical fear of "getting stuck in a bad local minimum" largely gives way to "crawling slowly past saddles" — where the escape direction exists (the negative-λ eigenvector) but the gradient there is tiny. The λ = 0 row matters too: overparameterised networks have huge flat subspaces at their minima (the null space picture), which is why "the minimum" is really a connected valley.
Away from critical points, the Hessian decides how fast gradient descent can go. On a quadratic bowl, the error along each eigendirection contracts by |1 − ηλᵢ| per step. Stability demands η < 2/λ_max; progress along the flattest direction then crawls at 1 − ηλ_min. One step size must serve both, and the tension is summarised by a single ratio, the condition number κ = λ_max / λ_min — iterations to converge scale like κ.
Elongated contours (large κ). The gradient points across the valley, not along it, so GD ricochets; rescaling by curvature would walk straight to the minimum.
The picture explains the ricochet: the gradient is perpendicular to the contour, and on an elongated bowl that direction is mostly across the valley. Each step overshoots in the steep direction and barely advances in the flat one. Round bowl (κ ≈ 1): one step lands near the centre. This is the same κ that governs error amplification in linear solves — conditioning is one concept, visiting optimisation and numerics alike.
If the bowl's elongation is the problem, undo it. Minimising the local quadratic model exactly gives the Newton step:
In the eigenbasis, this divides the component along vᵢ by λᵢ: steep directions are damped, flat directions amplified, the ellipse is rescaled into a circle, and κ effectively becomes 1 — quadratics are solved in one step. The fine print: H must be positive definite for the step to even point downhill (at a saddle, dividing by a negative λ attracts you to it), and an exact Hessian is unaffordable — n² entries to store and an O(n³) solve, with n in the millions.
Modern optimisers are a museum of partial Hessians — each keeps the curvature information it can afford: