General ML

Newton's Method

Fit a quadratic, jump to its bottom

01 · First principlesWhat gradient descent cannot see

The gradient is a first-order object: it knows which way is downhill, not how the slope is changing. So gradient descent takes the same kind of step in a gentle valley and on a cliff face, and the user compensates by hand-tuning η. In a long narrow valley (badly conditioned curvature, the subject of data whitening) it zig-zags across the steep direction while inching along the shallow one.

The information GD is missing is curvature — the second derivative. Newton's question: if we are willing to pay for the Hessian, can we skip the tuning entirely?

02 · The ideaModel locally, solve the model exactly

Take the second-order Taylor expansion at the current point — a quadratic bowl that matches the true loss in value, slope, and curvature — and jump straight to the bottom of that bowl:

L(x+Δ) ≈ L(x) + ∇LTΔ + ½ ΔT
Δ(·) = 0  ⇒  HΔ = −∇L
x ← x − H−1∇L

Notice what H−1 does to the geometry: it rescales each curvature direction by its own stiffness — big steps along flat directions, small steps across steep ones. Newton's method is gradient descent with a perfect, automatically chosen, matrix-valued learning rate. On an exactly quadratic loss it converges in one step from anywhere.

true loss L(x) xₜ local quadratic fit xₜ₊₁ = min of the quadratic

One Newton step: fit the dashed quadratic at xt, jump to its minimum. Near the optimum the fit is nearly exact, so the jumps land nearly on target.

03 · The prizeQuadratic convergence

Near a well-behaved minimum the error roughly squares each iteration:

‖xt+1 − x*‖ ≤ C ‖xt − x*2

Squaring the error doubles the number of correct digits per step: 10−2 → 10−4 → 10−8. Gradient descent, by contrast, multiplies the error by a constant factor each step (linear convergence) and pays more the worse the conditioning. When Newton applies — smooth loss, modest dimension, near a minimum — nothing first-order competes. Logistic regression solvers and many classical statistics fits are Newton (or its cousin, iteratively reweighted least squares) for exactly this reason.

04 · The exileWhy deep learning cannot afford it

The costs are all in the Hessian, and they scale brutally with dimension d:

  1. Memory, O(d²). For d = 109 parameters, H has 1018 entries. Storing it is out of the question before anything is computed.
  2. Solving, O(d³). The step requires solving HΔ = −∇L. Even with structure-exploiting solvers, this dwarfs the cost of a gradient.
  3. Saddle attraction. The subtle one: Newton jumps to the stationary point of the local quadratic, whatever it is. At a saddle, negative curvature directions get a flipped sign and the method walks toward the saddle. High-dimensional loss surfaces are mostly saddles, so pure Newton homes in on exactly the points we want to escape.

Also, a stochastic Hessian estimated on a mini-batch is far noisier than a stochastic gradient — inverting noise amplifies it. Newton's precision is wasted on the noisy, non-convex, billion-dimensional setting.

The fixes, in one line: damping (solve with H + λI, sliding between Newton at λ→0 and gradient descent at λ→∞) and trust regions (only trust the quadratic inside a radius) repair divergence and saddle attraction — but not the O(d²) bill. Cheaper curvature is its own family: second-order methods.

05 · Where it livesThe verdict

SettingNewton?Why
Logistic regression, GLMs, small convex fitsYesconvex, d small, quadratic convergence shines
Classical optimisation, root findingYesthe default, with damping or trust regions
Deep networksNoO(d²) memory, O(d³) solve, saddles, mini-batch noise
Mental Model