General ML

Newton's Method

Fit a quadratic, jump to its bottom

01 · First principlesWhat gradient descent cannot see

The gradient is a first-order object: it knows which way is downhill, not how the slope is changing. So gradient descent takes the same kind of step in a gentle valley and on a cliff face, and the user compensates by hand-tuning η. In a long narrow valley (badly conditioned curvature, the subject of data whitening) it zig-zags across the steep direction while inching along the shallow one.

The information GD is missing is curvature — the second derivative. Newton's question: if we are willing to pay for the Hessian, can we skip the tuning entirely?

02 · The ideaModel locally, solve the model exactly

Take the second-order Taylor expansion at the current point — a quadratic bowl that matches the true loss in value, slope, and curvature — and jump straight to the bottom of that bowl:

L(x+Δ) ≈ L(x) + ∇L^TΔ + ½ Δ^THΔ
∇_Δ(·) = 0 ⇒ HΔ = −∇L
x ← x − H⁻¹∇L

Notice what H⁻¹ does to the geometry: it rescales each curvature direction by its own stiffness — big steps along flat directions, small steps across steep ones. Newton's method is gradient descent with a perfect, automatically chosen, matrix-valued learning rate. On an exactly quadratic loss it converges in one step from anywhere.

One Newton step: fit the dashed quadratic at x_t, jump to its minimum. Near the optimum the fit is nearly exact, so the jumps land nearly on target.

03 · The prizeQuadratic convergence

Near a well-behaved minimum the error roughly squares each iteration:

‖x_t+1 − x^*‖ ≤ C ‖x_t − x^*‖²

Squaring the error doubles the number of correct digits per step: 10⁻² → 10⁻⁴ → 10⁻⁸. Gradient descent, by contrast, multiplies the error by a constant factor each step (linear convergence) and pays more the worse the conditioning. When Newton applies — smooth loss, modest dimension, near a minimum — nothing first-order competes. Logistic regression solvers and many classical statistics fits are Newton (or its cousin, iteratively reweighted least squares) for exactly this reason.

04 · The exileWhy deep learning cannot afford it

The costs are all in the Hessian, and they scale brutally with dimension d:

Memory, O(d²). For d = 10⁹ parameters, H has 10¹⁸ entries. Storing it is out of the question before anything is computed.
Solving, O(d³). The step requires solving HΔ = −∇L. Even with structure-exploiting solvers, this dwarfs the cost of a gradient.
Saddle attraction. The subtle one: Newton jumps to the stationary point of the local quadratic, whatever it is. At a saddle, negative curvature directions get a flipped sign and the method walks toward the saddle. High-dimensional loss surfaces are mostly saddles, so pure Newton homes in on exactly the points we want to escape.

Also, a stochastic Hessian estimated on a mini-batch is far noisier than a stochastic gradient — inverting noise amplifies it. Newton's precision is wasted on the noisy, non-convex, billion-dimensional setting.

The fixes, in one line: damping (solve with H + λI, sliding between Newton at λ→0 and gradient descent at λ→∞) and trust regions (only trust the quadratic inside a radius) repair divergence and saddle attraction — but not the O(d²) bill. Cheaper curvature is its own family: second-order methods.

05 · Where it livesThe verdict

Setting	Newton?	Why
Logistic regression, GLMs, small convex fits	Yes	convex, d small, quadratic convergence shines
Classical optimisation, root finding	Yes	the default, with damping or trust regions
Deep networks	No	O(d²) memory, O(d³) solve, saddles, mini-batch noise

Mental Model

Gradient descent sees slope; Newton also sees curvature, and uses it as a perfect matrix-valued learning rate.
One step = fit the local quadratic, jump to its bottom: x ← x − H⁻¹∇L.
The prize is quadratic convergence — correct digits double per iteration near the optimum.
The exile from deep learning: O(d²) storage, O(d³) solve, and it walks toward saddles.
Damping (H + λI) interpolates between Newton and gradient descent; everything cheaper is an approximation of this note.