Cheap noisy steps downhill, and why they win
We have a loss L(θ) over millions of parameters and we cannot solve ∇L = 0 in closed form. The only thing we can afford to compute at a point is local: the gradient. So we ask the local question — among all unit directions d, which one decreases L fastest right here?
That is the entire justification: −∇L is the locally steepest descent direction. Nothing about it is global; the gradient knows the terrain only at the point where it stands. Hence the iteration:
The analogy is honest, not decorative: descending a mountain in fog. You cannot see the valley; you can feel the slope under your feet, so you step downhill and re-check. The step size η decides whether you walk or fall.
The training loss is an average over n examples, so the honest gradient is too:
Full-batch gradient descent computes this exactly. The breakage is cost, not correctness: with n in the billions, one step costs one pass over the entire dataset. You get a perfect direction you can afford to use a handful of times. Worse, the precision is wasted — far from the optimum, a rough estimate of "downhill" is all you need; exactness only matters near the end.
Sample one example (or a mini-batch B) and use its gradient instead. It is an unbiased estimate of the true gradient: E[∇ℓi] = ∇L. Each step is now n/|B| times cheaper, so in the time full-batch takes one exact step, SGD takes thousands of noisy ones. On wall-clock — the only clock that matters — noisy and many beats exact and few.
The tradeoff the fix introduces: the iterates never settle. Near a minimum the noise keeps kicking the parameters around, so plain SGD with fixed η converges to a fuzzy ball around the optimum, not a point (decaying η shrinks the ball). But the noise turns out to be partly a feature: it bounces the iterates out of sharp, narrow minima and lets them rest only in flat ones, which tend to generalise better. SGD is an optimiser with a built-in regulariser.
η is THE hyperparameter. On a quadratic bowl the dynamics are exactly three regimes: too small and you crawl (each step removes a fixed fraction of the remaining distance, very slowly); just right and you converge fast; too large and each step overshoots the minimum by more than it gained, so the error grows geometrically — divergence.
Same bowl, three step sizes. Overshooting by more than you descend compounds — the red path climbs out.
In practice nobody holds η fixed: warm up (large early steps on a random init are dangerous), then decay (cosine or step schedules) so the noise ball shrinks as you approach the answer.
Mini-batches are the engineering compromise: averaging |B| gradients cuts noise variance by 1/|B|, and GPUs compute a batch nearly as fast as a single example, so batching is almost free variance reduction. Past the point where hardware saturates, bigger batches stop paying.
Momentum in two lines: keep a running average of gradients and step along it.
Components of the gradient that persist across steps (the valley floor) accumulate; components that flip sign (the steep walls, the noise) cancel. A heavy ball rolling downhill instead of a hiker who forgets each step. This is also the first moment inside Adam; what SGD still lacks — a per-parameter step size — is that note's subject, and what it ignores about curvature is Newton's method.