All the gradients for the price of one forward pass
Gradient descent needs ∂L/∂θ for every parameter, every step. The naive way to get a derivative is finite differences: nudge θj, rerun the network, see how L moved.
With 109 parameters that is a billion forward passes for a single optimisation step. Not slow — impossible. The question backprop answers: can we get all the partial derivatives for roughly the cost of one pass? Yes, because the derivatives massively share structure.
A network is a composition: x → h1 → h2 → … → hn → L. The chain rule says the gradient with respect to anything early is a product of local derivatives along the path. The key observation is about reuse: every parameter in layer k needs the same downstream factor ∂L/∂hk. Compute that quantity once, and all of layer k's parameter gradients are one cheap local step away.
So we sweep backwards, defining δk = ∂L/∂hk and reusing it:
This is textbook dynamic programming: the naive method recomputes the same downstream product over and over; backprop memoises it. One backward sweep visits each edge of the computation graph once.
One forward sweep stores activations; one backward sweep reuses each δ to price every parameter.
Each layer's local derivative ∂hk+1/∂hk is a Jacobian, and for a hidden width of 4096 that matrix has ~16M entries per example. Backprop never builds it. What the backward sweep actually needs is only the product of a row vector with that Jacobian:
For a linear layer hk+1 = W hk, the VJP is just δk+1W — a matrix–vector multiply, same shape of work as the forward pass. For an elementwise activation it is an elementwise multiply. Every primitive in an autodiff framework ships its VJP rule; "backprop" is the composition of these rules in reverse topological order. (Forward-mode autodiff composes Jacobian–vector products instead — efficient for few inputs and many outputs, which is the opposite of training, where we have one scalar loss and millions of inputs.)
The local derivatives are functions of the forward activations (the VJP through W needs hk; the VJP through ReLU needs the sign of its input). So the forward pass must cache every intermediate activation until the backward pass consumes it. Training memory is dominated not by the weights but by activations × batch size × depth.
The standard escape is gradient checkpointing: store activations only every few layers, recompute the rest during the backward pass. It trades roughly one extra forward pass of compute for an O(√depth) memory footprint — the same compute-for-memory bargain backprop itself declined to make.
It is not a learning rule and not specific to neural networks; it is an algorithm for computing exact derivatives of any composed differentiable program. The learning happens in the optimiser that consumes the gradients (SGD, Adam). And it propagates whatever the loss provides — which is why the shape of the loss gradient and the saturation of activations matter: backprop multiplies local derivatives, and long products of numbers below one vanish.