General ML

Backpropagation

All the gradients for the price of one forward pass

01 · First principlesThe problem that forces it to exist

Gradient descent needs ∂L/∂θ for every parameter, every step. The naive way to get a derivative is finite differences: nudge θ_j, rerun the network, see how L moved.

∂L/∂θ_j ≈ ( L(θ + εe_j) − L(θ) ) / ε → one forward pass per parameter

With 10⁹ parameters that is a billion forward passes for a single optimisation step. Not slow — impossible. The question backprop answers: can we get all the partial derivatives for roughly the cost of one pass? Yes, because the derivatives massively share structure.

02 · The ideaChain rule + dynamic programming

A network is a composition: x → h₁ → h₂ → … → h_n → L. The chain rule says the gradient with respect to anything early is a product of local derivatives along the path. The key observation is about reuse: every parameter in layer k needs the same downstream factor ∂L/∂h_k. Compute that quantity once, and all of layer k's parameter gradients are one cheap local step away.

So we sweep backwards, defining δ_k = ∂L/∂h_k and reusing it:

δ_n = ∂L/∂h_n
δ_k = δ_k+1 · ∂h_k+1/∂h_k (reuse, never recompute)
∂L/∂θ_k = δ_k · ∂h_k/∂θ_k

This is textbook dynamic programming: the naive method recomputes the same downstream product over and over; backprop memoises it. One backward sweep visits each edge of the computation graph once.

One forward sweep stores activations; one backward sweep reuses each δ to price every parameter.

03 · The mechanicsVector–Jacobian products everywhere

Each layer's local derivative ∂h_k+1/∂h_k is a Jacobian, and for a hidden width of 4096 that matrix has ~16M entries per example. Backprop never builds it. What the backward sweep actually needs is only the product of a row vector with that Jacobian:

δ_k = δ_k+1 J_k — a vector–Jacobian product, computed directly

For a linear layer h_k+1 = W h_k, the VJP is just δ_k+1W — a matrix–vector multiply, same shape of work as the forward pass. For an elementwise activation it is an elementwise multiply. Every primitive in an autodiff framework ships its VJP rule; "backprop" is the composition of these rules in reverse topological order. (Forward-mode autodiff composes Jacobian–vector products instead — efficient for few inputs and many outputs, which is the opposite of training, where we have one scalar loss and millions of inputs.)

Cost: each VJP costs about as much as the layer's forward computation, so the whole backward pass ≈ one extra forward pass. Total gradient cost is ~2–3× a forward run, independent of parameter count. That constant is the reason deep learning is feasible.

04 · The priceThe memory bill

The local derivatives are functions of the forward activations (the VJP through W needs h_k; the VJP through ReLU needs the sign of its input). So the forward pass must cache every intermediate activation until the backward pass consumes it. Training memory is dominated not by the weights but by activations × batch size × depth.

The standard escape is gradient checkpointing: store activations only every few layers, recompute the rest during the backward pass. It trades roughly one extra forward pass of compute for an O(√depth) memory footprint — the same compute-for-memory bargain backprop itself declined to make.

05 · PerspectiveWhat backprop is not

It is not a learning rule and not specific to neural networks; it is an algorithm for computing exact derivatives of any composed differentiable program. The learning happens in the optimiser that consumes the gradients (SGD, Adam). And it propagates whatever the loss provides — which is why the shape of the loss gradient and the saturation of activations matter: backprop multiplies local derivatives, and long products of numbers below one vanish.

Mental Model

Naive differentiation costs one forward pass per parameter; backprop notices the derivatives share all their downstream structure.
It is dynamic programming on the chain rule: compute each δ = ∂L/∂h once, sweeping backwards.
Everything is vector–Jacobian products; Jacobians are never materialised.
Compute: ~one extra forward pass, regardless of parameter count. Memory: every activation cached — that is the bill.
Backprop computes gradients; it does not learn. The optimiser does.