The matrix a nonlinear map becomes when you zoom in
A nonlinear map f: ℝⁿ → ℝᵐ bends and warps space, so nothing in linear algebra applies to it directly. But zoom in on any single point x and the warping flattens out: locally, every smooth map is a linear map plus an error that shrinks faster than the zoom. The matrix of that local linear map is the Jacobian, the m × n matrix of first partial derivatives:
J(x) = [∂fᵢ/∂xⱼ], i = 1…m, j = 1…n, so that f(x + δ) ≈ f(x) + J(x) δ for small δ.
Read it by rows or by columns; both readings are useful. Row i is the gradient of output fᵢ; column j is the direction in output space you move when nudging input xⱼ. The Jacobian is the bridge that lets every concept in this chapter (rank, null spaces, determinants, conditioning) apply to neural networks, one point at a time.
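The row and column readings can be checked numerically. The sketch below is only an illustration: the map `f`, the point `x0`, and the helper `jacobian_fd` are hypothetical choices, and the Jacobian is estimated by central differences rather than by any particular autodiff library.

```python
import numpy as np

def f(x):
    # Hypothetical nonlinear map R^2 -> R^3, chosen only for illustration.
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def jacobian_fd(func, x, eps=1e-6):
    """Central-difference estimate of the Jacobian, built one column at a time."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        # Column j: how the output moves when input x_j is nudged.
        cols.append((func(x + step) - func(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)  # shape (m, n)

x0 = np.array([1.0, 2.0])
J = jacobian_fd(f, x0)

# Analytic Jacobian for comparison: row i is the gradient of output f_i.
J_exact = np.array([
    [x0[1], x0[0]],
    [np.cos(x0[0]), 0.0],
    [0.0, 2 * x0[1]],
])

def linearisation_error(delta):
    # Distance between the true map and its local linear model at x0.
    return np.linalg.norm(f(x0 + delta) - (f(x0) + J_exact @ delta))

d = np.array([1e-2, -1e-2])
err_big = linearisation_error(d)
err_small = linearisation_error(d / 10)  # zoom in by a factor of 10
```

Shrinking the step by a factor of 10 shrinks the error by roughly a factor of 100 for this map, which is the "error that shrinks faster than the zoom" in concrete form.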
Globally, f warps the grid. Locally, the tiny blue square maps to a tiny parallelogram — a linear transformation. J is that transformation; |det J| is the parallelogram's area ratio.
Compose two maps, h = g ∘ f. Locally, this means "do the linear map of f, then the linear map of g", and composing linear maps is multiplying matrices:
J_h(x) = J_g(f(x)) · J_f(x)
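A quick numerical check of the matrix chain rule, as a sketch under stated assumptions: the maps `f` and `g`, their hand-derived Jacobians `J_f` and `J_g`, and the finite-difference helper `jacobian_fd` are all hypothetical and chosen only for illustration.

```python
import numpy as np

def f(x):  # hypothetical map R^2 -> R^2
    return np.array([x[0] ** 2 - x[1], np.sin(x[0] * x[1])])

def J_f(x):
    c = np.cos(x[0] * x[1])
    return np.array([
        [2 * x[0], -1.0],
        [x[1] * c, x[0] * c],
    ])

def g(y):  # hypothetical map R^2 -> R^2
    return np.array([np.exp(y[0]) + y[1], y[0] * y[1]])

def J_g(y):
    return np.array([
        [np.exp(y[0]), 1.0],
        [y[1], y[0]],
    ])

def h(x):
    return g(f(x))

def jacobian_fd(func, x, eps=1e-6):
    """Central-difference Jacobian, one column per input coordinate."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        cols.append((func(x + step) - func(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)

x0 = np.array([0.5, -1.0])

# Chain rule: the outer Jacobian is evaluated at f(x0), not at x0.
J_chain = J_g(f(x0)) @ J_f(x0)

# Independent estimate obtained by differentiating the composition directly.
J_direct = jacobian_fd(h, x0)
```

The two matrices should agree up to finite-difference error. Note that the outer Jacobian is evaluated at f(x0), the point where g actually receives its input.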
Every scalar chain-rule computation you have ever done is a 1×1 special case of this product. A depth-L network is a composition of L maps, so the network's Jacobian is a product of L layer Jacobians — and at once two classics become transparent: vanishing/exploding gradients are this product's norm collapsing or blowing up with depth (a spectral fact — see eigenvalues and dynamics), and orthogonal initialisation is the choice that keeps each factor norm-preserving.
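The effect of depth on the product's norm can be seen in a toy setting. This sketch assumes purely linear layers, so each layer Jacobian is just its weight matrix, and it takes each weight matrix to be a scaled random orthogonal matrix. The sizes `n` and `depth` and the helper names are hypothetical; real networks also carry the Jacobians of their nonlinearities.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 8, 30

def random_orthogonal(n):
    # QR of a Gaussian matrix yields an orthogonal factor Q.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def product_norm(scale):
    """Spectral norm of a product of `depth` Jacobians, each scale * orthogonal."""
    P = np.eye(n)
    for _ in range(depth):
        P = (scale * random_orthogonal(n)) @ P
    return np.linalg.norm(P, 2)  # largest singular value

norm_small = product_norm(0.5)  # shrinks like 0.5 ** depth
norm_orth = product_norm(1.0)   # orthogonal factors: norm stays at 1
norm_large = product_norm(1.5)  # grows like 1.5 ** depth
```

In this idealised case the norm is exactly scale ** depth, so a per-layer factor slightly below or above 1 compounds into collapse or blow-up, while orthogonal factors leave it at 1.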
Here is the fact that makes deep learning computationally possible. For a network with n parameters and scalar loss, the full Jacobian chain would involve matrices of astronomical size — the Jacobian of layer activations with respect to all parameters is far too large to store, let alone multiply. Backprop never builds any of them. It computes vector–Jacobian products: given a row vector vᵀ, each layer returns vᵀJ directly, by a cheap formula specific to that layer, without ever forming J.
The order of multiplication is the entire design. Write the chain with the row vector on the left, vᵀ J_L ⋯ J₂ J₁. Multiplying right-to-left (matrix × matrix, then × vector at the end) would cost matrix–matrix products; multiplying left-to-right keeps every intermediate a vector, so the whole gradient costs a small constant multiple of one forward pass, with no extra factor for the number of parameters. That asymmetry is reverse-mode autodiff, and it is why we can afford gradients of scalar losses but not, say, full Jacobians of vector outputs. (Forward mode computes Jacobian–vector products Jv instead. Each one is cheap per input direction, which is the wrong economy when the inputs are millions of parameters and the output is one loss.)
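A minimal sketch of the two multiplication orders, assuming a small stack of tanh layers a_out = tanh(W a_in) and a loss equal to the sum of the final activations. The sizes, weights, and variable names are hypothetical, and for brevity the gradient is taken with respect to the input rather than the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 5, 4
Ws = [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(depth)]
x = rng.standard_normal(n)

# Forward pass: keep each layer's output, which the backward pass reuses.
acts = [x]
for W in Ws:
    acts.append(np.tanh(W @ acts[-1]))

# Scalar loss = sum of final activations, so d(loss)/d(output) is all ones.
v = np.ones(n)

# Reverse mode: push the row vector back through the layers, last to first.
# Each layer Jacobian is diag(1 - a_out**2) @ W, but it is never built:
# v^T J = (v * (1 - a_out**2)) @ W, an elementwise product and a
# vector-matrix product.
for W, a_out in zip(reversed(Ws), reversed(acts[1:])):
    v = (v * (1.0 - a_out ** 2)) @ W
grad_vjp = v

# Reference: form every layer Jacobian, multiply them into the full Jacobian
# with matrix-matrix products, and apply the vector only at the end.
J_full = np.eye(n)
for W, a_out in zip(Ws, acts[1:]):
    J_full = (np.diag(1.0 - a_out ** 2) @ W) @ J_full
grad_explicit = np.ones(n) @ J_full
```

Both routes give the same gradient. The first touches only vectors of length n between layers, while the second carries an n × n matrix through every step.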
When f maps ℝⁿ → ℝⁿ and is invertible, the local parallelogram picture has a number attached: a tiny cube of volume dV lands on a parallelepiped of volume |det J| dV. The Jacobian determinant is the local volume exchange rate, varying from point to point as the warp compresses here and stretches there. Push a probability density p_X through f and mass conservation forces the compensation
p_Y(y) = p_X(x) / |det J(x)|, where y = f(x).
This is the change-of-variables formula, the engine of normalizing flows — and the reason flow architectures are built so that det J is cheap (triangular Jacobians; the full story is in determinant, and the continuous-time version, where the determinant relaxes into a trace, powers likelihoods in the probability-flow ODE and flow matching).
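To make the triangular-Jacobian idea concrete, here is a sketch of a two-dimensional coupling-style map with a standard normal base density. The scale and shift functions `s` and `t`, the map `flow`, and the helper `jacobian_fd` are hypothetical choices for illustration, not the construction of any specific flow architecture.

```python
import numpy as np

def s(x1):  # hypothetical scale function
    return 0.5 * np.tanh(x1)

def t(x1):  # hypothetical shift function
    return x1 ** 2

def flow(x):
    """Coupling-style map R^2 -> R^2: x1 passes through, x2 is scaled and shifted."""
    return np.array([x[0], x[1] * np.exp(s(x[0])) + t(x[0])])

def log_det_jacobian(x):
    # The Jacobian is lower triangular with diagonal (1, exp(s(x1))),
    # so log|det J| is simply s(x1): no determinant routine is needed.
    return s(x[0])

def log_standard_normal(x):
    return -0.5 * (x @ x) - 0.5 * x.size * np.log(2 * np.pi)

def log_density_y(x):
    """log p_Y evaluated at y = flow(x): log p_X(x) - log|det J(x)|."""
    return log_standard_normal(x) - log_det_jacobian(x)

def jacobian_fd(func, x, eps=1e-6):
    """Central-difference Jacobian, used here only to check the shortcut."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        cols.append((func(x + step) - func(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)

x0 = np.array([0.3, -1.2])
J = jacobian_fd(flow, x0)
sign, logabsdet = np.linalg.slogdet(J)  # generic, more expensive route
```

The generic `slogdet` result should match the closed-form `log_det_jacobian(x0)`. The triangular structure lets the log-likelihood skip the generic determinant computation entirely.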