Linear Algebra

Jacobian

The matrix a nonlinear map becomes when you zoom in

01 · First principles: Zoom in until everything is linear

A nonlinear map f: ℝⁿ → ℝᵐ bends and warps space; nothing in linear algebra applies to it directly. But zoom in on any single point x and the warping flattens out — locally, every smooth map is a linear map plus an error that shrinks faster than the zoom. The matrix of that local linear map is the Jacobian:

f(x + δ) ≈ f(x) + J δ, J_ij = ∂f_i / ∂x_j

row i: how output i responds to each input · m×n

Read it by rows or by columns; both are useful. Row i is the gradient of output fᵢ; column j is the direction in output space you move when nudging input xⱼ. The Jacobian is the bridge that lets every concept in this chapter (rank, null spaces, determinants, conditioning) apply to neural networks, one point at a time.

Globally, f warps the grid. Locally, the tiny blue square maps to a tiny parallelogram — a linear transformation. J is that transformation; |det J| is the parallelogram's area ratio.
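A minimal numerical sketch of the zoom-in claim, assuming NumPy. The map `f`, the point `x0`, and the helper names are hypothetical, chosen only for illustration. It compares a hand-derived Jacobian with central finite differences, then checks that the linearisation error shrinks faster than the step.

```python
import numpy as np

def f(x):
    """A hypothetical smooth map from R^2 to R^2 (illustration only)."""
    x1, x2 = x
    return np.array([x1 * x2, np.sin(x1) + x2 ** 2])

def jacobian_analytic(x):
    """Hand-derived Jacobian of f: entry (i, j) is d f_i / d x_j."""
    x1, x2 = x
    return np.array([[x2, x1],
                     [np.cos(x1), 2.0 * x2]])

def jacobian_fd(func, x, eps=1e-6):
    """Central finite differences, building J one column (one input) at a time."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        cols.append((func(x + step) - func(x - step)) / (2.0 * eps))
    return np.stack(cols, axis=1)

x0 = np.array([0.7, -1.2])
J = jacobian_analytic(x0)
J_fd = jacobian_fd(f, x0)

def linearisation_error(delta):
    """Size of f(x0 + delta) - (f(x0) + J delta)."""
    return np.linalg.norm(f(x0 + delta) - (f(x0) + J @ delta))

direction = np.array([1.0, 0.5])
err_big = linearisation_error(1e-2 * direction)
err_small = linearisation_error(1e-3 * direction)

# Local area ratio of the tiny square versus its image parallelogram.
area_ratio = abs(np.linalg.det(J))

print("max |J - J_fd|:", np.abs(J - J_fd).max())
print("error ratio for a 10x smaller step:", err_big / err_small)
print("|det J| at x0:", area_ratio)
```

For a smooth map the error is second order in the step, so shrinking the step tenfold should shrink the error by roughly a hundredfold.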

02 · Composition: The chain rule is a matrix product

Compose two maps, h = g ∘ f. Locally, "do the linear map of f, then the linear map of g" — and composing linear maps is multiplying matrices:

J_h(x) = J_g(f(x)) · J_f(x)

the multivariable chain rule, stated honestly

Every scalar chain-rule computation you have ever done is a 1×1 special case of this product. A depth-L network is a composition of L maps, so the network's Jacobian is a product of L layer Jacobians. Two classics become transparent at once: vanishing/exploding gradients are this product's norm collapsing or blowing up with depth (a spectral fact; see eigenvalues and dynamics), and orthogonal initialisation is the choice that makes each weight matrix norm-preserving at the start of training.
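Both claims can be checked on toy examples. The sketch below assumes NumPy; the maps `f` and `g`, the 8×8 layer size, and the scale factors are hypothetical. The first half compares J_g(f(x)) · J_f(x) against a finite-difference Jacobian of the composition. The second half multiplies 50 copies of a scaled orthogonal matrix, a purely linear stand-in for layer Jacobians, to show the product norm collapsing, holding, or blowing up.

```python
import numpy as np

def f(x):                      # R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def J_f(x):
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0],
                     [0.0, 2.0 * x[1]]])

def g(y):                      # R^3 -> R^2
    return np.array([y[0] + y[1] * y[2], np.exp(y[0])])

def J_g(y):
    return np.array([[1.0, y[2], y[1]],
                     [np.exp(y[0]), 0.0, 0.0]])

def h(x):
    return g(f(x))

def jacobian_fd(func, x, eps=1e-6):
    """Central finite differences, one column per input coordinate."""
    cols = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        cols.append((func(x + step) - func(x - step)) / (2.0 * eps))
    return np.stack(cols, axis=1)

x0 = np.array([0.3, -0.8])
J_chain = J_g(f(x0)) @ J_f(x0)     # chain rule: (2x3) @ (3x2), J_g evaluated at f(x0)
J_direct = jacobian_fd(h, x0)      # Jacobian of the composition, measured directly

# Depth and norms: a product of 50 identical factors scale * Q, with Q orthogonal.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))

def product_norm(scale, depth=50):
    P = np.eye(8)
    for _ in range(depth):
        P = (scale * Q) @ P
    return np.linalg.norm(P, 2)    # spectral norm of the whole product

norms = {scale: product_norm(scale) for scale in (0.9, 1.0, 1.1)}
print("chain rule mismatch:", np.abs(J_chain - J_direct).max())
print("product norms by scale:", norms)
```

Because every factor is a scaled orthogonal matrix, the product norm is exactly scale raised to the depth; real layer Jacobians also include the activation derivative, which this sketch leaves out.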

03 · The entire point of backprop: Never materialise J

Here is the fact that makes deep learning computationally possible. For a network with n parameters and scalar loss, the full Jacobian chain would involve matrices of astronomical size — the Jacobian of layer activations with respect to all parameters is far too large to store, let alone multiply. Backprop never builds any of them. It computes vector–Jacobian products: given a row vector vᵀ, each layer returns vᵀJ directly, by a cheap formula specific to that layer, without ever forming J.

(∇_x ℒ)ᵀ = ( ( (∂ℒ/∂y)ᵀ J_L ) J_{L−1} ⋯ ) J₁

ℒ is the scalar loss, J_k the Jacobian of layer k. Always (row vector) × (matrix): each step is one VJP, costing about as much as that layer's forward pass
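The recursion can be sketched for a small tanh network, assuming NumPy. The layer widths, the squared-error loss, and all function names are hypothetical. `layer_vjp` is the "cheap formula specific to that layer": for y = tanh(W a) the layer Jacobian is diag(1 − y²) W, and vᵀJ needs only an elementwise product and one vector–matrix product. The explicit-Jacobian version exists solely to confirm that both routes agree.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [5, 7, 6, 4]                       # hypothetical layer widths
weights = [rng.standard_normal((m, n)) / np.sqrt(n)
           for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    """Apply y = tanh(W a) layer by layer, caching every layer output."""
    acts = [x]
    for W in weights:
        acts.append(np.tanh(W @ acts[-1]))
    return acts

def layer_vjp(v, W, out):
    """Return v^T J for one layer without forming J = diag(1 - out^2) W."""
    return (v * (1.0 - out ** 2)) @ W

def grad_via_vjps(x, target):
    """Gradient of 0.5 * ||y - target||^2 with respect to x, using only VJPs."""
    acts = forward(x)
    v = acts[-1] - target                  # dL/dy, the initial row vector
    for W, out in zip(reversed(weights), reversed(acts[1:])):
        v = layer_vjp(v, W, out)           # every intermediate stays a vector
    return v

def grad_via_full_jacobians(x, target):
    """Same gradient, but building every layer Jacobian (for comparison only)."""
    acts = forward(x)
    J_total = np.eye(x.size)
    for W, out in zip(weights, acts[1:]):
        J_layer = np.diag(1.0 - out ** 2) @ W
        J_total = J_layer @ J_total        # matrix-matrix products
    return (acts[-1] - target) @ J_total

x = rng.standard_normal(sizes[0])
target = rng.standard_normal(sizes[-1])
g_vjp = grad_via_vjps(x, target)
g_full = grad_via_full_jacobians(x, target)
print("max difference between the two routes:", np.abs(g_vjp - g_full).max())
```

This differentiates with respect to the input x, matching the formula above; gradients with respect to the weights reuse the same backward-flowing vector at each layer.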

The order of multiplication is the entire design. Multiplying the chain right-to-left (matrix × matrix, then × vector at the end) would cost matrix–matrix products; multiplying left-to-right keeps every intermediate a vector, so the whole gradient costs a small constant multiple of one forward pass, however many millions of parameters there are. That asymmetry is reverse-mode autodiff, and it is why we can afford gradients of scalar losses but not, say, full Jacobians of vector outputs. (Forward mode computes Jacobian–vector products Jv instead: cheap per input direction, which is the wrong economy when inputs are millions of parameters and the output is one loss.)
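To put rough numbers on the asymmetry, here is a small counting sketch in plain Python. It assumes dense layer Jacobians and counts only scalar multiplications; the function name and the widths are hypothetical.

```python
def chain_costs(widths):
    """Multiplication counts for v^T J_L ... J_1 under two evaluation orders.

    widths = [n_0, n_1, ..., n_L], where J_k is a dense n_k x n_{k-1} matrix
    and v has n_L entries.
    """
    depth = len(widths) - 1
    n0 = widths[0]

    # Left-to-right: (1 x n_k) times (n_k x n_{k-1}) at every step.
    vector_first = sum(widths[k] * widths[k - 1] for k in range(1, depth + 1))

    # Right-to-left: J_k times the running (n_{k-1} x n_0) product for k = 2..L,
    # then a final (1 x n_L) times (n_L x n_0) product with v.
    matrix_first = sum(widths[k] * widths[k - 1] * n0 for k in range(2, depth + 1))
    matrix_first += widths[-1] * n0

    return vector_first, matrix_first

small = chain_costs([3, 4, 5, 2])
wide = chain_costs([1000, 1000, 1000, 1000])
print("small chain (vector first, matrix first):", small)
print("wide chain  (vector first, matrix first):", wide)
```

With uniform width n, the vector-first order grows like L·n² while the matrix-first order grows like L·n³, which is the gap the paragraph above describes.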

One sentence to keep: backprop is not "compute the Jacobians and multiply" — it is the discipline of only ever computing vᵀJ, which is why it scales.

04 · The volume reading: |det J| and change of variables

When f maps ℝⁿ → ℝⁿ and is invertible, the local parallelogram picture has a number attached: a tiny cube of volume dV lands on a parallelepiped of volume |det J| dV. The Jacobian determinant is the local volume exchange rate, varying from point to point as the warp compresses here and stretches there. Push a probability density p_x through z = f(x) to get p_z; the mass in a volume element must be the same before and after, and that conservation forces the compensation

p_x(x) = p_z(f(x)) · |det J_f(x)|

where f compresses volume, density concentrates

This is the change-of-variables formula, the engine of normalizing flows — and the reason flow architectures are built so that det J is cheap (triangular Jacobians; the full story is in determinant, and the continuous-time version, where the determinant relaxes into a trace, powers likelihoods in the probability-flow ODE and flow matching).
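A toy instance of the formula with a triangular Jacobian, assuming NumPy. The scale and shift functions `s` and `t`, the test point, and the grid bounds are hypothetical choices, loosely in the style of a coupling layer rather than any particular published architecture. The first coordinate passes through unchanged and the second is scaled and shifted as a function of the first, so J is lower triangular and log|det J| is read straight off the diagonal.

```python
import numpy as np

def s(x1):
    """Hypothetical log-scale, a function of the first coordinate only."""
    return 0.5 * np.tanh(x1)

def t(x1):
    """Hypothetical shift, a function of the first coordinate only."""
    return np.sin(x1)

def flow(x1, x2):
    """z = f(x): z1 = x1, z2 = x2 * exp(s(x1)) + t(x1)."""
    return x1, x2 * np.exp(s(x1)) + t(x1)

def log_det_jacobian(x1):
    """J is lower triangular with diagonal (1, exp(s(x1))), so log|det J| = s(x1)."""
    return s(x1)

def log_standard_normal_2d(z1, z2):
    return -0.5 * (z1 ** 2 + z2 ** 2) - np.log(2.0 * np.pi)

def log_p_x(x1, x2):
    """Change of variables: log p_x(x) = log p_z(f(x)) + log|det J_f(x)|."""
    z1, z2 = flow(x1, x2)
    return log_standard_normal_2d(z1, z2) + log_det_jacobian(x1)

def jacobian_fd(x, eps=1e-6):
    """Brute-force Jacobian of the flow by central differences (check only)."""
    cols = []
    for j in range(2):
        step = np.zeros(2)
        step[j] = eps
        z_plus = np.array(flow(*(x + step)))
        z_minus = np.array(flow(*(x - step)))
        cols.append((z_plus - z_minus) / (2.0 * eps))
    return np.stack(cols, axis=1)

# Check 1: the diagonal shortcut matches a brute-force determinant.
x0 = np.array([0.4, -1.3])
det_fd = np.linalg.det(jacobian_fd(x0))
det_closed = np.exp(log_det_jacobian(x0[0]))

# Check 2: p_x is a proper density, so a Riemann sum over a wide grid is close to 1.
g1 = np.linspace(-8.0, 8.0, 401)
g2 = np.linspace(-14.0, 14.0, 701)
X1, X2 = np.meshgrid(g1, g2, indexing="ij")
cell = (g1[1] - g1[0]) * (g2[1] - g2[0])
total_mass = np.exp(log_p_x(X1, X2)).sum() * cell

print("det J (finite differences vs diagonal):", det_fd, det_closed)
print("total mass of p_x on the grid:", total_mass)
```

Dropping the log-determinant term makes the second check fail, which is a quick way to see that the |det J| factor is what keeps the likelihood exact.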

The named connection: one matrix, three readings that ML uses daily — J as sensitivity (adversarial robustness, influence of inputs), vᵀJ as backprop's unit of work, |det J| as the exchange rate between densities in generative flows.

Mental Model

The Jacobian is what a nonlinear map looks like under a microscope: the best linear approximation at a point, J_ij = ∂fᵢ/∂xⱼ.
Chain rule = Jacobian product; a deep network's Jacobian is a product over layers, and gradient pathologies are that product's spectrum.
Backprop's entire trick is order of operations: keep everything a vector–Jacobian product and the gradient costs about one extra forward pass.
Reverse mode is cheap per output, forward mode per input — scalar losses with millions of parameters make the choice for you.
|det J| is the local volume exchange rate: the term that lets flows turn Gaussians into data while keeping exact likelihoods.