Linear Algebra

Positive Semi-Definite Matrices

The quadratic form never points downhill

01 · First principles: One inequality, every direction

A symmetric matrix A defines a landscape: to every direction x it assigns the number xᵀAx (the quadratic form — for unit x, "how much A-ness lies along x", a second-order cousin of the dot product). Positive semi-definiteness is a single promise about that landscape:

xᵀAx ≥ 0   for every x
no direction is downhill; the landscape never dips below zero

Strictly positive for all x ≠ 0 makes A positive definite (PD); allowing equality makes it semi-definite (PSD), where some directions are exactly flat. The definition sounds abstract until you notice that two of the most common matrices in ML satisfy it automatically, and that the property is exactly what makes "variance", "energy", and "convex bowl" meaningful words.
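
The three cases can be seen on tiny matrices. The following is a minimal NumPy sketch; the helper name quad_form and the example matrices are illustrative choices, not part of the original text.

```python
import numpy as np

def quad_form(A, x):
    """Return the quadratic form x^T A x for a square matrix A."""
    return float(x @ A @ x)

A_pd = np.array([[2.0, 0.0], [0.0, 1.0]])      # positive definite
A_psd = np.array([[1.0, 1.0], [1.0, 1.0]])     # PSD: x^T A x = (x1 + x2)^2
A_indef = np.array([[1.0, 0.0], [0.0, -1.0]])  # indefinite

flat = np.array([1.0, -1.0])  # a flat direction for A_psd
down = np.array([0.0, 1.0])   # a downhill direction for A_indef

q_pd = quad_form(A_pd, flat)        # 2*1 + 1*1 = 3, strictly positive
q_psd = quad_form(A_psd, flat)      # (1 - 1)^2 = 0, exactly flat
q_indef = quad_form(A_indef, down)  # -1, the landscape dips below zero
```

Only the last matrix breaks the promise: it has a direction along which the form is negative.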

02 · The picture: Bowl versus saddle

Plot f(x) = ½xᵀAx. If A is PD, every direction curves upward: a bowl with a single bottom at the origin. If A has a negative eigenvalue, some direction curves down; with eigenvalues of both signs the surface is a saddle, uphill one way and downhill another, and f is unbounded below, so "the minimum" stops existing. PSD is the boundary case: a bowl with some perfectly flat valley directions (the null space of A).

[Figure panels: "PD: bowl" (all λ > 0, one minimum) and "Indefinite: saddle" (mixed signs of λ)]

Level sets of ½xᵀAx. PD gives nested ellipses around a true minimum; one negative eigenvalue bends the contours into hyperbolas and opens an escape route downhill.

The eigenvalue translation is immediate. By the spectral theorem a symmetric A factors as A = QΛQᵀ with Q orthogonal (a rotate–stretch–rotate-back). Writing y = Qᵀx for the coordinates of x in the eigenbasis, the form becomes xᵀAx = Σᵢ λᵢ yᵢ², a weighted sum of squares. It is nonnegative for all inputs precisely when every weight is nonnegative: PSD ⇔ all λᵢ ≥ 0. PSD matrices are the matrices with no negative stretch anywhere.
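
The identity xᵀAx = Σ λᵢ yᵢ² and the resulting eigenvalue test can be checked numerically. This is a sketch under assumptions: the function name is_psd and the absolute tolerance tol are illustrative, and a real application may prefer a tolerance scaled to the size of A.

```python
import numpy as np

def is_psd(A, tol=1e-10):
    """Eigenvalue test: a symmetric A is PSD when no eigenvalue is below -tol."""
    A = np.asarray(A, dtype=float)
    if A.ndim != 2 or A.shape[0] != A.shape[1] or not np.allclose(A, A.T):
        return False
    # eigvalsh is the symmetric eigenvalue routine; it returns real eigenvalues.
    return bool(np.linalg.eigvalsh(A).min() >= -tol)

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # eigenvalues 1 and 3
lam, Q = np.linalg.eigh(A)              # A = Q diag(lam) Q^T
x = np.array([0.5, -2.0])
y = Q.T @ x                             # coordinates of x in the eigenbasis

direct = x @ A @ x                      # the quadratic form computed directly
weighted = np.sum(lam * y**2)           # the weighted sum of squares
```

The two numbers agree up to round-off, and is_psd returns False for a matrix such as diag(1, −1).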

03 · The two families: Why covariance and Gram matrices are born PSD

Nearly every PSD matrix you will meet belongs to one of two families, and each is PSD for a one-line reason worth internalising.

Covariance Σ = E[(z−μ)(z−μ)ᵀ]
For any direction u, the form is a variance: uᵀΣu = E[(uᵀ(z−μ))²] = Var(uᵀz) ≥ 0. A negative value would mean negative variance along some projection — nonsense. PSD is the matrix-shaped statement "spread cannot be negative".
Gram / AᵀA
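Here A is any real matrix, not necessarily square or symmetric. For any x: xᵀ(AᵀA)x = (Ax)ᵀ(Ax) = ‖Ax‖² ≥ 0. Squared lengths cannot be negative. Kernel matrices, XᵀX in regression, and JᵀJ in Gauss–Newton are all this family; AᵀA has a zero eigenvalue exactly when A has a nontrivial null space, and the two null spaces coincide.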
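
Both one-line arguments can be checked on random data. The sketch below uses made-up data and an arbitrary mixing matrix M; all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Covariance family: u^T S u equals the variance of the projected data.
M = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 2.0]])
Z = rng.normal(size=(200, 3)) @ M       # 200 correlated samples in 3-D
S = np.cov(Z, rowvar=False)             # sample covariance (ddof=1)
u = np.array([1.0, -2.0, 0.5])
form_cov = u @ S @ u
var_proj = np.var(Z @ u, ddof=1)        # variance of the data projected on u

# Gram family: x^T (X^T X) x equals the squared length of X x.
X = rng.normal(size=(5, 3))
x = rng.normal(size=3)
form_gram = x @ (X.T @ X) @ x
sq_len = np.linalg.norm(X @ x) ** 2

# A wide matrix (2 x 3) has a null space, so W^T W has a zero eigenvalue.
W = rng.normal(size=(2, 3))
eig_W = np.linalg.eigvalsh(W.T @ W)     # ascending order
```

In each pair the two quantities match up to round-off, and the smallest eigenvalue of WᵀW is zero to machine precision.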

In fact the second family is exhaustive: every PSD matrix can be written as BᵀB for some B. PSD matrices are precisely the matrices that are secretly a "squared" map — which is why they behave like nonnegative numbers among matrices, and why the next section's square root exists.
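
One way to exhibit such a B is through the eigendecomposition: with A = QΛQᵀ, take B = Λ^½Qᵀ. The sketch below assumes a symmetric input; the name psd_factor and the tolerance are illustrative, and this B is only one of many valid factors.

```python
import numpy as np

def psd_factor(A, tol=1e-10):
    """Return one B with B^T B = A for a symmetric PSD matrix A."""
    A = np.asarray(A, dtype=float)
    lam, Q = np.linalg.eigh(A)
    if lam.min() < -tol:
        raise ValueError("matrix has a negative eigenvalue, so it is not PSD")
    lam = np.clip(lam, 0.0, None)        # remove round-off negatives
    return np.sqrt(lam)[:, None] * Q.T   # diag(sqrt(lam)) @ Q^T

A = np.array([[1.0, 1.0], [1.0, 1.0]])   # PSD and singular
B = psd_factor(A)
reconstructed = B.T @ B
```

The factor exists even for the singular A used here, which is why this construction covers all PSD matrices and not only the PD ones.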

04 · The working tool: Cholesky, the square root you actually use

For a PD matrix there is a canonical, cheap factorisation — the Cholesky decomposition:

A = L Lᵀ,    L lower-triangular, positive diagonal
about half the cost of LU; for symmetric A it exists iff A is PD (the practical PD test)
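
The "practical PD test" can be sketched as an attempted factorisation. The name is_pd is illustrative. Note that np.linalg.cholesky does not check symmetry, so the sketch checks it separately.

```python
import numpy as np

def is_pd(A):
    """Return True if A is symmetric and its Cholesky factorisation succeeds."""
    A = np.asarray(A, dtype=float)
    if A.ndim != 2 or A.shape[0] != A.shape[1] or not np.allclose(A, A.T):
        return False
    try:
        np.linalg.cholesky(A)   # raises LinAlgError when A is not PD
    except np.linalg.LinAlgError:
        return False
    return True

A = np.array([[2.0, 1.0], [1.0, 2.0]])
L = np.linalg.cholesky(A)       # lower-triangular factor with A = L L^T

pd_ok = is_pd(A)                                            # PD
singular_ok = is_pd(np.array([[1.0, 1.0], [1.0, 1.0]]))     # PSD but not PD
indef_ok = is_pd(np.array([[1.0, 0.0], [0.0, -1.0]]))       # indefinite
```

Only the first matrix passes: the singular PSD example and the indefinite one both fail the factorisation.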

L behaves as the working square root of A, and two everyday operations run on it:

  1. Sampling correlated Gaussians. If z ~ N(0, I) then x = μ + Lz has covariance E[Lz(Lz)ᵀ] = L E[zzᵀ] Lᵀ = LLᵀ = Σ. The triangular L is the map that "installs" the desired correlations into white noise. Multivariate-Gaussian samplers (VAE reparameterisation with full covariance, GP posterior draws) are typically this line.
  2. Solving symmetric positive definite (SPD) systems. Σ⁻¹b in a Gaussian log-density or (K + σ²I)⁻¹y in GP regression is two triangular solves through L, never an explicit inverse, and log det Σ = 2 Σᵢ log Lᵢᵢ (a sum over the diagonal of L) falls out for free (see determinant).
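
Both operations can be sketched with NumPy alone. The values of mu, Sigma, and b are made up for illustration, and forward_sub and back_sub are hand-written helpers standing in for a library triangular solver.

```python
import numpy as np

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = np.zeros(len(b))
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_sub(U, y):
    """Solve U x = y for upper-triangular U."""
    x = np.zeros(len(y))
    for i in reversed(range(len(y))):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
L = np.linalg.cholesky(Sigma)

# 1. Sampling: each row of `samples` is mu + L z with z ~ N(0, I).
z = rng.standard_normal(size=(100_000, 2))
samples = mu + z @ L.T
emp_cov = np.cov(samples, rowvar=False)   # should be close to Sigma

# 2. Solving Sigma x = b with two triangular solves, no explicit inverse.
b = np.array([0.5, 2.0])
y = forward_sub(L, b)                     # L y = b
x = back_sub(L.T, y)                      # L^T x = y
logdet = 2.0 * np.sum(np.log(np.diag(L))) # log det Sigma from the diagonal
```

The empirical covariance approaches Sigma as the sample count grows, and Sigma @ x reproduces b up to round-off.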

When a "covariance" fails Cholesky because round-off produced a tiny negative eigenvalue, the standard fix is the λI lift from singular matrices: add a small jitter εI, which shifts every eigenvalue up by ε and restores the floor.
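
A jitter loop might look like the sketch below. The function name, the starting value 1e-10, and the factor-of-ten schedule are assumptions made for illustration, not a standard prescribed by the text.

```python
import numpy as np

def cholesky_with_jitter(A, start=1e-10, max_tries=10):
    """Try Cholesky on A, adding growing multiples of I until it succeeds.

    Returns (L, jitter) with L L^T = A + jitter * I.
    """
    A = np.asarray(A, dtype=float)
    A = 0.5 * (A + A.T)                   # remove asymmetry from round-off
    eye = np.eye(A.shape[0])
    jitter = 0.0
    for _ in range(max_tries + 1):
        try:
            return np.linalg.cholesky(A + jitter * eye), jitter
        except np.linalg.LinAlgError:
            jitter = start if jitter == 0.0 else 10.0 * jitter
    raise np.linalg.LinAlgError("Cholesky failed even with jitter")

# A "covariance" with eigenvalues of about 2 and -1e-9.
A_bad = np.array([[1.0, 1.0], [1.0, 1.0]]) - 1e-9 * np.eye(2)
L, used = cholesky_with_jitter(A_bad)
```

The returned jitter records how far the matrix had to be lifted; a large value suggests the input was not a round-off casualty but genuinely indefinite.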

The named connection: PSD is the precondition for optimisation's nicest regime — a PSD Hessian is what "locally convex" means, Gauss–Newton replaces the true Hessian with the born-PSD JᵀJ to guarantee downhill steps, and Adam's denominator is a diagonal PSD estimate. Convexity, variance, and kernels are one property wearing three coats.
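
The Gauss–Newton claim can be illustrated numerically. In the sketch below J and r are random stand-ins for a Jacobian and a residual vector (hypothetical data), and the objective is assumed to be ½‖r‖², whose gradient is Jᵀr.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.normal(size=(6, 3))     # hypothetical Jacobian, full column rank
r = rng.normal(size=6)          # hypothetical residual vector

g = J.T @ r                     # gradient of 0.5 * ||r||^2
H_gn = J.T @ J                  # Gauss-Newton matrix, PSD by construction
p = -np.linalg.solve(H_gn, g)   # Gauss-Newton step
slope = g @ p                   # equals -g^T (J^T J)^{-1} g
```

Because JᵀJ is PD when J has full column rank, slope is negative whenever the gradient is nonzero, so the step points downhill to first order.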