Dimensionality Reduction

Finding the few coordinates the data actually uses

01 · First principles · High-dimensional data is mostly redundancy

A 1000-dimensional dataset rarely contains 1000 dimensions of information. Pixels move together (edges, lighting), survey answers correlate, sensor channels echo one another. The curse-of-dimensionality note ended on the reprieve: real data concentrates near a low-dimensional manifold inside the ambient space. Dimensionality reduction is the constructive follow-up — find coordinates for that manifold, so that downstream work pays for the intrinsic dimension instead of the ambient one. The question every method must answer: which structure do you preserve, and which do you sacrifice? No projection keeps everything; the methods differ only in what they choose to lose.

02 · Linear · PCA: rotate, rank, truncate

PCA's choice: preserve global variance. Find the direction along which the (centred) data varies most; then the most-varying direction orthogonal to it; and so on. Those directions are exactly the eigenvectors of the covariance matrix Σ, ordered by eigenvalue:

$$\Sigma = V \Lambda V^{\top} \quad\Longrightarrow\quad z = V_k^{\top}(x - \mu)$$

where $V_k$ holds the top-$k$ eigenvectors and $\lambda_i$ is the variance captured by axis $i$.

Geometrically: rotate to the point cloud's own axes (where Σ becomes diagonal — all correlation converted to axis-aligned variance), then drop the flattest axes. Among all linear projections to k dimensions, this one provably minimises squared reconstruction error; equivalently it keeps the largest share of total variance.

[Figure labels: PC1, keep (λ₁ large); PC2, drop (λ₂ small); projecting onto PC1 loses only the thin direction.]

PCA on a correlated cloud: the eigenvectors of Σ are the cloud's own axes; truncation discards the directions with least variance.
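The rotate, rank, truncate recipe can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names (`pca_fit`, `pca_transform`, `pca_reconstruct`) and the toy 2D cloud are assumptions made for the example. It also checks the reconstruction claim numerically: the squared error left after keeping one axis should match the eigenvalue that was dropped.

```python
import numpy as np

def pca_fit(X, k):
    """Return (mu, V_k, eigvals) for a data matrix X of shape (n_samples, n_features)."""
    mu = X.mean(axis=0)
    Xc = X - mu                              # centre the data
    cov = Xc.T @ Xc / (X.shape[0] - 1)       # sample covariance matrix Sigma
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # re-sort so the largest variance comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return mu, eigvecs[:, :k], eigvals

def pca_transform(X, mu, V_k):
    return (X - mu) @ V_k                    # z = V_k^T (x - mu), one row per sample

def pca_reconstruct(Z, mu, V_k):
    return Z @ V_k.T + mu                    # map codes back into the original space

rng = np.random.default_rng(0)
# Hypothetical toy data: a correlated 2D cloud, long along one diagonal, thin along the other.
latent = rng.normal(size=(500, 2)) * np.array([3.0, 0.3])
theta = np.pi / 4
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
X = latent @ rotation.T + np.array([5.0, -2.0])

mu, V_k, eigvals = pca_fit(X, k=1)
Z = pca_transform(X, mu, V_k)
X_hat = pca_reconstruct(Z, mu, V_k)

explained = eigvals[0] / eigvals.sum()
mse = np.sum((X - X_hat) ** 2) / (X.shape[0] - 1)   # same normalisation as the covariance
print(f"share of variance kept by PC1: {explained:.3f}")
print(f"reconstruction error: {mse:.4f}, dropped eigenvalue: {eigvals[1]:.4f}")
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric. For wide data an SVD of the centred matrix is the more common route; the eigendecomposition is kept here because it mirrors the formula above term by term.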

What it cannot do follows from "linear" and "variance". Any manifold that curves (a spiral, a swiss roll, concentric rings) projects badly onto flat axes, and variance is not always what matters: the direction that separates two classes can be a low-variance one. PCA is also sensitive to scale. Variance depends on units, so a feature recorded in units that produce large numbers dominates the leading components regardless of how informative it is. Standardise the features first unless they already share a meaningful common scale.
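The scale problem is easy to reproduce. The sketch below uses made-up features (two correlated measurements in metres and one unrelated quantity in grams, all hypothetical) and compares the first principal direction before and after z-scoring each column. The helper `first_pc` is an illustrative name, not a library function.

```python
import numpy as np

def first_pc(X):
    """Unit-length direction of largest variance of X (rows are samples)."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, -1]                    # eigenvector of the largest eigenvalue

rng = np.random.default_rng(1)
n = 1000
# Hypothetical features: two correlated body measurements and one unrelated quantity
# recorded in grams, so its numbers are thousands of times larger.
shared = rng.normal(size=n)
height_m = 1.7 + 0.10 * shared + 0.03 * rng.normal(size=n)
arm_span_m = 1.7 + 0.10 * shared + 0.03 * rng.normal(size=n)
parcel_g = 5000.0 + 1500.0 * rng.normal(size=n)
X = np.column_stack([height_m, arm_span_m, parcel_g])

raw_pc = first_pc(X)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)     # z-score each column
std_pc = first_pc(X_std)

print("PC1 on raw features:         ", np.round(np.abs(raw_pc), 3))
print("PC1 on standardised features:", np.round(np.abs(std_pc), 3))
```

On the raw data the leading direction should point almost entirely along the gram-valued column, even though that column is unrelated to the other two. After standardisation it should load on the two correlated measurements instead.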

03 · Nonlinear · t-SNE and UMAP: preserve neighbourhoods, not geometry

For visualisation the goal flips: we do not need faithful global geometry in 2D (impossible anyway); we need who is near whom. t-SNE and UMAP build a neighbour graph in high dimensions and lay it out in 2D so that neighbours stay neighbours — local structure is the contract, and everything global is negotiable. The result is excellent cluster pictures and a long list of standard misreadings:

What a t-SNE/UMAP plot does not tell you: cluster sizes (the algorithms expand dense regions and contract sparse ones), distances between clusters (two clusters far apart in the plot may not be far apart in the data), or density. Two runs can also differ wildly with seed and perplexity. Read the plot as "these points are mutual neighbours", and nothing more quantitative.

These are visualisation instruments, not preprocessing steps: feeding t-SNE coordinates into a downstream model imports all of those distortions as features.
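To make "local structure is the contract" concrete, the sketch below measures the one thing such a layout promises: how many of each point's k nearest neighbours survive. It does not run t-SNE or UMAP. The "layout" is a hand-made stand-in that drags one cluster far away from the other, and `knn_indices` and `knn_overlap` are illustrative helper names.

```python
import numpy as np

def knn_indices(P, k):
    """Indices of the k nearest neighbours of every row of P (Euclidean, self excluded)."""
    diff = P[:, None, :] - P[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)          # squared pairwise distances
    np.fill_diagonal(d2, np.inf)             # a point is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def knn_overlap(X_high, Y_low, k=5):
    """Mean fraction of each point's k neighbours in X_high that are also neighbours in Y_low."""
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(Y_low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(2)
# Hypothetical data: two well-separated clusters of 30 points each in 5 dimensions.
A = rng.normal(size=(30, 5))
B = rng.normal(size=(30, 5)) + np.array([20.0, 0.0, 0.0, 0.0, 0.0])
X = np.vstack([A, B])

# Hand-made layout (NOT t-SNE or UMAP): same points, but cluster B is dragged far away.
Y = X.copy()
Y[30:] += np.array([100.0, 0.0, 0.0, 0.0, 0.0])

gap_X = np.linalg.norm(X[:30].mean(axis=0) - X[30:].mean(axis=0))
gap_Y = np.linalg.norm(Y[:30].mean(axis=0) - Y[30:].mean(axis=0))
overlap = knn_overlap(X, Y, k=5)

print(f"distance between cluster centres: data {gap_X:.1f}, layout {gap_Y:.1f}")
print(f"5-nearest-neighbour overlap: {overlap:.3f}")
```

The neighbour overlap stays essentially perfect while the gap between clusters changes several-fold. A neighbour-preserving objective cannot tell these two layouts apart, which is why distances between clusters in such a plot carry no quantitative meaning.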

04 · Learned · Autoencoders: nonlinear PCA with a budget

An autoencoder squeezes the data through a k-dimensional bottleneck and demands reconstruction: encoder f, decoder g, minimise ‖x − g(f(x))‖². The bottleneck forces the network to spend its k coordinates on whatever actually varies — a learned, nonlinear chart of the manifold. The connection to PCA is exact in the degenerate case: with linear f, g and squared error, the optimal bottleneck spans the top-k principal subspace. Depth and nonlinearity buy the ability to flatten curved manifolds, at the usual prices — training, tuning, and coordinates with no closed-form meaning. (Variational autoencoders add a probabilistic prior on the bottleneck; the KL note covers the term that does it.)
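The degenerate linear case can be checked numerically. The sketch below trains a linear encoder and decoder with plain gradient descent in NumPy and compares the result with the PCA optimum. The data, learning rate, step count and initialisation are all assumptions chosen for this toy problem, and a practical autoencoder would use a deep-learning framework and nonlinear layers rather than hand-written gradients.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 200, 5, 2
# Hypothetical data: five features with very different variances, then centred.
X = rng.normal(size=(n, d)) * np.array([3.0, 2.0, 0.3, 0.2, 0.1])
X = X - X.mean(axis=0)

def loss(X, enc, dec):
    """Mean squared reconstruction error with f(x) = x @ enc and g(z) = z @ dec."""
    return np.sum((X - X @ enc @ dec) ** 2) / X.shape[0]

# Linear encoder (d -> k) and decoder (k -> d), small random initialisation.
enc = 0.1 * rng.normal(size=(d, k))
dec = 0.1 * rng.normal(size=(k, d))
initial_loss = loss(X, enc, dec)

learning_rate, steps = 0.01, 3000          # assumed values, tuned by hand for this toy data
for _ in range(steps):
    Z = X @ enc                            # bottleneck codes
    residual = Z @ dec - X                 # reconstruction minus input
    grad_dec = 2.0 * Z.T @ residual / n
    grad_enc = 2.0 * X.T @ residual @ dec.T / n
    enc -= learning_rate * grad_enc
    dec -= learning_rate * grad_dec
final_loss = loss(X, enc, dec)

# PCA reference: the best rank-k linear reconstruction loses the d - k smallest eigenvalues.
eigvals = np.linalg.eigvalsh(X.T @ X / n)  # ascending order
pca_loss = eigvals[: d - k].sum()

print(f"initial {initial_loss:.4f}, trained {final_loss:.4f}, PCA optimum {pca_loss:.4f}")
```

No linear encoder and decoder pair can beat the PCA figure, so the trained loss can only approach it from above. Note that the learned `enc` columns need not equal the principal axes themselves: the network recovers the same subspace, but in an arbitrary basis.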

05 · Choosing · Match the method to the question

| You want | Reach for | It preserves | It sacrifices |
| --- | --- | --- | --- |
| Fast, interpretable compression; decorrelation; denoising | PCA | Global variance, distances (approx.), linear structure | Anything curved; discriminative low-variance directions |
| A 2D picture of cluster structure for human eyes | t-SNE / UMAP | Local neighbourhoods | Global distances, cluster sizes, densities, reproducibility |
| Compact features for a downstream model; curved manifolds | Autoencoder | Whatever reconstruction needs, nonlinearly | Interpretability, convexity, free lunch generally |
| Distance preservation with zero fitting (huge d, streaming) | Random projection | Pairwise distances (Johnson–Lindenstrauss) | Any notion of meaningful axes |
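The random-projection row is the least intuitive, so here is a minimal sketch of it. A Gaussian matrix scaled by 1/sqrt(k) is drawn once, with no fitting, and pairwise distances are compared before and after. The sizes (`n`, `d`, `k`) and the helper `pairwise_distances` are assumptions for the demo; how tight the distance ratios are depends on `k` and on the number of points.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distances between all distinct pairs of rows of P."""
    sq = np.sum(P ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * P @ P.T
    iu = np.triu_indices(P.shape[0], k=1)    # each pair once, diagonal excluded
    return np.sqrt(np.maximum(d2[iu], 0.0))  # clip tiny negatives from rounding

rng = np.random.default_rng(4)
n, d, k = 50, 5000, 500                      # assumed sizes for the demo
X = rng.normal(size=(n, d))

# Gaussian random projection: data-independent, so it also works on streams.
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

ratio = pairwise_distances(Y) / pairwise_distances(X)
print(f"projected/original distance ratio: min {ratio.min():.3f}, max {ratio.max():.3f}")
```

The ratios should cluster around 1, with a spread that narrows as `k` grows. The columns of `Y` are random mixtures of all input features, which is the "no meaningful axes" cost named in the table.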

A serviceable default pipeline: PCA to ~50 dimensions first (cheap, removes noise and redundancy), then t-SNE/UMAP for looking, or the PCA coordinates themselves for modelling.
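As a sketch, that pipeline might look like the following with scikit-learn. The library choice is an assumption, since the text names no implementation, and the synthetic clusters, perplexity and seeds are placeholders to adjust for real data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)
# Hypothetical data: three Gaussian clusters of 100 points each in 200 dimensions.
centres = rng.normal(scale=5.0, size=(3, 200))
X = np.vstack([c + rng.normal(size=(100, 200)) for c in centres])

# Step 1: PCA to 50 dimensions (cheap, linear, strips low-variance noise).
X_50 = PCA(n_components=50, random_state=0).fit_transform(X)

# Step 2a: a 2D layout for looking at, not for modelling.
X_2d = TSNE(n_components=2, perplexity=30.0, init="pca",
            random_state=0).fit_transform(X_50)

# Step 2b: for a downstream model, use X_50 directly rather than X_2d.
print(X_50.shape, X_2d.shape)
```

Fixing `random_state` makes a single run repeatable, but it does not make the layout canonical: a different seed or perplexity can still produce a visibly different picture.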

Mental Model