Dimensionality Reduction

Finding the few coordinates the data actually uses

01 · First principles · High-dimensional data is mostly redundancy

A 1000-dimensional dataset rarely contains 1000 dimensions of information. Pixels move together (edges, lighting), survey answers correlate, sensor channels echo one another. The curse-of-dimensionality note ended on the reprieve: real data concentrates near a low-dimensional manifold inside the ambient space. Dimensionality reduction is the constructive follow-up — find coordinates for that manifold, so that downstream work pays for the intrinsic dimension instead of the ambient one. The question every method must answer: which structure do you preserve, and which do you sacrifice? No projection keeps everything; the methods differ only in what they choose to lose.
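The gap between ambient and intrinsic dimension is easy to see on synthetic data. The sketch below is illustrative only: the sizes (500 samples, 100 channels, 3 hidden factors), the noise level, and all variable names are assumptions made for the example, not anything from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, ambient_dim, latent_dim = 500, 100, 3

# Assumed generative setup: 3 hidden factors drive all 100 observed channels,
# plus a little independent noise on every channel.
latent = rng.normal(size=(n_samples, latent_dim))
mixing = rng.normal(size=(latent_dim, ambient_dim))
noise = 0.01 * rng.normal(size=(n_samples, ambient_dim))
X = latent @ mixing + noise

# Singular values of the centred data: squared, they are proportional to
# the variance along each principal direction.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
variance_share = singular_values**2 / np.sum(singular_values**2)

# Smallest number of directions whose cumulative share reaches 99%.
dims_for_99 = int(np.searchsorted(np.cumsum(variance_share), 0.99) + 1)
print("ambient dimension:", ambient_dim)
print("directions needed for 99% of variance:", dims_for_99)
```

In this constructed case the manifold is a flat 3-dimensional subspace, which is the easy situation; curved manifolds need the nonlinear methods discussed later.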

02 · Linear · PCA: rotate, rank, truncate

PCA's choice: preserve global variance. Find the direction along which the (centred) data varies most; then the most-varying direction orthogonal to it; and so on. Those directions are exactly the eigenvectors of the covariance matrix Σ, ordered by decreasing eigenvalue:

Σ = V Λ V^T,   z = V_k^T (x − μ)

where μ is the data mean, V_k holds the top-k eigenvectors (the columns of V with the largest eigenvalues), and λ_i, the i-th diagonal entry of Λ, is the variance captured by axis i.

Geometrically: rotate to the point cloud's own axes, where Σ becomes diagonal and all correlation is converted to axis-aligned variance, then drop the flattest axes. Among all linear projections to k dimensions, this one provably minimises mean squared reconstruction error; equivalently, it keeps the largest possible share of total variance, (λ_1 + … + λ_k) / (λ_1 + … + λ_d), where d is the ambient dimension.

PCA on a correlated cloud: the eigenvectors of Σ are the cloud's own axes; truncation discards the directions with least variance.
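The rotate-rank-truncate recipe can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the helper name `pca_fit`, the mixing matrix `A`, and the synthetic cloud are assumptions made for the example.

```python
import numpy as np

def pca_fit(X, k):
    """Return (mean, top-k eigenvectors as columns, all eigenvalues descending)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / (X.shape[0] - 1)      # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return mu, eigvecs[:, :k], eigvals

rng = np.random.default_rng(0)
# Assumed correlated cloud: isotropic noise pushed through a mixing matrix.
A = np.array([[3.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.2, 0.1]])
X = rng.normal(size=(1000, 3)) @ A.T + np.array([5.0, -2.0, 1.0])

k = 2
mu, V_k, eigvals = pca_fit(X, k)
Z = (X - mu) @ V_k        # z = V_k^T (x - mu), applied to every row
X_hat = Z @ V_k.T + mu    # map back to the ambient space

# Variance of each retained coordinate, to compare with its eigenvalue.
var_z = Z.var(axis=0, ddof=1)
# Reconstruction error with the same n-1 normalisation as the covariance,
# to compare with the sum of the discarded eigenvalues.
recon_err = np.sum((X - X_hat) ** 2) / (X.shape[0] - 1)

print("eigenvalues:          ", eigvals)
print("variance of z:        ", var_z)
print("reconstruction error: ", recon_err, "vs discarded:", eigvals[k:].sum())
print("variance share kept:  ", eigvals[:k].sum() / eigvals.sum())
```

The two comparisons printed at the end are the two readings of the same optimality statement: the retained coordinates carry exactly the top eigenvalues as variance, and what is lost is exactly the sum of the eigenvalues that were dropped.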

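What it cannot do follows from "linear" and "variance". A spiral, a swiss roll, concentric rings: any manifold that curves projects badly onto flat axes. And variance is not always what matters (the discriminative direction between two classes can be the low-variance one). PCA is also scale-dependent: variance is measured in each feature's own units, so a length recorded in millimetres carries a million times the variance of the same length in metres, and the largest-unit feature wins by accounting fraud. Run PCA on standardised features unless the units are genuinely comparable.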
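The scale problem is easy to reproduce. In the sketch below, two features share a common signal and a third is pure independent noise that happens to be recorded in a large unit; the data, the factor of 1000, and the helper `first_pc` are assumptions chosen for illustration.

```python
import numpy as np

def first_pc(X):
    """Leading principal direction of X (unit vector, sign arbitrary)."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]  # eigh sorts ascending, so the last column is the top one

rng = np.random.default_rng(0)
n = 2000
shared = rng.normal(size=n)
x1 = shared + 0.5 * rng.normal(size=n)  # correlated with x2 through `shared`
x2 = shared + 0.5 * rng.normal(size=n)
x3 = 1000.0 * rng.normal(size=n)        # unrelated noise in a large unit
X = np.column_stack([x1, x2, x3])

raw_pc1 = first_pc(X)

# Standardise: zero mean and unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_pc1 = first_pc(X_std)

print("first PC, raw features:         ", np.round(raw_pc1, 3))
print("first PC, standardised features:", np.round(std_pc1, 3))
```

On the raw features the leading direction points almost entirely along the large-unit noise column; after standardisation it moves to the direction shared by the two correlated features, which is the structure actually present.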

03 · Nonlinear · t-SNE and UMAP: preserve neighbourhoods, not geometry

For visualisation the goal flips: we do not need faithful global geometry in 2D (impossible anyway); we need who is near whom. t-SNE and UMAP build a neighbour graph in high dimensions and lay it out in 2D so that neighbours stay neighbours — local structure is the contract, and everything global is negotiable. The result is excellent cluster pictures and a long list of standard misreadings:

What a t-SNE/UMAP plot does not tell you: cluster sizes (the algorithms expand dense regions and contract sparse ones), distances between clusters (two clusters far apart in the plot may not be far apart in the data), or density. The layout also depends on the random seed and on the neighbourhood-size hyperparameter (perplexity in t-SNE, n_neighbors in UMAP), so two runs on the same data can look quite different. Read the plot as "these points are mutual neighbours", and nothing more quantitative.

These are visualisation instruments, not preprocessing steps: feeding t-SNE coordinates into a downstream model imports all of those distortions as features.
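Since neighbourhoods are the only thing such a plot promises, one way to audit a layout is to measure how many of each point's k nearest neighbours survive the trip to 2D. The sketch below is a hypothetical helper, not part of any t-SNE or UMAP library; to stay dependency-free it scores a PCA projection and a random scatter as stand-in layouts, and the same function could be applied to coordinates produced by either algorithm.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours of every row (self excluded)."""
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def neighbour_preservation(high, low, k=10):
    """Average fraction of high-dimensional k-NN that are also k-NN in the layout."""
    nn_high = knn_indices(high, k)
    nn_low = knn_indices(low, k)
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(overlap)) / k

rng = np.random.default_rng(0)
n = 300
# Assumed data: a 2D sheet embedded linearly in 10 dimensions, plus small noise.
sheet = rng.uniform(size=(n, 2))
X = sheet @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(n, 10))

# Stand-in layouts: a 2D PCA projection and a layout unrelated to the data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
layout_pca = Xc @ Vt[:2].T
layout_random = rng.normal(size=(n, 2))

score_pca = neighbour_preservation(X, layout_pca)
score_random = neighbour_preservation(X, layout_random)
print("PCA layout:   ", score_pca)
print("random layout:", score_random)
```

A score near 1 says the layout honours its local contract; it still says nothing about cluster sizes, gaps between clusters, or density.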

04 · Learned · Autoencoders: nonlinear PCA with a budget

An autoencoder squeezes the data through a k-dimensional bottleneck and demands reconstruction: encoder f, decoder g, minimise ‖x − g(f(x))‖². The bottleneck forces the network to spend its k coordinates on whatever actually varies — a learned, nonlinear chart of the manifold. The connection to PCA is exact in the degenerate case: with linear f, g and squared error, the optimal bottleneck spans the top-k principal subspace. Depth and nonlinearity buy the ability to flatten curved manifolds, at the usual prices — training, tuning, and coordinates with no closed-form meaning. (Variational autoencoders add a probabilistic prior on the bottleneck; the KL note covers the term that does it.)
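The degenerate linear case can be checked numerically. The sketch below trains a linear encoder E and decoder D by plain gradient descent on squared reconstruction error and compares the result with PCA. The data, learning rate, step count, and initialisation scale are assumptions picked so that this small problem converges; they are not recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 400, 5, 2

# Assumed data: independent coordinates with very different spreads,
# then rotated so the principal axes are not the coordinate axes.
stds = np.array([2.0, 1.0, 0.3, 0.2, 0.1])
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = (rng.normal(size=(n, d)) * stds) @ rotation.T
X = X - X.mean(axis=0)

# PCA reference: top-k right singular vectors of the centred data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T
pca_err = np.mean(np.sum((X - X @ V_k @ V_k.T) ** 2, axis=1))

# Linear autoencoder: f(x) = E x, g(z) = D z, loss = mean ||x - D E x||^2.
E = 0.1 * rng.normal(size=(k, d))
D = 0.1 * rng.normal(size=(d, k))
lr = 0.02
for _ in range(5000):
    Z = X @ E.T                       # codes, shape (n, k)
    R = Z @ D.T - X                   # residuals, shape (n, d)
    grad_D = (2.0 / n) * R.T @ Z      # gradient of the loss w.r.t. D
    grad_E = (2.0 / n) * D.T @ R.T @ X  # gradient of the loss w.r.t. E
    D -= lr * grad_D
    E -= lr * grad_E

ae_err = np.mean(np.sum((X - X @ E.T @ D.T) ** 2, axis=1))

# Compare subspaces through their orthogonal projectors.
Q, _ = np.linalg.qr(D)
subspace_gap = np.linalg.norm(Q @ Q.T - V_k @ V_k.T)

print("PCA reconstruction error:", pca_err)
print("AE reconstruction error: ", ae_err)
print("projector difference:    ", subspace_gap)
```

Note that the comparison is between subspaces, not individual axes: the decoder's columns span the principal subspace but need not be the eigenvectors themselves, since any invertible mixing of the code coordinates gives the same reconstruction.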

05 · Choosing · Match the method to the question

You want	Reach for	It preserves	It sacrifices
Fast, interpretable compression; decorrelation; denoising	PCA	Global variance, distances (approx.), linear structure	Anything curved; discriminative low-variance directions
A 2D picture of cluster structure for human eyes	t-SNE / UMAP	Local neighbourhoods	Global distances, cluster sizes, densities, reproducibility
Compact features for a downstream model; curved manifolds	Autoencoder	Whatever reconstruction needs, nonlinearly	Interpretability, convexity, free lunch generally
Distance preservation with zero fitting (huge d, streaming)	Random projection	Pairwise distances (Johnson–Lindenstrauss)	Any notion of meaningful axes
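The random-projection row can be illustrated in a few lines: multiply by a Gaussian matrix scaled by 1/sqrt(k) and compare pairwise distances before and after. The sizes below (50 points, d = 2000, k = 400) and the helper `pairwise_distances` are assumptions chosen for a quick demonstration.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance matrix between the rows of P."""
    sq = np.sum(P ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * P @ P.T
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from round-off

rng = np.random.default_rng(0)
n, d, k = 50, 2000, 400

X = rng.normal(size=(n, d))

# No fitting step: the projection matrix never looks at the data.
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

# Ratio of projected to original distance for every distinct pair.
iu = np.triu_indices(n, k=1)
ratio = pairwise_distances(Y)[iu] / pairwise_distances(X)[iu]

print("pairs:", ratio.size)
print("distance ratio range:", ratio.min(), "to", ratio.max())
```

The ratios cluster around 1, and the spread narrows as k grows. The projected axes themselves are arbitrary mixtures of the original features, which is the sacrifice named in the table.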

A serviceable default pipeline: PCA to ~50 dimensions first (cheap, removes noise and redundancy), then t-SNE/UMAP for looking, or the PCA coordinates themselves for modelling.

Mental Model

Real data uses far fewer coordinates than it is stored in; reduction recovers the chart of the manifold.
Every method is defined by what it agrees to lose — there is no lossless projection.
PCA = eigendecomposition of Σ: rotate to the cloud's own axes, drop the flat ones; optimal among linear maps, blind to curvature.
t-SNE/UMAP keep neighbourhoods only — never read sizes or inter-cluster distances off the plot.
Autoencoders are PCA with the linearity removed: a learned bottleneck chart, priced in tuning.