Finding the few coordinates the data actually uses
A 1000-dimensional dataset rarely contains 1000 dimensions of information. Pixels move together (edges, lighting), survey answers correlate, sensor channels echo one another. The curse-of-dimensionality note ended on the reprieve: real data concentrates near a low-dimensional manifold inside the ambient space. Dimensionality reduction is the constructive follow-up — find coordinates for that manifold, so that downstream work pays for the intrinsic dimension instead of the ambient one. The question every method must answer: which structure do you preserve, and which do you sacrifice? No projection keeps everything; the methods differ only in what they choose to lose.
PCA's choice: preserve global variance. Find the direction along which the (centred) data varies most; then the most-varying direction orthogonal to it; and so on. Those directions are exactly the eigenvectors of the covariance matrix Σ, ordered by eigenvalue: Σ vᵢ = λᵢ vᵢ, with λ₁ ≥ λ₂ ≥ … ≥ 0. The eigenvalue λᵢ is the variance of the data along vᵢ.
Geometrically: rotate to the point cloud's own axes (where Σ becomes diagonal — all correlation converted to axis-aligned variance), then drop the flattest axes. Among all linear projections to k dimensions, this one provably minimises squared reconstruction error; equivalently it keeps the largest share of total variance.
Figure: PCA on a correlated cloud. The eigenvectors of Σ are the cloud's own axes; truncation discards the directions with least variance.
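The recipe above (centre, diagonalise Σ, keep the top-k axes) can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the helper name `pca`, the mixing matrix `A`, and the synthetic 3-D cloud are assumptions made up for the example.

```python
import numpy as np

def pca(X, k):
    """Return (mean, top-k directions as columns, all eigenvalues, descending)."""
    mean = X.mean(axis=0)
    Xc = X - mean                                # centre the data
    cov = Xc.T @ Xc / (len(X) - 1)               # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort, largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return mean, eigvecs[:, :k], eigvals

# Hypothetical correlated 3-D cloud: independent noise pushed through a mixing matrix.
rng = np.random.default_rng(0)
A = np.array([[3.0, 0.0, 0.0],
              [1.5, 1.0, 0.0],
              [0.5, 0.3, 0.1]])
X = rng.normal(size=(2000, 3)) @ A.T

mean, W, eigvals = pca(X, k=2)
Z = (X - mean) @ W                               # coordinates on the cloud's own axes
X_hat = Z @ W.T + mean                           # reconstruction from 2 coordinates

# Squared reconstruction error, normalised like the covariance (n - 1).
recon_err = ((X - X_hat) ** 2).sum() / (len(X) - 1)

print("share of variance kept:", eigvals[:2].sum() / eigvals.sum())
print("reconstruction error:  ", recon_err)
print("discarded eigenvalue:  ", eigvals[2])
```

In the rotated coordinates `Z` the covariance is diagonal, and the reconstruction error equals the sum of the discarded eigenvalues, which is the "keeps the largest share of total variance" statement in numerical form.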
What it cannot do follows from "linear" and "variance". A spiral, a swiss roll, concentric rings — any manifold that curves — projects badly onto flat axes; and variance is not always what matters (the discriminative direction between two classes can be the low-variance one). PCA also entangles scale: run it on standardised features, or the largest-unit feature wins by accounting fraud.
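The scale problem is easy to reproduce. The sketch below uses two invented features, a height in metres and an income in currency units, with a modest shared component; the feature names, magnitudes, and the helper `first_pc` are all assumptions for illustration.

```python
import numpy as np

def first_pc(X):
    """Unit vector along the direction of largest variance."""
    cov = np.cov(X - X.mean(axis=0), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    return eigvecs[:, -1]                        # last column = largest eigenvalue

rng = np.random.default_rng(0)
n = 1000
shared = rng.normal(size=n)                      # common factor, so features correlate
height_m = 1.70 + 0.10 * (0.6 * shared + 0.8 * rng.normal(size=n))
income = 40_000 + 15_000 * (0.6 * shared + 0.8 * rng.normal(size=n))
X = np.column_stack([height_m, income])

# Raw units: income's variance is many orders of magnitude larger.
pc_raw = first_pc(X)

# Standardised: each feature has unit variance, so only correlation is left.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
pc_std = first_pc(X_std)

print("first PC, raw units:   ", np.abs(pc_raw))
print("first PC, standardised:", np.abs(pc_std))
```

On raw units the first component is essentially the income axis; after standardising, both features load equally, because for two unit-variance features the leading eigenvector of the correlation matrix points along the diagonal.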
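For visualisation the goal flips: we do not need faithful global geometry in 2D (impossible anyway); we need who is near whom. t-SNE and UMAP build a neighbour graph in high dimensions and lay it out in 2D so that neighbours stay neighbours: local structure is the contract, and everything global is negotiable. The result is excellent cluster pictures and a long list of standard misreadings: reading meaning into cluster sizes, into the distances between clusters, or into apparent densities, and treating a single run as definitive when a different seed or hyperparameter setting gives a different layout.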
These are visualisation instruments, not preprocessing steps: feeding t-SNE coordinates into a downstream model imports all of those distortions as features.
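Since local structure is the only thing these layouts promise, one simple sanity check is to measure how many of each point's high-dimensional nearest neighbours are still its neighbours in the embedding. The sketch below is not t-SNE or UMAP itself; `knn_indices` and `knn_overlap` are hypothetical helper names, and the clustered data is synthetic. Any 2D layout can be passed in as `X_low`.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of each row's k nearest neighbours (Euclidean, self excluded)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                   # a point is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def knn_overlap(X_high, X_low, k=10):
    """Mean fraction of high-dim k-NN that are also k-NN in the embedding."""
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(X_low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k

# Synthetic data: three clusters in 20 dimensions.
rng = np.random.default_rng(0)
centres = rng.normal(scale=5.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(100, 20)) for c in centres])

faithful = knn_overlap(X, X.copy(), k=10)                          # upper bound
scrambled = knn_overlap(X, rng.normal(size=(len(X), 2)), k=10)     # chance level

print("identity 'embedding':", faithful)
print("random 2D layout:    ", scrambled)
```

A real embedding should score somewhere between the two reference values. The score says nothing about cluster sizes or inter-cluster distances, which is consistent with the contract described above.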
An autoencoder squeezes the data through a k-dimensional bottleneck and demands reconstruction: encoder f, decoder g, minimise ‖x − g(f(x))‖². The bottleneck forces the network to spend its k coordinates on whatever actually varies — a learned, nonlinear chart of the manifold. The connection to PCA is exact in the degenerate case: with linear f, g and squared error, the optimal bottleneck spans the top-k principal subspace. Depth and nonlinearity buy the ability to flatten curved manifolds, at the usual prices — training, tuning, and coordinates with no closed-form meaning. (Variational autoencoders add a probabilistic prior on the bottleneck; the KL note covers the term that does it.)
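The degenerate case can be checked numerically. The sketch below trains a linear autoencoder (no biases, no nonlinearity) with plain gradient descent on synthetic centred data and compares its reconstruction error with the PCA optimum. The data, learning rate, step count, and initialisation scale are assumptions chosen for this toy problem, not recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 5, 2

# Synthetic data: axis-aligned Gaussian with two dominant directions, then rotated.
stds = np.array([2.0, 1.5, 0.3, 0.2, 0.1])
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))     # random orthogonal matrix
X = (rng.normal(size=(n, d)) * stds) @ Q.T
X -= X.mean(axis=0)                              # centre, as PCA assumes

# PCA optimum: mean squared reconstruction error = sum of discarded eigenvalues.
eigvals = np.linalg.eigvalsh(X.T @ X / n)        # ascending order
pca_err = eigvals[: d - k].sum()

# Linear autoencoder: encoder f(x) = x @ W_enc, decoder g(z) = z @ W_dec.
W_enc = 0.01 * rng.normal(size=(d, k))
W_dec = 0.01 * rng.normal(size=(k, d))
lr = 0.02
for _ in range(3000):
    Z = X @ W_enc                                # bottleneck codes
    R = Z @ W_dec - X                            # reconstruction residual
    grad_dec = (2.0 / n) * Z.T @ R               # d(loss)/d(W_dec)
    grad_enc = (2.0 / n) * X.T @ (R @ W_dec.T)   # d(loss)/d(W_enc)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

ae_err = ((X @ W_enc @ W_dec - X) ** 2).sum() / n

print("PCA optimum:        ", pca_err)
print("linear autoencoder: ", ae_err)
```

The autoencoder cannot beat the PCA error, since any rank-k linear map is bounded by it, and after training it should sit just above it. The learned weights themselves need not equal the eigenvectors: only the subspace they span is pinned down.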
| You want | Reach for | It preserves | It sacrifices |
|---|---|---|---|
| Fast, interpretable compression; decorrelation; denoising | PCA | Global variance, distances (approx.), linear structure | Anything curved; discriminative low-variance directions |
| A 2D picture of cluster structure for human eyes | t-SNE / UMAP | Local neighbourhoods | Global distances, cluster sizes, densities, reproducibility |
| Compact features for a downstream model; curved manifolds | Autoencoder | Whatever reconstruction needs, nonlinearly | Interpretability, convexity, free lunch generally |
| Distance preservation with zero fitting (huge d, streaming) | Random projection | Pairwise distances (Johnson–Lindenstrauss) | Any notion of meaningful axes |
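The random-projection row can be illustrated without any fitting step at all. The sketch below assumes a dense Gaussian projection matrix scaled by 1/√k, which is one common construction among several; the sizes `n`, `d`, `k` and the helper `pairwise_dists` are arbitrary choices for the example.

```python
import numpy as np

def pairwise_dists(X):
    """Euclidean distances for all unordered pairs of rows."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    iu = np.triu_indices(len(X), k=1)            # each pair once, no diagonal
    return np.sqrt(np.maximum(d2[iu], 0.0))      # clip tiny negative round-off

rng = np.random.default_rng(0)
n, d, k = 50, 5000, 500

X = rng.normal(size=(n, d))                      # points in the ambient space
R = rng.normal(size=(d, k)) / np.sqrt(k)         # data-independent projection
Y = X @ R                                        # points in k dimensions

ratios = pairwise_dists(Y) / pairwise_dists(X)

print("distance ratio, min:", ratios.min())
print("distance ratio, max:", ratios.max())
```

The ratios cluster around 1, with a spread that shrinks as k grows. The columns of `R` carry no meaning, which is the "no meaningful axes" cost listed in the table.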
A serviceable default pipeline: PCA to ~50 dimensions first (cheap, removes noise and redundancy), then t-SNE/UMAP for looking, or the PCA coordinates themselves for modelling.