CNNs

What images already know, baked into the architecture

01 · First principles · What structure do images have?

An image is not an arbitrary vector of numbers; it has two enormous regularities. Locality: a pixel's meaning is determined almost entirely by its neighbours — an edge, a corner, a patch of fur are all small local patterns. Translation structure: a cat in the top-left corner and the same cat in the bottom-right are the same cat; the statistics of natural images are (approximately) the same everywhere in the frame.

A model that knows these two facts in advance does not have to learn them from data. A model that does not know them must spend data rediscovering them. CNNs are what you get when you build both facts directly into the wiring.

02 · Failure first · The dense layer, and the count that kills it

The naive approach: flatten the image and feed it to a fully connected layer. Count the parameters. A modest 224×224 RGB image has 224 × 224 × 3 = 150,528 inputs; a first hidden layer of just 1,000 units needs

150,528 × 1,000 ≈ 1.5 × 10⁸ weights  —  in the first layer alone

Worse than the count is what the layer ignores. Every unit connects to every pixel, so the architecture treats "pixel (3, 7)" and "pixel (210, 198)" as unrelated coordinates: shuffle the pixels of every image with the same fixed permutation and the layer would train exactly as well, because it never knew which pixels were neighbours. And a pattern learned at one location says nothing about the same pattern elsewhere: the cat must be re-learned at every position it might appear. The dense layer spends its parameters while refusing both gifts the data offers.
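A minimal sketch of both complaints, in plain Python. The helper names (`dense_param_count`, `dense_unit`) and the 12-value toy input are assumptions made for illustration; biases are left out to match the count above.

```python
import random


def dense_param_count(height, width, channels, hidden_units):
    """Weights (biases ignored) in a fully connected layer on a flattened image."""
    return height * width * channels * hidden_units


def dense_unit(weights, pixels):
    """One dense unit: a plain dot product over the flattened pixels."""
    return sum(w * x for w, x in zip(weights, pixels))


# The count from the text: 224 x 224 x 3 inputs into 1,000 hidden units.
first_layer_weights = dense_param_count(224, 224, 3, 1000)

# Permutation blindness on a toy 12-"pixel" input. Integer values keep the
# arithmetic exact, so the two outputs can be compared with ==.
rng = random.Random(0)
pixels = [rng.randint(0, 255) for _ in range(12)]
weights = [rng.randint(-5, 5) for _ in range(12)]

perm = list(range(12))
rng.shuffle(perm)
shuffled_pixels = [pixels[p] for p in perm]
shuffled_weights = [weights[p] for p in perm]

# Same pairs of (weight, pixel), just visited in a different order.
original_output = dense_unit(weights, pixels)
shuffled_output = dense_unit(shuffled_weights, shuffled_pixels)

print(first_layer_weights)
print(original_output == shuffled_output)
```

The second check is the point: for every scrambling of the pixels there is a matching scrambling of the weights that computes the same thing, so nothing in the layer prefers the arrangement in which neighbouring pixels are actually neighbours.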

03 · The fix · One small filter, slid everywhere

Convolution is the repair, and it is one idea applied twice. Make each unit look only at a small window (local receptive field — exploiting locality), and use the same weights for every window position (weight sharing — exploiting translation structure). A 3×3×3 filter has 27 weights instead of 150 million, and those 27 weights scan the whole image:

y[i, j] = Σ_a Σ_b w[a, b] · x[i+a, j+b]   (same w at every (i, j); one channel shown, a 3×3×3 filter also sums over the three input channels)

Weight sharing buys translation equivariance: shift the input and the feature map shifts with it, by construction rather than by training. A filter that detects a vertical edge detects it everywhere, having learned it anywhere.
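The sliding-window sum and the equivariance claim can be checked directly. The sketch below is a deliberately naive, single-channel, pure-Python version (no padding, stride 1); the function names, the hand-written vertical-edge filter, and the toy 7×7 image are assumptions chosen for illustration, not anything a trained network produced.

```python
def conv2d_valid(x, w):
    """y[i][j] = sum_a sum_b w[a][b] * x[i+a][j+b], no padding, stride 1."""
    kh, kw = len(w), len(w[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [
            sum(w[a][b] * x[i + a][j + b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shift(x, dy, dx):
    """Shift an image down by dy and right by dx (both >= 0), filling with 0."""
    return [
        [x[i - dy][j - dx] if i >= dy and j >= dx else 0 for j in range(len(x[0]))]
        for i in range(len(x))
    ]


# Hypothetical 3x3 vertical-edge filter; integer weights keep arithmetic exact.
edge_filter = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

# 7x7 image: a small bright block near the top-left on a dark background.
image = [[1 if 1 <= i <= 3 and 1 <= j <= 2 else 0 for j in range(7)]
         for i in range(7)]

feature = conv2d_valid(image, edge_filter)           # 5x5 feature map
shifted_feature = conv2d_valid(shift(image, 2, 3), edge_filter)

# Equivariance: the feature map of the shifted image is the shifted feature
# map, wherever both are defined.
equivariant = all(
    shifted_feature[i + 2][j + 3] == feature[i][j]
    for i in range(len(feature) - 2)
    for j in range(len(feature[0]) - 3)
)
print(len(feature), len(feature[0]), equivariant)
```

Nothing was trained here, which is the design point: the shift property falls out of reusing one `w` at every position, so it holds for any filter values.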

[Figure: the same 3×3 filter slides across a 7×7 input, computing w · x at each window position to produce a 5×5 feature map.]

27 shared weights replace 150M: the filter visits every position, so a pattern learned anywhere is detected everywhere.

Inductive bias as compressed knowledge: locality and translation symmetry are true facts about images, donated to the model for free instead of estimated from data. No-free-lunch logic says every learner must assume something; CNNs simply assume the right things for vision — which is also the warning, because where the assumption is wrong the bias becomes a cage.

04 · The stack · Receptive fields, pooling, hierarchy

One 3×3 filter sees almost nothing. The stack fixes this: each layer's units see a 3×3 window of the previous layer's outputs, so the region of the original image influencing a unit — its receptive field — grows with depth. Pooling (or stride-2 convolution) downsamples feature maps along the way, which both cuts compute and accelerates that growth, while adding a little local translation invariance: a max over a 2×2 window does not care where in the window the feature fired.
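How fast the receptive field grows, and what pooling adds, can be worked out with a few lines of arithmetic. This is a sketch under simplifying assumptions: one spatial axis, no dilation, and layers described only by hypothetical `(kernel_size, stride)` pairs; `receptive_field` and `max_pool_2x2` are illustrative helpers, not a library API.

```python
def receptive_field(layers):
    """Receptive field, in input pixels along one axis, of a stack of layers.

    layers: list of (kernel_size, stride) pairs, first layer first.
    """
    rf, jump = 1, 1  # jump: spacing in input pixels between adjacent units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra tap reaches one more jump out
        jump *= stride             # striding spreads later units further apart
    return rf


def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a list of lists with even sides."""
    return [
        [max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
         for j in range(0, len(x[0]), 2)]
        for i in range(0, len(x), 2)
    ]


conv, pool = (3, 1), (2, 2)
plain_stack = receptive_field([conv, conv, conv])               # 3 -> 5 -> 7
pooled_stack = receptive_field([conv, pool, conv, pool, conv])  # 3, 4, 8, 10, 18

# The same activation firing in two different cells of one 2x2 window.
fired_top_left = [[9, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
fired_bottom_right = [[0, 0, 0, 0], [0, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
same_after_pooling = max_pool_2x2(fired_top_left) == max_pool_2x2(fired_bottom_right)

print(plain_stack, pooled_stack, same_after_pooling)
```

Three stacked 3×3 convolutions see a 7-pixel span; interleaving two 2×2 pools stretches the same three convolutions to 18 pixels. The pooling check shows the invariance is strictly local: move the activation outside its 2×2 window and the pooled maps differ.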

What the layers learn, given this wiring, is a hierarchy that was hoped for and then actually observed in trained filters: edges → textures → parts → objects. Layer 1 filters of nearly every vision CNN converge to oriented edges and colour blobs (close to Gabor filters); middle layers respond to motifs and textures; late layers to wheels, faces, whole categories. Composition is the point: a wheel is an arrangement of curves, which are arrangements of edges.

ResNets in one line: let each block learn a residual correction, y = x + F(x), so identity is the default and gradients flow through the skip path — which is what allowed stacks of one hundred layers instead of twenty (the same additive-highway trick that gates give LSTMs).
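The "identity is the default" claim can be seen in a toy version of the block. In this sketch F is a hypothetical two-layer map on a small vector, standing in for the convolutions and normalisation of a real residual block; the weights are hand-picked, not learned.

```python
def relu(v):
    """Elementwise max(0, a)."""
    return [max(0.0, a) for a in v]


def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def residual_block(x, w1, w2):
    """y = x + F(x), with F(x) = w2 @ relu(w1 @ x) as a stand-in for F."""
    fx = matvec(w2, relu(matvec(w1, x)))
    return [x_i + f_i for x_i, f_i in zip(x, fx)]


x = [1.0, -2.0, 0.5]

# With F's weights at zero the block passes its input through unchanged.
zeros = [[0.0] * 3 for _ in range(3)]
identity_out = residual_block(x, zeros, zeros)

# With nonzero weights the block outputs x plus a correction.
w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w2 = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
corrected_out = residual_block(x, w1, w2)

print(identity_out, corrected_out)
```

Because the skip path adds `x` untouched, the derivative of `y` with respect to `x` is the identity plus the derivative of F, so a gradient always has a route through the block even when F contributes little.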

05 · The tradeoff · When the bias helps, when it caps you

A vision transformer treats an image as a sequence of patches and learns their spatial relationships from data (even the position embeddings are typically learned); a CNN is told them. The consequence is a data-dependent crossover:

| Regime | Winner | Why |
|---|---|---|
| Small / medium data (≤ ~1M images), edge deployment | CNN | The built-in assumptions substitute for the data you do not have; convolutions are cheap and hardware-friendly. |
| Internet-scale pretraining (hundreds of millions of images) | ViT | With enough data, learned spatial relations outgrow hard-coded ones: global attention finds long-range structure convolution cannot. Assumptions stop being a gift and start being a ceiling. |

This is the same story told in SVMs (fixed kernels lost to learned features) and it rhymes with transfer learning: at scale, the bitter-lesson pattern is that learned structure beats designed structure — but only at scale.

Mental Model