Skip the noise — learn the wind that blows noise into data
Diffusion already taught us that a deterministic ODE — the probability-flow ODE — can transport noise to data just as well as the stochastic process can. But there we obtained the ODE indirectly: first define a noising SDE, then learn its score, then convert. Flow matching asks the blunt question: if a velocity field is all we need in the end, why not learn the velocity field directly?
The object to learn is vθ(x, t): a time-dependent vector field, the "wind" at every point of space. Sampling means dropping a noise particle at t = 0 and letting the wind carry it:

$$\frac{dx}{dt} = v_\theta(x, t), \qquad x(0) = x_0 \sim p_0 \ \text{(noise)}$$
No Brownian motion, no thousand corrector steps, no variance schedules. One ODE, integrated from 0 to 1.
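As a minimal sketch of that integration, the forward-Euler loop below carries one particle from t = 0 to t = 1. The names `euler_sample` and `toy_velocity` are illustrative and not taken from any library. `toy_velocity` is a hand-written stand-in for a trained vθ: the straight-line wind toward a single fixed point.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # the field is only evaluated at t < 1
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float) -> Vector:
    """Hypothetical stand-in for a trained field: wind toward one fixed point."""
    target = [2.0, -1.0]
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


# Drop a "noise particle" and let the wind carry it.
sample = euler_sample(toy_velocity, [0.3, 0.7], num_steps=8)
```

With a real model, `velocity` would wrap a network forward pass, and the Euler loop could be swapped for any ODE solver.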
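The obvious training target: pick a probability path pt interpolating noise to data, with true marginal velocity ut(x), and regress on it:

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\; x \sim p_t} \left\| v_\theta(x, t) - u_t(x) \right\|^2$$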
This is circular in the same way naive score matching was: ut(x) is the velocity of the marginal flow, which at any point x mixes the contributions of all data points whose paths pass nearby. Writing it down requires an integral over the whole dataset weighted by intractable posterior probabilities. We cannot evaluate the target even once.
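To see the obstacle explicitly, the marginal velocity can be written as a posterior-weighted average of per-data-point velocities. The notation is an assumption of this sketch: q is the data distribution, p_t(x | x₁) is the density of x_t given the data point x₁, and u_t(x | x₁) is the velocity of the path conditioned on x₁.

$$
u_t(x) = \int u_t(x \mid x_1)\, \frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1,
\qquad
p_t(x) = \int p_t(x \mid x_1)\, q(x_1)\, dx_1
$$

The fraction is the posterior probability of x₁ given that a path passes through x at time t. Both it and the normaliser p_t(x) involve the full data distribution.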
Condition on a single data point x₁ and the impossible becomes trivial. Choose, per sample, the simplest conceivable path from a noise draw x₀ to x₁, a straight line:

$$x_t = (1 - t)\, x_0 + t\, x_1, \qquad \frac{d x_t}{dt} = x_1 - x_0$$
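Then regress the network on this per-sample velocity, with x_t the point at time t on the line from x₀ to x₁:

$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\; x_0,\; x_1} \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2$$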
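In code, one evaluation of this objective might look like the sketch below, written in plain Python with no autodiff. `cfm_loss`, `zero_model` and `single_point_model` are hypothetical names. Standard Gaussian noise and uniform t on [0, 1) are assumptions of the sketch rather than requirements of the method.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
Model = Callable[[Vector, float], Vector]


def cfm_loss(model: Model, data_batch: Sequence[Vector], rng: random.Random) -> float:
    """Monte Carlo estimate of the conditional flow matching loss on one batch."""
    total = 0.0
    for x1 in data_batch:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise draw (assumed standard Gaussian)
        t = rng.random()                        # time in [0, 1) (assumed uniform)
        xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # point on the line
        target = [b - a for a, b in zip(x0, x1)]              # velocity x1 - x0
        pred = model(xt, t)
        total += sum((p - u) ** 2 for p, u in zip(pred, target))
    return total / len(data_batch)


def zero_model(x: Vector, t: float) -> Vector:
    """Baseline that predicts no wind at all."""
    return [0.0 for _ in x]


def single_point_model(x: Vector, t: float) -> Vector:
    """Exact velocity field when the whole dataset is the single point c."""
    c = [2.0, -1.0]
    return [(ci - xi) / (1.0 - t) for ci, xi in zip(c, x)]


data = [[2.0, -1.0]] * 64
loss_exact = cfm_loss(single_point_model, data, random.Random(0))
loss_zero = cfm_loss(zero_model, data, random.Random(0))
```

With a one-point dataset no two paths carry different velocities through the same (x, t), so the exact field drives the loss to zero up to rounding. With more data points the minimum is positive, and only the gradient matters.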
The theorem that makes this legitimate: ∇θ LCFM = ∇θ LFM. The two losses differ only by a constant independent of θ, so they have identical gradients and identical minimisers. The argument takes three lines: expand both squares; the ‖vθ‖² terms match because xt has the same marginal either way; the cross terms match because the marginal velocity is, by definition, the conditional expectation E[x₁ − x₀ | xt = x]; what remains is θ-free. Regressing on a noisy-but-unbiased target trains the same network as regressing on the unobtainable clean one — the same move that rescued denoising score matching.
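The three lines can be written out as follows. This is a sketch of the argument as stated above, assuming all expectations are finite. Expectations are over t, x₀ and x₁, x_t is the corresponding point on the path, and ⟨·,·⟩ is the Euclidean inner product.

$$
\begin{aligned}
\mathcal{L}_{\mathrm{FM}}(\theta) &= \mathbb{E}\|v_\theta(x_t, t)\|^2 - 2\,\mathbb{E}\langle v_\theta(x_t, t),\, u_t(x_t)\rangle + \mathbb{E}\|u_t(x_t)\|^2 \\
\mathcal{L}_{\mathrm{CFM}}(\theta) &= \mathbb{E}\|v_\theta(x_t, t)\|^2 - 2\,\mathbb{E}\langle v_\theta(x_t, t),\, x_1 - x_0\rangle + \mathbb{E}\|x_1 - x_0\|^2 \\
\mathbb{E}\langle v_\theta(x_t, t),\, x_1 - x_0\rangle &= \mathbb{E}\big\langle v_\theta(x_t, t),\, \mathbb{E}[x_1 - x_0 \mid x_t]\big\rangle = \mathbb{E}\langle v_\theta(x_t, t),\, u_t(x_t)\rangle \\
\mathcal{L}_{\mathrm{CFM}}(\theta) - \mathcal{L}_{\mathrm{FM}}(\theta) &= \mathbb{E}\|x_1 - x_0\|^2 - \mathbb{E}\|u_t(x_t)\|^2
\end{aligned}
$$

The third line is the tower property of conditional expectation. The final difference contains no θ.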
An Euler step follows the tangent. On a curved path the tangent leaves the path, so steps must be small; on a straight path the tangent is the path, so one step is already exact.
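A small numerical check makes this concrete. The sketch below applies forward Euler to two hand-picked velocity fields, both assumptions of the example: a constant field, whose trajectories are straight lines, and a rotation field, whose trajectories are circular arcs.

```python
import math
from typing import Callable, List

Vector = List[float]


def euler(velocity: Callable[[Vector, float], Vector], x: Vector, num_steps: int) -> Vector:
    """Forward Euler for dx/dt = velocity(x, t) on the interval [0, 1]."""
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def constant_field(x: Vector, t: float) -> Vector:
    """Straight trajectories: the tangent never changes direction."""
    return [1.0, 2.0]


def rotation_field(x: Vector, t: float) -> Vector:
    """Circular trajectories: a quarter turn over the unit time interval."""
    omega = math.pi / 2.0
    return [-omega * x[1], omega * x[0]]


def endpoint_error(velocity, start: Vector, exact_end: Vector, num_steps: int) -> float:
    """Euclidean distance between the Euler endpoint and the exact endpoint."""
    end = euler(velocity, list(start), num_steps)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(end, exact_end)))


# Straight path: a single step already lands on the exact endpoint (1, 2).
straight_err = endpoint_error(constant_field, [0.0, 0.0], [1.0, 2.0], 1)

# Curved path from (1, 0) to (0, 1): the error shrinks only as steps are added.
curved_errs = [endpoint_error(rotation_field, [1.0, 0.0], [0.0, 1.0], n)
               for n in (1, 10, 100)]
```

The constant field is the idealised limit. A learned marginal field is only approximately straight, so a few steps are usually still needed.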
This is the practical heart of the method. Numerical integrators err in proportion to the curvature of the trajectory: each Euler step walks along the tangent, and curvature is precisely how fast the tangent turns. Diffusion's probability-flow trajectories are curved (the marginal velocity bends where paths cross), so they need tens of steps. Conditional straight-line paths keep the marginal flow only mildly curved, and rectified flow finishes the job: generate (noise, sample) pairs with a trained model, retrain on straight lines between those now-coupled pairs, and the paths straighten further. After a round or two, one to four steps suffice.
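One reflow round can be sketched in two steps: integrate the current model to couple each noise draw with the sample it produces, then evaluate the straight-line regression loss on those pairs. Everything named here (`make_reflow_pairs`, `reflow_loss`, `teacher`, `student`) is hypothetical. `teacher` is a toy field v(x, t) = x standing in for a trained model.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Model = Callable[[Vector, float], Vector]
Pair = Tuple[Vector, Vector]


def euler_sample(velocity: Model, x0: Vector, num_steps: int) -> Vector:
    """Forward Euler for dx/dt = velocity(x, t) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_reflow_pairs(teacher: Model, noise_batch: Sequence[Vector], num_steps: int) -> List[Pair]:
    """Couple each noise draw with the sample the current model maps it to."""
    return [(list(x0), euler_sample(teacher, x0, num_steps)) for x0 in noise_batch]


def reflow_loss(student: Model, pairs: Sequence[Pair], rng: random.Random) -> float:
    """Straight-line regression loss on coupled (noise, sample) pairs."""
    total = 0.0
    for x0, x1 in pairs:
        t = rng.random()
        xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
        target = [b - a for a, b in zip(x0, x1)]
        total += sum((p - u) ** 2 for p, u in zip(student(xt, t), target))
    return total / len(pairs)


def teacher(x: Vector, t: float) -> Vector:
    """Toy stand-in for a trained model: dx/dt = x."""
    return list(x)


NUM_STEPS = 20
GROWTH = (1.0 + 1.0 / NUM_STEPS) ** NUM_STEPS  # Euler maps x0 to GROWTH * x0 here


def student(x: Vector, t: float) -> Vector:
    """Closed-form straight-line field for the toy coupling x1 = GROWTH * x0."""
    scale = (GROWTH - 1.0) / (1.0 + t * (GROWTH - 1.0))
    return [scale * xi for xi in x]


noise_rng = random.Random(0)
noise = [[noise_rng.gauss(0.0, 1.0), noise_rng.gauss(0.0, 1.0)] for _ in range(32)]
pairs = make_reflow_pairs(teacher, noise, NUM_STEPS)
loss = reflow_loss(student, pairs, random.Random(1))
```

In this toy coupling the straight lines never cross, so a student field can fit them exactly. That is not guaranteed for a real model, where a reflow round is only expected to reduce crossings.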
Flow matching is not a rival theory; it is the same continuous-time picture entered through a different door. Choose a Gaussian probability path of a particular variance schedule and the marginal velocity field is an affine function of the score — the probability-flow ODE of a VP diffusion drops out as a special case. What changes is the parameterisation and the default geometry: predict velocity rather than noise, prefer straight conditional paths rather than variance-preserving curves, train by simulation-free regression either way.
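For a concrete instance, take a general Gaussian path x_t = α_t x₁ + σ_t x₀ with x₀ standard Gaussian. The symbols α_t and σ_t and the Gaussian assumption on x₀ are introduced here for illustration. Using the identity ∇x log p_t(x) = −E[x₀ | x_t = x] / σ_t, the marginal velocity works out as:

$$
\begin{aligned}
u_t(x) &= \dot\alpha_t\, \mathbb{E}[x_1 \mid x_t = x] + \dot\sigma_t\, \mathbb{E}[x_0 \mid x_t = x] \\
&= \frac{\dot\alpha_t}{\alpha_t}\, x + \sigma_t \left( \frac{\dot\alpha_t}{\alpha_t}\, \sigma_t - \dot\sigma_t \right) \nabla_x \log p_t(x)
\end{aligned}
$$

The second line substitutes E[x₀ | x_t = x] = −σ_t ∇x log p_t(x) and E[x₁ | x_t = x] = (x + σ_t² ∇x log p_t(x)) / α_t. For the straight-line path α_t = t, σ_t = 1 − t this reduces to u_t(x) = (x + (1 − t) ∇x log p_t(x)) / t, which is affine in the score for each fixed t > 0.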
| | Diffusion (score) | Flow matching |
|---|---|---|
| Learned field | ∇x log pt(x) | velocity v(x, t) |
| Conditional target | −ε/σ (the added noise) | x₁ − x₀ (a constant) |
| Default path shape | curved (VP schedule) | straight lines |
| Sampling cost driver | integrator order × curvature | same — but curvature is engineered down |