Applied ML

Pipeline Parallelism

Split by depth, then fight the bubble

01 · First principles: Split where the model is already thin

A deep network is a chain: layer k consumes exactly what layer k−1 produced. That makes depth the cheapest place to cut. Give the first third of the layers to GPU 0, the next third to GPU 1, and so on; the only thing that ever crosses a device boundary is one activation tensor per cut, point-to-point, between neighbours.
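As a minimal sketch of the cut described above, the helper below assigns contiguous runs of layer indices to stages. The name `partition_layers` and the 24-layer example are illustrative assumptions, and the split is by layer count only; a real partitioner would also balance compute and memory per stage.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Assign contiguous, near-equal runs of layer indices to each stage."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)  # earlier stages absorb the remainder
        stages.append(range(start, start + size))
        start += size
    return stages


# Hypothetical 24-layer model cut into thirds, one third per GPU.
stages = partition_layers(num_layers=24, num_stages=3)
for gpu, layers in enumerate(stages):
    print(f"GPU {gpu}: layers {layers.start}..{layers.stop - 1}")

# One activation tensor crosses each cut; there are num_stages - 1 cuts.
num_cuts = len(stages) - 1
```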

Compare the communication appetites: tensor parallelism allreduces activations inside every block; FSDP gathers parameters for every layer. Pipeline parallelism sends one activation tensor per stage boundary on the way forward and one gradient tensor on the way back. Of the three it is the cheapest to run across slow inter-node links, which is exactly where it is used.

02 · Failure first: The naive schedule idles almost everyone

Run one batch through p stages naively and the chain dependency bites: while stage 0 computes, stages 1..p−1 wait for input; while stage p−1 computes, everyone else has nothing to do. Forward marches down, backward marches back up, and at any instant exactly one device is busy. With p stages, utilisation is 1/p — eight GPUs doing the work of one, expensively.
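The 1/p figure can be checked with a toy timeline. This sketch assumes every stage takes the same time, with a backward pass costing twice a forward pass; `naive_schedule` and its unit costs are illustrative, not measurements.

```python
def naive_schedule(p: int, fwd: float = 1.0, bwd: float = 2.0):
    """Return (events, wall_clock) for one batch pushed through p stages in turn."""
    events, t = [], 0.0
    for stage in range(p):  # forward marches down the chain
        events.append((stage, "F", t, t + fwd))
        t += fwd
    for stage in reversed(range(p)):  # backward marches back up
        events.append((stage, "B", t, t + bwd))
        t += bwd
    return events, t


p = 8
events, wall_clock = naive_schedule(p)

# Sum the busy time of each device and divide by the total wall-clock time.
busy = [0.0] * p
for stage, _kind, start, end in events:
    busy[stage] += end - start
utilisation = [b / wall_clock for b in busy]

print(f"wall clock: {wall_clock}, per-device utilisation: {utilisation[0]:.3f}")
```

No two events overlap in time, which is the "exactly one device is busy" property stated above.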

The idle area in the schedule diagram is called the bubble, and shrinking it is the entire subject.

03 · The mechanism: Microbatches fill the bubble

GPipe's fix: split the batch into m microbatches and stream them. As soon as stage 0 finishes microbatch 1 it starts microbatch 2, while stage 1 works on microbatch 1. Gradients from all microbatches accumulate (see gradient accumulation: same mathematics) and the optimizer steps once at the end, so the update matches the unsplit batch up to floating-point summation order, provided the loss is normalised over the full batch and no layer (BatchNorm, for example) depends on batch statistics.
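The equivalence can be seen on a toy problem. The sketch below assumes a one-parameter model w·x with a mean-squared-error loss and made-up data, and uses exact rational arithmetic so the comparison is not blurred by rounding; with floats the two results would agree only up to summation order.

```python
from fractions import Fraction


def grad_sum(w, xs, ys):
    """Sum over examples of d/dw (w*x - y)**2."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


# Hypothetical data: a batch of 8 examples.
w = Fraction(1, 2)
xs = [Fraction(v) for v in (1, 2, 3, 4, 5, 6, 7, 8)]
ys = [Fraction(v) for v in (2, 3, 5, 4, 6, 9, 8, 7)]
n = len(xs)

# Gradient of the mean loss over the whole batch at once.
full_grad = grad_sum(w, xs, ys) / n

# The same batch as m = 4 microbatches, gradients accumulated before one step.
m = 4
size = n // m
accumulated = Fraction(0)
for i in range(m):
    lo, hi = i * size, (i + 1) * size
    # Normalise by the full batch size n, not by the microbatch size.
    accumulated += grad_sum(w, xs[lo:hi], ys[lo:hi]) / n

assert accumulated == full_grad
```

Dividing each microbatch's contribution by the full batch size is the detail that keeps the accumulated gradient equal to the unsplit one.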

[Figure: pipeline schedules for p = 4 stages on GPU0 to GPU3, showing forward (F) and backward (B) blocks, with backward costing about 2× forward. Top panel, naive schedule with one batch: about 75% idle (bubble). Bottom panel, GPipe with m = 4 microbatches: the bubble shrinks as m grows, and the optimizer steps at the dashed line.]

Top: one batch, one device busy at a time. Bottom: four microbatches keep the ramp-up and drain at the edges and fill the middle with work.

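Only the ramp-up and drain remain idle. If every stage takes the same time per microbatch, each phase lasts m + p − 1 time slots, of which a stage works for m, so the idle share is exact: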

bubble fraction  =  (p − 1) / (m + p − 1)
where p is the number of stages and m is the number of microbatches

With p = 8 and m = 32 the bubble is 7/39 ≈ 18%; push m to 64 and it is 7/71, just under 10%. The lever is m, and its limit is that microbatches cannot shrink forever: too small and each stage's kernels stop saturating the GPU.
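To see how quickly the bubble shrinks, the formula can be tabulated for a few values of m. This is a small sketch of the arithmetic only; `bubble_fraction` is an illustrative name, and it inherits the formula's assumption that every stage takes equal time per microbatch.

```python
from fractions import Fraction


def bubble_fraction(p: int, m: int) -> Fraction:
    """Idle share of a p-stage pipeline fed m microbatches: (p - 1) / (m + p - 1)."""
    return Fraction(p - 1, m + p - 1)


p = 8
for m in (1, 8, 32, 64, 128):
    f = bubble_fraction(p, m)
    print(f"m={m:>3}: bubble = {f} ≈ {float(f):.1%}")
```

With m = 1 the formula gives (p − 1)/p, which is the naive schedule's idle share of 1 − 1/p.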

04 · Second failure: GPipe hoards activations; 1F1B releases them

GPipe runs all forwards, then all backwards. Every stage must therefore hold activations for all m in-flight microbatches at once — activation memory grows linearly in m, the very knob we wanted to turn up.

The 1F1B schedule interleaves instead: after warm-up, each stage alternates one forward and one backward, so a microbatch's activations are freed by its backward soon after its forward. In-flight activations drop from m microbatches to at most p, while the bubble fraction stays the same. This is why 1F1B (and its interleaved variants) is the production default and GPipe is mostly pedagogy. Checkpointing within each stage stacks on top for further activation savings.
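The memory difference shows up in the order of operations alone. The sketch below lists the per-stage sequence of forwards and backwards for GPipe and for non-interleaved 1F1B, assuming the commonly described warm-up of p − stage − 1 forwards, then counts how many microbatches hold activations at once. It models ordering only, not timing or communication, and the function names are illustrative.

```python
def gpipe_ops(m: int) -> list[tuple[str, int]]:
    """All forwards, then all backwards (backward order does not affect the count)."""
    return [("F", i) for i in range(m)] + [("B", i) for i in range(m)]


def one_f_one_b_ops(stage: int, p: int, m: int) -> list[tuple[str, int]]:
    """Warm-up forwards, then alternate one forward and one backward, then drain."""
    warmup = min(p - stage - 1, m)
    ops = [("F", i) for i in range(warmup)]
    for i in range(m - warmup):  # steady state: one forward, one backward
        ops.append(("F", warmup + i))
        ops.append(("B", i))
    ops += [("B", i) for i in range(m - warmup, m)]  # drain remaining backwards
    return ops


def peak_in_flight(ops: list[tuple[str, int]]) -> int:
    """Largest number of microbatches whose activations are held simultaneously."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


p, m = 4, 8
gpipe_peak = peak_in_flight(gpipe_ops(m))
one_f_one_b_peaks = [peak_in_flight(one_f_one_b_ops(s, p, m)) for s in range(p)]
print("GPipe peak per stage:", gpipe_peak)
print("1F1B peak per stage: ", one_f_one_b_peaks)
```

Under these assumptions the first stage holds the most activations (p microbatches) and the last stage holds one, while GPipe holds all m on every stage.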

05 · The costs: What pipeline parallelism never fixes

Placement rule: PP communicates least, so it goes across the slowest links — between nodes — while TP stays inside the node and data parallelism (DDP/FSDP) spans the remaining axis.
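One way to realise the placement rule is in how global ranks are numbered. The sketch below is an assumption-laden illustration: it supposes 8 GPUs per node, ranks numbered consecutively within a node, and a hypothetical 32-GPU job with TP = 8, DP = 2, PP = 2. Making the TP index vary fastest keeps each TP group inside one node, and making the PP index vary slowest pushes pipeline neighbours onto different nodes.

```python
from itertools import product

GPUS_PER_NODE = 8    # assumption: ranks 0..7 share node 0, 8..15 node 1, and so on
TP, DP, PP = 8, 2, 2  # hypothetical parallel sizes; TP * DP * PP = 32 GPUs


def rank_of(pp: int, dp: int, tp: int) -> int:
    """Global rank with TP varying fastest and PP varying slowest."""
    return (pp * DP + dp) * TP + tp


def node_of(rank: int) -> int:
    return rank // GPUS_PER_NODE


# Every (pp, dp, tp) coordinate maps to a distinct rank.
all_ranks = sorted(rank_of(s, d, t) for s, d, t in product(range(PP), range(DP), range(TP)))

tp_group = [rank_of(0, 0, t) for t in range(TP)]  # ranks that allreduce together
pp_group = [rank_of(s, 0, 0) for s in range(PP)]  # ranks that form one pipeline

print("TP group nodes:", {node_of(r) for r in tp_group})
print("PP group nodes:", [node_of(r) for r in pp_group])
```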
Mental Model