Split by depth, then fight the bubble
A deep network is a chain: layer k consumes exactly what layer k−1 produced. That makes depth the cheapest place to cut. Give the first third of the layers to GPU 0, the next third to GPU 1, and so on; the only thing that ever crosses a device boundary is one activation tensor per cut on the forward pass, and its gradient on the way back, point-to-point, between neighbours.
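A minimal sketch of that split, assuming layers are identified by index and cost roughly the same to run. `partition_layers` is a hypothetical helper written for illustration, not an API of any framework.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# 24 layers over 3 devices: layers 0-7, 8-15, 16-23.
for device, layers in enumerate(partition_layers(24, 3)):
    print(f"GPU {device}: layers {layers.start}..{layers.stop - 1}")
```

Equal layer counts are only a starting point; real partitions are usually balanced by measured time per layer, since the slowest stage sets the pace for the whole pipeline.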
Compare the communication appetites: tensor parallelism allreduces activations inside every block; FSDP gathers parameters for every layer. Pipeline parallelism sends one tensor per stage boundary. It is the only parallelism cheap enough to be comfortable across slow inter-node links — which is exactly where it is used.
Run one batch through p stages naively and the chain dependency bites: while stage 0 computes, stages 1..p−1 wait for input; while stage p−1 computes, everyone else has nothing to do. Forward marches down, backward marches back up, and at any instant exactly one device is busy. With p stages, utilisation is 1/p: at p = 8, eight GPUs do the work of one, expensively.
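The naive schedule can be drawn as a small grid to check the 1/p figure. This is a toy model, assuming a forward and a backward step each take one time slot on every stage (in practice backward is usually the longer of the two); `naive_timeline` and `utilisation` are illustrative names.

```python
def naive_timeline(p: int) -> list[list[str]]:
    """One row per stage, one column per time slot; '.' means idle."""
    slots = 2 * p
    grid = [["."] * slots for _ in range(p)]
    for stage in range(p):
        grid[stage][stage] = "F"              # forward marches down the stages
        grid[stage][slots - 1 - stage] = "B"  # backward marches back up
    return grid


def utilisation(grid: list[list[str]]) -> float:
    """Fraction of (stage, slot) cells in which a device is doing work."""
    busy = sum(cell != "." for row in grid for cell in row)
    return busy / (len(grid) * len(grid[0]))


for row in naive_timeline(4):
    print("".join(row))
# F......B
# .F....B.
# ..F..B..
# ...FB...
print(utilisation(naive_timeline(4)))  # 0.25, i.e. 1/p for p = 4
```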
The idle area in the schedule diagram is called the bubble, and shrinking it is the entire subject.
GPipe's fix: split the batch into m microbatches and stream them. As soon as stage 0 finishes microbatch 1 it starts microbatch 2, while stage 1 works on microbatch 1. Gradients from all microbatches accumulate (see gradient accumulation — same mathematics) and the optimizer steps once at the end, so the result is identical to the unsplit batch.
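The equivalence can be checked on a toy problem. The sketch below assumes a one-parameter model y = w·x with a mean-squared-error loss, and that m divides the batch size; the data and function names are made up for illustration. Each microbatch's gradient is scaled by the full batch size, so the accumulated sum matches the unsplit gradient up to floating-point summation order.

```python
import math


def grad_sum(w: float, xs: list[float], ys: list[float]) -> float:
    """Sum over samples of d/dw (w*x - y)**2."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def full_batch_grad(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of the mean loss over the whole batch."""
    return grad_sum(w, xs, ys) / len(xs)


def microbatched_grad(w: float, xs: list[float], ys: list[float], m: int) -> float:
    """Accumulate gradients over m microbatches, one optimizer step's worth."""
    n = len(xs)
    if n % m != 0:
        raise ValueError("this sketch assumes m divides the batch size")
    size = n // m
    acc = 0.0
    for k in range(m):
        lo, hi = k * size, (k + 1) * size
        # Divide by the FULL batch size so the accumulated value is the batch mean.
        acc += grad_sum(w, xs[lo:hi], ys[lo:hi]) / n
    return acc


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
print(math.isclose(full_batch_grad(0.5, xs, ys), microbatched_grad(0.5, xs, ys, 4)))
```

The equivalence relies on the loss being a plain mean over samples; layers whose output depends on batch statistics, such as batch normalisation, see different statistics per microbatch and break it.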
Top: one batch, one device busy at a time. Bottom: four microbatches keep the ramp-up and drain at the edges and fill the middle with work.
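Only the ramp-up and drain remain idle, and their share is exact. If every microbatch takes the same time on every stage, each stage works for m of the m + p − 1 slots in a pass and idles for the other p − 1:
bubble fraction = (p − 1) / (m + p − 1)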
With p = 8 and m = 32 the bubble is 7/39 ≈ 18%; push m to 64 and it is 7/71, just under 10%. The lever is m, and its limit is that microbatches cannot shrink forever: too small and each stage's kernels stop saturating the GPU.
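To see how the bubble responds to m, the arithmetic can be wrapped in a small helper. This assumes the usual idealisation behind the 7/39 figure: equal-length slots, of which each stage idles for p − 1 out of m + p − 1. `bubble_fraction` is an illustrative name.

```python
from fractions import Fraction


def bubble_fraction(p: int, m: int) -> Fraction:
    """Idle share of a pipeline with p stages and m microbatches."""
    if p < 1 or m < 1:
        raise ValueError("p and m must be positive")
    return Fraction(p - 1, m + p - 1)


for m in (1, 8, 32, 64, 128):
    frac = bubble_fraction(8, m)
    print(f"p=8 m={m:>3}: {frac} = {float(frac):.1%}")
```

Setting m = 1 recovers the naive schedule: the bubble is 7/8, which is the 1/p utilisation seen earlier. Doubling m roughly halves the bubble only once m is well above p.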
GPipe runs all forwards, then all backwards. Every stage must therefore hold activations for all m in-flight microbatches at once — activation memory grows linearly in m, the very knob we wanted to turn up.
The 1F1B schedule interleaves instead: after warm-up, each stage alternates one forward and one backward, so a microbatch's activations are freed by its backward soon after its forward. In-flight activations drop from m microbatches to at most p, while the bubble fraction stays the same. This is why 1F1B (and its interleaved variants) is the production default and GPipe is mostly pedagogy. Checkpointing within each stage stacks on top for further activation savings.
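The memory difference between the two schedules can be counted directly. The sketch below models only the order of operations on each stage, not their timing across stages, and assumes the common non-interleaved 1F1B layout in which stage s (counting from 0) runs p − 1 − s warm-up forwards before alternating. The function names are illustrative.

```python
def gpipe_ops(stage: int, p: int, m: int) -> list[str]:
    """GPipe: every forward first, then every backward (same on all stages)."""
    return ["F"] * m + ["B"] * m


def one_f_one_b_ops(stage: int, p: int, m: int) -> list[str]:
    """Non-interleaved 1F1B: warm-up forwards, F/B pairs, then drain backwards."""
    warmup = min(p - 1 - stage, m)
    steady = m - warmup
    return ["F"] * warmup + ["F", "B"] * steady + ["B"] * warmup


def peak_in_flight(ops: list[str]) -> int:
    """Largest number of microbatches whose activations are held at once."""
    live = peak = 0
    for op in ops:
        live += 1 if op == "F" else -1  # a backward frees that microbatch
        peak = max(peak, live)
    return peak


p, m = 4, 8
print([peak_in_flight(gpipe_ops(s, p, m)) for s in range(p)])        # [8, 8, 8, 8]
print([peak_in_flight(one_f_one_b_ops(s, p, m)) for s in range(p)])  # [4, 3, 2, 1]
```

Under these assumptions the 1F1B peak on stage s is min(p − s, m): the first stage carries the most in-flight microbatches and the last stage only one, while GPipe holds all m everywhere.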