Applied ML

Pipeline Parallelism

Split by depth, then fight the bubble

01 · First principles: Split where the model is already thin

A deep network is a chain: layer k consumes exactly what layer k−1 produced. That makes depth the cheapest place to cut. Give the first third of the layers to GPU 0, the next third to GPU 1, and so on; the only thing that crosses a device boundary is one activation tensor per cut on the forward pass (and its gradient on the backward pass), point-to-point, between neighbours.
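The cut itself can be sketched in a few lines. This is a minimal illustration that assumes every layer costs the same, so an even split by count is good enough; `split_layers` is a hypothetical helper name, not part of any library.

```python
def split_layers(num_layers: int, num_stages: int) -> list[range]:
    """Assign contiguous layer indices to stages, as evenly as possible."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


for gpu, layers in enumerate(split_layers(num_layers=24, num_stages=3)):
    print(f"GPU {gpu}: layers {layers.start}..{layers.stop - 1}")
# GPU 0: layers 0..7
# GPU 1: layers 8..15
# GPU 2: layers 16..23
```

Each stage is a contiguous range, so every cut has exactly one upstream and one downstream neighbour.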

Compare the communication appetites: tensor parallelism all-reduces activations inside every block; FSDP all-gathers parameters for every layer. Pipeline parallelism sends one tensor per stage boundary. Of the three, it is the one that stays cheap across slow inter-node links, which is exactly where it is used.

02 · Failure first: The naive schedule idles almost everyone

Run one batch through p stages naively and the chain dependency bites: while stage 0 computes, stages 1..p−1 wait for input; while stage p−1 computes, everyone else has nothing to do. Forward marches down, backward marches back up, and at any instant exactly one device is busy. With p stages, utilisation is 1/p — eight GPUs doing the work of one, expensively.

The idle area in the schedule diagram is called the bubble, and shrinking it is the entire subject.

03 · The mechanism: Microbatches fill the bubble

GPipe's fix: split the batch into m microbatches and stream them. As soon as stage 0 finishes microbatch 1 it starts microbatch 2, while stage 1 works on microbatch 1. Gradients from all microbatches accumulate (the same mathematics as gradient accumulation) and the optimizer steps once at the end, so the update matches the unsplit batch up to floating-point summation order, provided no layer depends on batch-wide statistics (batch normalisation is the usual exception).
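The equivalence can be checked on a toy problem. The sketch below assumes a one-parameter linear model with a mean-squared-error loss and equal-sized microbatches; the data values and the name `grad_mse` are made up for illustration.

```python
import math


def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """d/dw of mean((w*x - y)^2) over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 3.9, 6.1, 8.0, 9.8, 12.1, 14.0, 16.2]

# Gradient of the whole batch in one go.
full_grad = grad_mse(w, xs, ys)

# Same batch as m equal microbatches; each contributes 1/m of its mean gradient.
m = 4
size = len(xs) // m
accumulated = 0.0
for k in range(m):
    lo, hi = k * size, (k + 1) * size
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / m

# Equal up to floating-point summation order.
assert math.isclose(full_grad, accumulated, rel_tol=1e-9)
```

If the microbatches had unequal sizes, each contribution would need to be weighted by its share of the samples rather than by 1/m.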

Top: one batch, one device busy at a time. Bottom: four microbatches keep the ramp-up and drain at the edges and fill the middle with work.
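The two schedules in the diagram can be regenerated as text. This is a simplified sketch: it assumes the forward and the backward of one microbatch each occupy exactly one time slot on every stage, and the helper names (`gpipe_grid`, `idle_fraction`, `show`) are hypothetical.

```python
def gpipe_grid(p: int, m: int) -> list[list[str]]:
    """Return a p x T grid of cells: 'F<k>', 'B<k>', or '.' for idle.

    Assumption: every forward and every backward takes one slot.
    """
    total_slots = 2 * (m + p - 1)
    grid = [["." for _ in range(total_slots)] for _ in range(p)]
    backward_start = m + p - 1  # first slot after the last forward finishes
    for stage in range(p):
        for mb in range(m):
            # Forward of microbatch mb reaches `stage` after `stage` hops.
            grid[stage][stage + mb] = f"F{mb + 1}"
            # Backward starts at the last stage and walks back up.
            grid[stage][backward_start + (p - 1 - stage) + mb] = f"B{mb + 1}"
    return grid


def idle_fraction(grid: list[list[str]]) -> float:
    cells = [cell for row in grid for cell in row]
    return cells.count(".") / len(cells)


def show(grid: list[list[str]]) -> None:
    for stage, row in enumerate(grid):
        print(f"stage {stage}: " + " ".join(f"{cell:>3}" for cell in row))


naive = gpipe_grid(p=4, m=1)      # one batch: one device busy at a time
streamed = gpipe_grid(p=4, m=4)   # four microbatches fill the middle
show(naive)
print(f"idle: {idle_fraction(naive):.0%}")
show(streamed)
print(f"idle: {idle_fraction(streamed):.0%}")
```

With m = 1 the idle share is (p − 1)/p, which is the 1/p utilisation of the naive schedule; with more microbatches only the triangular ramp-up and drain remain empty.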

Only the ramp-up and drain remain idle, and their share is exact:

bubble fraction = (p − 1) / (m + p − 1)

where p is the number of stages and m the number of microbatches.
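A short derivation shows where the formula comes from. It assumes every stage spends the same time on each microbatch, written here as t_f for the forward and t_b for the backward; these two symbols are introduced only for this sketch.

```latex
\begin{aligned}
T_{\text{step}} &= (m + p - 1)\,(t_f + t_b)
  && \text{each stage waits through } p - 1 \text{ extra slots of ramp-up and drain} \\
T_{\text{busy}} &= m\,(t_f + t_b)
  && \text{useful work per stage} \\
\text{bubble fraction} &= \frac{T_{\text{step}} - T_{\text{busy}}}{T_{\text{step}}}
  = \frac{p - 1}{m + p - 1}
\end{aligned}
```

Setting m = 1 gives (p − 1)/p, the naive schedule in which each device works 1/p of the time.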

With p = 8 and m = 32 the bubble is 7/39 ≈ 18%; push m to 64 and it is 7/71, just under 10%. The lever is m, and its limit is that microbatches cannot shrink forever: too small and each stage's kernels stop saturating the GPU.
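The arithmetic is easy to script when sizing a run. The sketch below evaluates the formula and searches for the smallest m that meets a target bubble; `min_microbatches` is a hypothetical helper, and the search ignores the kernel-saturation limit on microbatch size.

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle share of a step with p stages and m microbatches."""
    return (p - 1) / (m + p - 1)


def min_microbatches(p: int, target: float) -> int:
    """Smallest m whose bubble fraction is at most `target`."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    m = 1
    while bubble_fraction(p, m) > target:
        m += 1
    return m


print(f"{bubble_fraction(8, 32):.1%}")   # 17.9%
print(f"{bubble_fraction(8, 64):.1%}")   # 9.9%
print(min_microbatches(p=8, target=0.09))  # 71
```

Because the bubble falls off as roughly p/m, halving it costs about twice as many microbatches, which is why small global batches leave little room to manoeuvre.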

04 · Second failure: GPipe hoards activations; 1F1B releases them

GPipe runs all forwards, then all backwards. Every stage must therefore hold activations for all m in-flight microbatches at once — activation memory grows linearly in m, the very knob we wanted to turn up.

The 1F1B schedule interleaves instead: after warm-up, each stage alternates one forward and one backward, so a microbatch's activations are freed by its backward soon after its forward. In-flight activations drop from m microbatches to at most p, while the bubble fraction stays the same. This is why 1F1B (and its interleaved variants) is the production default and GPipe is mostly pedagogy. Checkpointing within each stage stacks on top for further activation savings.
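The memory difference can be counted directly from the order of operations on each stage. The sketch below models plain (non-interleaved) 1F1B with p − stage − 1 warm-up forwards on each stage, which is one common formulation and is assumed here; it tracks only how many microbatches have run forward but not yet backward, not timing or bytes.

```python
def gpipe_ops(m: int) -> list[str]:
    """GPipe order on any stage: all forwards, then all backwards."""
    return ["F"] * m + ["B"] * m


def one_f_one_b_ops(stage: int, p: int, m: int) -> list[str]:
    """Non-interleaved 1F1B order on `stage` (0-indexed) of p stages."""
    warmup = min(p - stage - 1, m)   # forwards issued before the first backward
    steady = m - warmup              # remaining forwards, each paired with a backward
    ops = ["F"] * warmup
    for _ in range(steady):
        ops += ["F", "B"]
    ops += ["B"] * warmup            # drain the warm-up microbatches
    return ops


def peak_in_flight(ops: list[str]) -> int:
    """Largest number of microbatches whose activations are held at once."""
    live = peak = 0
    for op in ops:
        live += 1 if op == "F" else -1
        peak = max(peak, live)
    return peak


p, m = 4, 16
print("GPipe:", [peak_in_flight(gpipe_ops(m)) for _ in range(p)])
# GPipe: [16, 16, 16, 16]
print("1F1B: ", [peak_in_flight(one_f_one_b_ops(s, p, m)) for s in range(p)])
# 1F1B:  [4, 3, 2, 1]
```

Under this model the first stage holds the most (p microbatches) and the last stage holds only one, so raising m no longer raises peak activation memory.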

05 · The costs: What pipeline parallelism never fixes

The bubble never reaches zero. Some fraction of every step is structurally idle; you pay it forever, and it grows whenever m must shrink (small global batches make PP unattractive).
Load balance is your problem. Stages must take equal time or everyone waits for the slowest; embeddings, the loss layer, and uneven layer counts make balancing real engineering rather than an afterthought.
Implementation complexity. Schedules, in-flight activation bookkeeping, and the interaction with the optimizer step make PP the most intrusive of the parallelisms to a codebase.
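The load-balance point can be made concrete with a small partitioner. This is a sketch under stated assumptions: per-layer costs are known in advance (the numbers below are invented, with a heavy first and last layer standing in for the embedding and the loss head), stages must be contiguous, and the goal is to minimise the slowest stage. `balance_stages` is a hypothetical name.

```python
from itertools import accumulate


def balance_stages(costs: list[float], p: int) -> list[list[int]]:
    """Split layers into p contiguous stages, minimising the slowest stage's cost."""
    n = len(costs)
    if not 1 <= p <= n:
        raise ValueError("need 1 <= p <= number of layers")
    prefix = [0.0, *accumulate(costs)]  # prefix[i] = sum(costs[:i])
    inf = float("inf")
    # best[k][i]: smallest achievable bottleneck when the first i layers form k stages
    best = [[inf] * (n + 1) for _ in range(p + 1)]
    cut = [[0] * (n + 1) for _ in range(p + 1)]
    best[0][0] = 0.0
    for k in range(1, p + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # last stage holds layers j..i-1
                bottleneck = max(best[k - 1][j], prefix[i] - prefix[j])
                if bottleneck < best[k][i]:
                    best[k][i], cut[k][i] = bottleneck, j
    # Walk the cut table backwards to recover the stage boundaries.
    stages, i = [], n
    for k in range(p, 0, -1):
        j = cut[k][i]
        stages.append(list(range(j, i)))
        i = j
    return stages[::-1]


costs = [4, 1, 1, 1, 1, 1, 1, 1, 1, 6]  # illustrative per-layer times
for stage, layers in enumerate(balance_stages(costs, p=3)):
    print(stage, layers, sum(costs[i] for i in layers))
# 0 [0, 1, 2] 6
# 1 [3, 4, 5, 6, 7, 8] 6
# 2 [9] 6
```

An even split by layer count (for example 4, 3, 3) would leave the last stage with a cost of 8 against a best possible 6, and every other stage would wait for it on each microbatch.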

Placement rule: PP communicates least, so it goes across the slowest links — between nodes — while TP stays inside the node and data parallelism (DDP/FSDP) spans the remaining axis.
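One way to realise the placement rule is through the order in which ranks are assigned to the three axes. The layout below is an assumption made for illustration (TP varies fastest, PP slowest, ranks fill nodes in order); real frameworks define their own rank orderings, and `rank_to_coord` and `node_of` are hypothetical helpers.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    pp: int  # pipeline stage
    dp: int  # data-parallel replica
    tp: int  # tensor-parallel shard


def rank_to_coord(rank: int, tp_size: int, dp_size: int, pp_size: int) -> Coord:
    """Map a global rank to (pp, dp, tp) with TP fastest-varying and PP slowest."""
    if not 0 <= rank < tp_size * dp_size * pp_size:
        raise ValueError("rank out of range")
    tp = rank % tp_size
    dp = (rank // tp_size) % dp_size
    pp = rank // (tp_size * dp_size)
    return Coord(pp, dp, tp)


def node_of(rank: int, gpus_per_node: int) -> int:
    """Assumes ranks are assigned to nodes in consecutive blocks."""
    return rank // gpus_per_node


TP, DP, PP, GPUS_PER_NODE = 8, 2, 4, 8  # 64 GPUs on 8 nodes

# Ranks 0..7 differ only in their TP index, so the TP group shares node 0.
tp_group = [r for r in range(TP * DP * PP)
            if rank_to_coord(r, TP, DP, PP)[:2] == (0, 0)]
print(tp_group, {node_of(r, GPUS_PER_NODE) for r in tp_group})

# The pipeline neighbour of rank 0 is TP * DP ranks away, on another node.
neighbour = TP * DP
print(rank_to_coord(neighbour, TP, DP, PP), node_of(neighbour, GPUS_PER_NODE))
```

With TP equal to the number of GPUs per node, all the chatty tensor-parallel traffic stays on the fast intra-node links, and only the point-to-point stage transfers cross nodes.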

Mental Model

Cutting by depth moves the least data of any parallelism: one activation tensor per boundary, neighbour to neighbour.
The chain dependency creates the bubble; microbatches do not remove it, they amortise it: (p−1)/(m+p−1).
GPipe buys utilisation with activation memory; 1F1B gets the same bubble while holding at most p microbatches in flight.
The schedule diagram is the tool: every PP idea is a rearrangement of F and B cells on that grid.
PP belongs on the slow links; its real ongoing costs are the residual bubble and stage balancing.