Applied ML

Communication Primitives

The eight verbs distributed training compiles to

01 · First principlesWhy a vocabulary at all

The moment training spans more than one GPU, some tensor on one device is needed on another, and the interconnect — not the arithmetic — becomes a thing you must budget for. NVLink moves hundreds of GB/s; cross-node Ethernet often moves tens. A 7B model's gradients are about 14 GB in bf16; shipping them naively, every step, is the kind of bill that quietly halves your throughput.

Rather than reasoning about raw sends and receives, every framework (NCCL underneath PyTorch, the XLA collectives underneath JAX) speaks a fixed vocabulary of collectives: operations in which every rank participates at once. DDP, FSDP, tensor parallelism, and pipeline parallelism are, at the wire level, just different sentences built from these eight verbs.

02 · The vocabularyWhat moves, who ends up with what

PrimitiveWhat movesWho has what afterWhere ML uses it
broadcastOne rank's tensor to allEveryone has the same full copyInitial weight sync in DDP
scatterOne rank's tensor, split into N slicesEach rank has a different sliceDistributing data shards
gatherEach rank's slice to one rankOne rank has the concatenationCollecting eval metrics to rank 0
allgatherEach rank's slice to every rankEveryone has the full concatenationFSDP parameter materialisation
reduceEveryone's tensor, summed, to one rankOne rank has the sumRare in training loops
reduce-scatterEveryone's tensor, summed, slicedEach rank has a different 1/N of the sumFSDP / ZeRO gradient sync
allreduceEveryone's tensor, summed, to allEveryone has the full sumDDP gradient sync, TP activations
all-to-allSlice (i,j) goes from rank i to rank jEveryone holds one slice from everyoneMoE expert routing
Reading the table: the "all" prefix means every rank ends up with the result; without it, one root rank does. The pairs (gather, allgather) and (reduce, allreduce) differ only in that final fan-out.

03 · The identityAllreduce = reduce-scatter + allgather

One algebraic fact carries most of modern memory-efficient training. An allreduce can be performed in two phases: first a reduce-scatter, after which each rank holds a fully summed 1/N slice; then an allgather, which reassembles the full summed tensor on every rank.

allreduce(g)  =  reduce_scatter(g)  →  allgather(·)
each rank: summed slice each rank: full sum

This is not just an implementation detail; it is the seam that ZeRO cuts along. DDP runs both phases back to back inside one call. ZeRO stops after the first phase, lets each rank update only the slice it owns, and defers (or repurposes) the second phase. Sharded training therefore pays essentially the same communication volume DDP already paid — the redundancy was in the storage, not the wire.

04 · The mechanismRing allreduce

The naive allreduce — everyone sends their whole tensor to everyone — moves N× the data and melts at scale. The ring algorithm arranges the N ranks in a circle and splits the tensor into N chunks. In the reduce-scatter phase, each rank repeatedly sends one chunk to its right neighbour and adds the chunk arriving from its left; after N−1 steps every rank holds one fully summed chunk. The allgather phase circulates those finished chunks for another N−1 steps.

RING ALLREDUCE · N = 4 · ONE STEP OF THE REDUCE-SCATTER PHASE R0 R1 R2 R3 c0 c1 c2 c3 EACH RANK SENDS ONE CHUNK, ADDS THE ONE ARRIVING

Every link carries size/N bytes per step, all links busy simultaneously. After 2(N−1) steps the allreduce is done.

Count the traffic. Each rank sends (N−1) chunks of size/N in each phase, so total bytes sent per rank:

bytes/rank  =  2 · size · (N−1)/N  ≈  2 · size

The factor (N−1)/N approaches 1, so per-rank traffic is roughly 2× the tensor size regardless of how many GPUs participate. That near-independence from N is the entire reason data parallelism scales: adding ranks adds compute without adding per-rank gradient traffic. The cost that does grow with N is latency — 2(N−1) sequential steps — which is why rings hurt for small tensors and why tree and hierarchical variants exist for large clusters.

05 · Reading systemsParallelism strategies as sentences

Habit worth building: when a new distributed scheme appears, translate it into this vocabulary first. The memory and bandwidth bills follow mechanically from which verbs it uses, on which tensors, how often.
Mental Model