Find the wall before you push on it
The standard failure is not slow code; it is a week spent making a fast part faster. Someone hand-fuses a kernel that was 2% of the step, while the data loader starves the GPU for 30% of every iteration. Intuition about where time goes in a GPU program is poor, partly because execution is asynchronous: the Python line that "takes long" is often just the first one forced to wait for a queue of earlier kernels. Measurement is not a refinement of optimisation; it is the precondition for it.
| Regime | The wall | Telltale signs | What actually helps |
|---|---|---|---|
| Compute-bound | FLOP/s of the chip | Big matmuls dominate the trace; tensor-core utilisation high; achieved TFLOPS near spec | Lower precision, better kernels, a smaller model. You are at the happy wall. |
| Bandwidth-bound | HBM GB/s | Trace full of short pointwise/norm kernels; SMs idle waiting on memory | Fusion (torch.compile), fewer/larger ops, 16-bit activations |
| Comms / overhead-bound | Interconnect, CPU, launch latency | Gaps between kernels; NCCL kernels on the critical path; GPU idle while Python or the dataloader works | Overlap comms with compute, no_sync, CUDA graphs / compile, more dataloader workers |
Which wall a kernel hits is predictable before you run it. Define arithmetic intensity (AI) as the work done per byte moved to and from memory:

$$\text{AI} = \frac{\text{FLOPs performed}}{\text{bytes moved to and from memory}}$$
The chip has a fixed ratio too: an A100 offers roughly 312 bf16 TFLOPS against roughly 2 TB/s of HBM, a ridge point near 150 FLOPs/byte. Kernels below that intensity cannot be compute-bound no matter how clever the code; the memory system simply cannot feed the ALUs fast enough.
Under the slanted roof, only moving fewer bytes helps. On the flat roof, only doing fewer FLOPs (or lower precision) helps.
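The roofline arithmetic fits in a few lines. The sketch below is illustrative only: the A100 figures are the rough ones quoted in the text, and the function names and the `matmul_intensity` traffic model (each operand read or written exactly once, 2 bytes per element) are assumptions made for this example, not measurements.

```python
# Rough A100 figures from the text; treat them as approximations.
A100_PEAK_FLOPS = 312e12   # bf16 FLOP/s
A100_HBM_BW = 2e12         # bytes/s


def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the two roofs meet."""
    return peak_flops / mem_bw


def attainable_flops(intensity: float, peak_flops: float, mem_bw: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * mem_bw)


def regime(intensity: float, peak_flops: float, mem_bw: float) -> str:
    """Classify a kernel by which roof limits it."""
    if intensity >= ridge_point(peak_flops, mem_bw):
        return "compute-bound"
    return "bandwidth-bound"


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """Idealised intensity of an (m x k) @ (k x n) matmul.

    Assumes each input is read once and the output written once,
    which real kernels only approach with good tiling.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    for name, ai in [("pointwise", 0.125), ("4096^3 matmul", matmul_intensity(4096, 4096, 4096))]:
        bound = attainable_flops(ai, A100_PEAK_FLOPS, A100_HBM_BW)
        print(name, regime(ai, A100_PEAK_FLOPS, A100_HBM_BW), f"{bound / 1e12:.2f} TFLOP/s")
```

Under these assumptions a large square matmul sits far to the right of the ridge point, while anything with an intensity below 1 sits deep under the slanted roof.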
This single picture explains the modern kernel agenda. A pointwise op does 1 FLOP per element while moving 8–12 bytes: an AI of about 0.1, hopelessly bandwidth-bound. At roughly 2 TB/s that caps it near 0.2 TFLOP/s, well under 1% of the 312 TFLOPS peak. A chain of such ops (bias, GeLU, dropout, residual) re-reads and re-writes the same tensor over and over. Fusion merges the chain into one kernel that reads once, computes everything in registers, and writes once. The FLOPs are unchanged and the kernel still finishes several times faster, because FLOPs were never the cost. FlashAttention is the same logic applied to attention's memory traffic.
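As a back-of-envelope check on the fusion argument, the sketch below counts HBM traffic for a chain of pointwise ops. It assumes a deliberately simplified, hypothetical model in which every op reads one tensor and writes one tensor of the same size; extra inputs such as the bias vector, the residual tensor, or a dropout mask are ignored, and the function names are invented for this example.

```python
def chain_traffic_bytes(num_elems: int, num_ops: int,
                        bytes_per_elem: int = 2, fused: bool = False) -> int:
    """Bytes moved to and from HBM by a chain of pointwise ops.

    Unfused: every op does one full read and one full write.
    Fused: the whole chain does a single read and a single write.
    """
    passes = 1 if fused else num_ops
    return passes * 2 * num_elems * bytes_per_elem


def bandwidth_bound_time_s(traffic_bytes: int, mem_bw: float) -> float:
    """Lower bound on runtime when memory traffic is the only cost."""
    return traffic_bytes / mem_bw


if __name__ == "__main__":
    n = 64 * 1024 * 1024        # elements in the activation tensor (arbitrary)
    ops = 4                     # bias, GeLU, dropout, residual
    bw = 2e12                   # rough A100 HBM bandwidth from the text
    unfused = chain_traffic_bytes(n, ops)
    fused = chain_traffic_bytes(n, ops, fused=True)
    print("unfused bytes:", unfused, "fused bytes:", fused)
    print("traffic ratio:", unfused / fused)
    print("unfused time bound (s):", bandwidth_bound_time_s(unfused, bw))
    print("fused time bound (s):", bandwidth_bound_time_s(fused, bw))
```

In this model the traffic ratio is simply the number of ops in the chain, which is where the "several times faster" figure comes from; real speedups depend on the extra inputs and on how close each kernel gets to peak bandwidth.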
| Tool | What it shows | Reach for it when |
|---|---|---|
| Timer + torch.cuda.synchronize() | Wall time of a region, honestly | Always first; one number, no setup. Without the synchronize you are timing kernel launches, not kernels |
| torch.profiler | Per-op CPU and CUDA time, exportable Chrome trace, stacks | Attributing a step's time to ops; spotting gaps and launch overhead in the timeline |
| Nsight Systems / Compute | Whole-system timeline (CPU, GPU, NCCL, dataloader); per-kernel hardware counters | Cross-process and comms problems; confirming a specific kernel's achieved bandwidth or occupancy |
| torch.cuda.memory._record_memory_history snapshot | Every allocation with stack traces, as an interactive timeline | OOMs and mystery memory growth; seeing what actually peaks (usually activations — see checkpointing) |
Two habits prevent most measurement lies: discard the first iterations (compile, autotune, and allocator warmup pollute them — see JIT), and measure several steps of the real workload rather than a microbenchmark with the dataloader removed and caches hot.
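Both habits, plus the synchronize rule from the table, can be folded into one small timing helper. This is a minimal sketch, not a standard API: `time_steps`, `step_fn`, and `sync` are names made up for this example. The `sync` hook defaults to a no-op so the helper runs anywhere; on a GPU you would pass `torch.cuda.synchronize`.

```python
import time
from typing import Callable


def time_steps(step_fn: Callable[[], None],
               sync: Callable[[], None] = lambda: None,
               warmup: int = 3,
               iters: int = 10) -> float:
    """Return mean wall-clock seconds per step over `iters` steps.

    Warmup steps run first and are discarded, so compile, autotune,
    and allocator warmup do not pollute the measurement.
    """
    for _ in range(warmup):
        step_fn()
    sync()  # drain any queued work before starting the clock

    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    sync()  # wait for the work itself, not just for the launches
    elapsed = time.perf_counter() - start
    return elapsed / iters


if __name__ == "__main__":
    # Stand-in CPU workload; replace with one real training step.
    def fake_step() -> None:
        sum(i * i for i in range(100_000))

    print(f"{time_steps(fake_step) * 1e3:.3f} ms per step")
```

Timing the whole loop with one synchronize at the end, rather than synchronizing after every step, leaves the CPU free to run ahead of the GPU as it does in the real workload.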
For training, the cleanest top-line metric is MFU (model FLOPs utilisation): the model's theoretical FLOPs per step divided by step time, as a fraction of the chip's peak. Large transformer runs commonly land around 35–50%; well below that, the roofline says you are paying one of the other two walls, and the trace tells you which. Throughput in tokens/s is what you ship; MFU is what tells you how much is left on the table.
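The MFU definition translates directly into code. The sketch below is an illustration under stated assumptions: the function names are invented here, and `transformer_train_flops` uses the common 6 × parameters × tokens approximation for a dense transformer's forward and backward pass, which ignores attention FLOPs and any recomputation from checkpointing.

```python
def transformer_train_flops(num_params: float, tokens_per_step: float) -> float:
    """Approximate model FLOPs for one training step.

    Assumption: ~2 FLOPs per parameter per token forward and ~4 backward,
    attention and recomputation ignored.
    """
    return 6.0 * num_params * tokens_per_step


def mfu(model_flops_per_step: float, step_time_s: float,
        peak_flops_per_device: float, num_devices: int = 1) -> float:
    """Model FLOPs utilisation: achieved model FLOP/s over aggregate peak."""
    achieved = model_flops_per_step / step_time_s
    return achieved / (peak_flops_per_device * num_devices)


if __name__ == "__main__":
    # Hypothetical run: 1.3B parameters, 16,384 tokens per step,
    # 1.0 s per step on one device with a 312 TFLOP/s peak.
    flops = transformer_train_flops(1.3e9, 16_384)
    print(f"MFU: {mfu(flops, 1.0, 312e12):.1%}")
```

Measure `step_time_s` on the real workload, with warmup discarded, so that dataloader and communication stalls count against the result.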