Applied ML

Profiling

Find the wall before you push on it

01 · First principles · Never optimise unmeasured

The standard failure is not slow code; it is a week spent making a fast part faster. Someone hand-fuses a kernel that was 2% of the step, while the data loader starves the GPU for 30% of every iteration. Intuition about where time goes in a GPU program is poor, partly because execution is asynchronous: the Python line that "takes long" is often just the first one forced to wait for a queue of earlier kernels. Measurement is not a refinement of optimisation; it is the precondition for it.

The discipline: measure → attribute the time to one of three walls → apply only the fixes that move that wall → measure again. Every fix in the systems-ML toolbox addresses exactly one wall, which is why misdiagnosis wastes the whole effort.

02 · The map · Three regimes

Regime	The wall	Telltale signs	What actually helps
Compute-bound	FLOP/s of the chip	Big matmuls dominate the trace; tensor-core utilisation high; achieved TFLOPS near spec	Lower precision, better kernels, a smaller model. You are at the happy wall.
Bandwidth-bound	HBM GB/s	Trace full of short pointwise/norm kernels; SMs idle waiting on memory	Fusion (torch.compile), fewer/larger ops, 16-bit activations
Comms / overhead-bound	Interconnect, CPU, launch latency	Gaps between kernels; NCCL kernels on the critical path; GPU idle while Python or the dataloader works	Overlap comms with compute, DDP no_sync during gradient accumulation, CUDA graphs / compile, more dataloader workers

03 · The model · Arithmetic intensity and the roofline

Which wall a kernel hits is predictable before you run it. Define arithmetic intensity as the work done per byte moved to and from memory:

AI = FLOPs / bytes moved
attainable FLOP/s = min(peak FLOP/s, AI × bandwidth)

The chip has a fixed ratio too: an A100 offers roughly 312 bf16 TFLOPS against roughly 2 TB/s of HBM, a ridge point near 150 FLOPs/byte. Kernels below that intensity cannot be compute-bound no matter how clever the code; the memory system simply cannot feed the ALUs fast enough.

Under the slanted roof, only moving fewer bytes helps. On the flat roof, only doing fewer FLOPs (or lower precision) helps.
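The roofline is small enough to evaluate by hand. The sketch below does so with the rough A100 figures from the text (312 bf16 TFLOPS, 2 TB/s); the function names and the sample intensities are illustrative assumptions, not measurements of any real kernel.

```python
# Roofline sketch. PEAK_FLOPS and BANDWIDTH are the approximate A100 figures
# quoted in the text; every name here is illustrative.

PEAK_FLOPS = 312e12   # ~312 bf16 TFLOP/s
BANDWIDTH = 2e12      # ~2 TB/s of HBM


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the slanted and flat roofs meet."""
    return peak_flops / bandwidth


def attainable_flops(ai: float,
                     peak_flops: float = PEAK_FLOPS,
                     bandwidth: float = BANDWIDTH) -> float:
    """attainable FLOP/s = min(peak FLOP/s, AI x bandwidth)."""
    return min(peak_flops, ai * bandwidth)


def regime(ai: float,
           peak_flops: float = PEAK_FLOPS,
           bandwidth: float = BANDWIDTH) -> str:
    """Which roof a kernel of this intensity sits under."""
    if ai >= ridge_point(peak_flops, bandwidth):
        return "compute-bound"
    return "bandwidth-bound"


if __name__ == "__main__":
    # Sample intensities are made up to span both sides of the ridge.
    for ai in (0.125, 10.0, 500.0):
        frac = attainable_flops(ai) / PEAK_FLOPS
        print(f"AI={ai:>7}: {regime(ai)}, at most {frac:.2%} of peak")
```

Swapping in another chip's peak and bandwidth moves the ridge, which is why the same kernel can sit under different roofs on different hardware.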

This single picture explains the modern kernel agenda. A pointwise op does 1 FLOP per element while moving 8–12 bytes: AI around 0.1, hopelessly bandwidth-bound, and by the roofline above limited to well under 1% of the chip's peak FLOP/s. A chain of such ops (bias, GeLU, dropout, residual) re-reads and re-writes the same tensor over and over. Fusion merges the chain into one kernel that reads once, computes everything in registers, and writes once. The FLOPs are unchanged and the kernel still finishes several times faster, because FLOPs were never the cost. FlashAttention is the same logic applied to attention's memory traffic.
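The byte accounting behind fusion can be written out directly. This is a back-of-envelope sketch under loud simplifications: every op in the chain is treated as unary (one tensor read, one tensor written), extra operands such as the residual input or the dropout mask are ignored, and the kernel is assumed to run at the full 2 TB/s. The tensor shape and function names are hypothetical.

```python
# Back-of-envelope HBM traffic for a chain of pointwise ops.
# Simplification (assumption): each op reads one tensor and writes one tensor.

def chain_bytes(num_elements: int, num_ops: int,
                bytes_per_element: int, fused: bool) -> int:
    """Bytes moved to and from HBM by the whole chain."""
    # Unfused: every op makes its own round trip through memory.
    # Fused: one read and one write for the whole chain.
    passes = 1 if fused else num_ops
    return passes * 2 * num_elements * bytes_per_element


def bandwidth_time_s(bytes_moved: float, bandwidth: float = 2e12) -> float:
    """Lower bound on runtime if the kernel is purely bandwidth-bound."""
    return bytes_moved / bandwidth


if __name__ == "__main__":
    n = 8 * 2048 * 4096  # hypothetical batch x sequence x hidden activation
    for fused in (False, True):
        b = chain_bytes(n, num_ops=4, bytes_per_element=2, fused=fused)
        label = "fused" if fused else "unfused"
        print(f"{label:>8}: {b / 1e6:.0f} MB, >= {bandwidth_time_s(b) * 1e3:.3f} ms")
```

Under these assumptions the speedup from fusing a chain equals the number of ops in it, since the FLOP count never enters the estimate.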

04 · The tools · From humble timer to full trace

Tool	What it shows	Reach for it when
Timer + `torch.cuda.synchronize()`	Wall time of a region, honestly	Always first; one number, no setup. Without the synchronize you are timing kernel launches, not kernels
`torch.profiler`	Per-op CPU and CUDA time, exportable Chrome trace, stacks	Attributing a step's time to ops; spotting gaps and launch overhead in the timeline
Nsight Systems / Compute	Whole-system timeline (CPU, GPU, NCCL, dataloader); per-kernel hardware counters	Cross-process and comms problems; confirming a specific kernel's achieved bandwidth or occupancy
`torch.cuda.memory._record_memory_history` snapshot	Every allocation with stack traces, as an interactive timeline	OOMs and mystery memory growth; seeing what actually peaks (usually activations — see checkpointing)

Two habits prevent most measurement lies: discard the first iterations (compile, autotune, and allocator warmup pollute them — see JIT), and measure several steps of the real workload rather than a microbenchmark with the dataloader removed and caches hot.
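Both habits fit in a small timing helper. The sketch below is one possible shape, not a standard utility: `time_steps` and its parameters are hypothetical names, and the device barrier is passed in as a callable so the helper itself needs only the standard library.

```python
import statistics
import time
from typing import Callable, List, Optional


def time_steps(step: Callable[[], object],
               warmup: int = 3,
               iters: int = 10,
               sync: Optional[Callable[[], None]] = None) -> List[float]:
    """Per-step wall times in seconds, with warmup steps discarded.

    `sync` should block until queued device work has finished. With
    sync=None only host-side time is measured.
    """
    def barrier() -> None:
        if sync is not None:
            sync()

    # Warmup: compile, autotune and allocator effects land here and are dropped.
    for _ in range(warmup):
        step()

    times = []
    for _ in range(iters):
        barrier()                      # drain earlier work so it is not billed to this step
        start = time.perf_counter()
        step()
        barrier()                      # wait for this step's own work to finish
        times.append(time.perf_counter() - start)
    return times


if __name__ == "__main__":
    # CPU stand-in for a training step, only to show the call shape.
    samples = time_steps(lambda: sum(range(100_000)), warmup=2, iters=5)
    print(f"median {statistics.median(samples) * 1e3:.3f} ms over {len(samples)} steps")
```

On a CUDA machine the intended call would be `time_steps(train_step, sync=torch.cuda.synchronize)`, where `train_step` is a placeholder for one full iteration of the real workload, dataloader included. Reporting the median rather than the mean is a design choice here, made so that one slow step does not dominate the figure.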

05 · Reading results · The one number worth reporting

For training, the cleanest top-line metric is MFU (model FLOPs utilisation): the model's theoretical FLOPs per step divided by step time, as a fraction of the chip's peak. Large transformer runs commonly land around 35–50%; well below that, the roofline says you are paying one of the other two walls, and the trace tells you which. Throughput in tokens/s is what you ship; MFU is what tells you how much is left on the table.
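The MFU definition translates directly into a few lines. In the sketch below, the model FLOPs come from the widely used approximation of 6 × parameters × tokens for a forward and backward pass; that approximation is an assumption brought in for the example (it ignores attention FLOPs), and the model size, batch size, device count and step time are all invented.

```python
# MFU sketch. All names and example numbers are illustrative.

def transformer_train_flops(num_params: float, tokens_per_step: float) -> float:
    """Approximate training FLOPs per step as 6 * params * tokens.

    Assumption: the common rule of thumb for forward plus backward,
    ignoring attention FLOPs. Replace with an exact count if you have one.
    """
    return 6.0 * num_params * tokens_per_step


def mfu(model_flops_per_step: float, step_time_s: float,
        peak_flops_per_device: float, num_devices: int = 1) -> float:
    """Model FLOPs per second as a fraction of aggregate peak FLOP/s."""
    achieved = model_flops_per_step / step_time_s
    return achieved / (peak_flops_per_device * num_devices)


if __name__ == "__main__":
    # Hypothetical run: 7e9 parameters, 4e6 tokens per step, 64 devices, 20 s steps.
    flops = transformer_train_flops(7e9, 4e6)
    print(f"MFU = {mfu(flops, 20.0, 312e12, num_devices=64):.1%}")
```

Note that the numerator counts only the FLOPs the model mathematically requires, so recomputation from activation checkpointing lowers MFU rather than inflating it.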

Order of inspection: step time stable? → GPU busy (gaps = overhead/comms)? → busy time in matmuls (else bandwidth)? → matmuls near peak (else kernel/precision)? Four questions, asked in order, classify almost every slow training job.
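The four questions can be written as a short decision function. This is a sketch only: the `StepStats` fields are hypothetical summaries you would extract from a trace yourself, and every threshold is an invented placeholder rather than a recommended cut-off.

```python
from dataclasses import dataclass


@dataclass
class StepStats:
    """Hypothetical per-run summary pulled from a profiler trace."""
    step_time_cv: float      # std / mean of step time across steps
    gpu_busy_frac: float     # fraction of the step with a kernel running
    matmul_frac: float       # fraction of busy time spent in matmul kernels
    matmul_peak_frac: float  # matmul achieved FLOP/s over chip peak


def diagnose(s: StepStats,
             max_cv: float = 0.10,          # placeholder threshold
             min_busy: float = 0.90,        # placeholder threshold
             min_matmul: float = 0.60,      # placeholder threshold
             min_peak: float = 0.70) -> str:  # placeholder threshold
    """Ask the four questions in order and stop at the first 'no'."""
    if s.step_time_cv > max_cv:
        return "unstable step time"
    if s.gpu_busy_frac < min_busy:
        return "overhead/comms-bound"
    if s.matmul_frac < min_matmul:
        return "bandwidth-bound"
    if s.matmul_peak_frac < min_peak:
        return "kernel/precision"
    return "compute-bound"


if __name__ == "__main__":
    # Invented example: stable steps, but the GPU idles 40% of the time.
    print(diagnose(StepStats(0.02, 0.60, 0.80, 0.80)))
```

The early returns mirror the ordering in the text: a later question is only meaningful once the earlier ones have passed.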

Mental Model

Time goes to one of three walls: FLOPs, memory bandwidth, or overhead/comms — and every optimisation targets exactly one.
Arithmetic intensity decides the wall in advance; the ridge sits near 150 FLOPs/byte on an A100.
Pointwise ops live at AI < 1: bandwidth-bound by construction, which is why fusion (same FLOPs, fewer bytes) is the dominant kernel optimisation.
Async execution lies to timers; synchronize before trusting any number, and skip warmup steps.
MFU is the honesty metric: ~40% is normal for big transformers, and any shortfall below that is your to-do list.