The rule is simple: what you can't profile you can't optimize. Do you want to squeeze more tokens per second out of an LLM, shave milliseconds off inference, or understand why your training loop runs slower than the spec sheet promises? At some point you have to open a profiler trace and read it.
But opening that trace often feels like walking into a wall of colored rectangles with intimidating names. This technical guide, the first in a series, lowers that entry ramp: we start with the simplest thing, a matrix multiply followed by a bias add, and learn to read what torch.profiler gives us.
Preparando el experimento y qué exporta el profiler
The function we use is minimal and pedagogical:
def fn(x, w, b): return torch.add(torch.matmul(x, w), b)
Annotating the region with torch.profiler.record_function('matmul_add') is optional but useful: it makes it easy to jump to that tag in traces. The typical profiling flow in the script is:
- definir la región a medir (aquí
fn) y envolverla enrecord_function. - ejecutar bajo
torch.profiler.profile(...)con actividades CPU y CUDA. - llamar
prof.key_averages().table(...)para la tabla estadística. - exportar la traza con
prof.export_chrome_trace(trace_path)para verla en Perfetto.
The profiler produces two complementary artifacts:
- La tabla del profiler: resumen estadístico que responde "qué se lleva la mayor parte del tiempo". Ideal para encontrar hotspots.
- La traza temporal (.json): vista tiempo-espacio de CPU vs GPU, con los kernels, huecos, y dependencias.
Ejemplo de comandos usados en la guía:
uv run 01_matmul_add.py --size 64
y para runs más grandes o con warmup/compile:
uv run 01_matmul_add.py --size 4096 --warmup --compile
Leer la tabla: Self vs Total y el síntoma de overhead
When you open the table you'll see columns like Self CPU time total and Self CUDA time total. Watch for this common signal:
- If
Self CPU time totalis in milliseconds andSelf CUDA time totalin microseconds, the typical pattern is overhead-bound: the CPU is doing more setup work than the GPU needs to compute.
In the size 64 example we observe something like:
- Self CPU time total: 2.314 ms
- Self CUDA time total: 23.104 us
That means the GPU is almost always idle: the small matmul runs fast on the GPU, but most time is spent launching kernels, allocating workspace, or running cuBLAS heuristics. Simple fixes: increase the work per call (bigger matrices or batching) or fuse operations.
With --size 4096 the balance changes:
- Self CPU time total: 4.908 ms
- Self CUDA time total: 4.495 ms
Both now in ms — we're in a compute-bound region where the GPU dominates: the usual goal when scaling to real models.
Leer la traza: CPU lane, GPU lane y las ventanas muertas
The trace shows two main lanes: CPU and GPU. Key elements to look for and what they mean:
-
A gap between the
record_function('...')entry and the firstaten::*is a "dead window". It can be workspace allocation, cuBLAS heuristics, or lazy loading. What to do? Warmup. -
The profiler scheduler controls which steps are recorded. A common schedule example:
torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)
wait skips initializations; warmup runs without recording; active is what you see in the trace.
-
If you see an offset of several ms between when the CPU submits work and when the GPU starts, you may see an entry called "Activity Buffer Request" on the GPU lane: the profiler referring to its event buffer or a memory request.
-
If one step is much wider than the others (the first ProfileStep, for example) it's cold-start: add warmup before measuring.
Señales en la traza y llamadas CUDA que importan
Look for these calls and keep their meaning in mind:
-
cudaOccupancyMaxActiveBlocksPerMultiprocessorbefore acudaLaunchKernel: it's a planning query. It usually appears for heavy kernels like GEMM or convolutions. It means cuBLAS or the runtime is choosing a heuristic variant. -
cudaLaunchKernelwithout the occupancy query: typically elementwise kernels or reductions with fixed, light footprints. -
cudaDeviceSynchronizeat the end of the active window: the profiler forces a sync to flush events. If that sync covers microseconds of GPU work but milliseconds of synchronization, it's another sign of overhead-bound behavior. -
cudaMemcpyAsyncthat you didn't write: often a Device-to-Device copy preparing memory before a GEMM with epilogue (for example bias add). Inductor or cuBLAS may perform that memcpy implicitly.
Variabilidad en tiempos de kernel
Even with the same kernel and the same data, you will see variability between invocations. Common causes:
- GPU clock boost after idle periods.
- Thermal throttling or power management.
- Driver housekeeping tasks.
The moral: don't trust just the mean. The trace shows the real distribution.
Qué cambia con torch.compile y qué no
torch.compile (for example with the default TorchInductor backend) captures execution fragments and generates code. General pipeline summary:
- TorchDynamo captures Python execution into an FX graph.
- AOTAutograd prepares forward/backward when gradients exist.
- Inductor lowers the graph to optimized CPU/GPU code.
In the trace you'll see rows like "Torch-Compiled Region" and calls to TorchDynamo Cache Lookup on each step: Dynamo checks if shapes and metadata match the cache. That check is paid on every call.
An important detail from the example:
- Inductor can fuse at the dispatcher level: your
torch.add(torch.matmul(x, w), b)can becomeaten::addmm(b, x, w). - However, that doesn't guarantee a single new GPU kernel. In the 4096x4096 bf16 case, Inductor chose to use cuBLAS and arranged the bias add as an epilogue of the GEMM, accompanied by a
Memcpy DtoDto prepare the output buffer.
Practical result: compile can increase the per-call CPU cost (Dynamo + AOTAutograd + Inductor) for very small ops. For models with dozens or hundreds of ops that overhead is amortized. For an isolated operation, it's a tax that eats time.
Diagnóstico rápido: señales y cómo interpretarlas
Use this as a checklist when you look at a trace:
-
Self CPU time total much greater than Self CUDA time total (CPU ms vs GPU us): overhead-bound. Make the work larger or fuse.
-
Self CPU ≈ Self CUDA in ms: compute-bound. GPU is the bottleneck.
-
One event dominates
cuda_time_total: start optimizing there. -
An event with many calls (# of Calls large): potential for batching or fusion.
-
cudaOccupancyMaxActiveBlocksPerMultiprocessorappears: adaptive heavy kernel (GEMM, conv). -
Unexpected
cudaMemcpyAsyncbefore a GEMM: DtoD copy for addmm epilogue. -
Activity Buffer Request on the GPU lane: the profiler or runtime is requesting/filling buffers and can explain offsets between lanes.
-
Variation in times of the same kernel across steps: read the full trace, not just the mean.
Recomendaciones prácticas rápidas
-
Start profiling with simple cases and watch how metrics change as you increase size and batching.
-
Use warmup (both outside and inside the profiler) to remove cold-start costs.
-
If the CPU dominates, try fusing operators or moving work to the GPU in larger chunks.
-
Inspect the full trace in Perfetto to find gaps and implicit memcpys; the table alone won't tell the whole story.
-
When using
torch.compile, inspect the CPU tree: Dynamo cache hits, AOT wrapper, and theCall CompiledFxGraphwill tell you how much CPU is being paid.
After finishing this first exercise with a matmul + bias you already have several lenses to read more complex traces: distinguish overhead from compute, identify cuBLAS planning, understand implicit copies, and see what torch.compile really does. In upcoming posts we'll raise the complexity: first nn.Linear and an MLP, then large models with transformers. The idea is to build intuition step by step so when you open a trace it's not terror but detective curiosity.
