Profiling in PyTorch: an initial guide to torch.profiler

May 29, 20266 minutes

The rule is simple: what you can't profile you can't optimize. Do you want to squeeze more tokens per second out of an LLM, shave milliseconds off inference, or understand why your training loop runs slower than the spec sheet promises? At some point you have to open a profiler trace and read it.

But opening that trace often feels like walking into a wall of colored rectangles with intimidating names. This technical guide, the first in a series, lowers that entry ramp: we start with the simplest thing, a matrix multiply followed by a bias add, and learn to read what torch.profiler gives us.

Preparando el experimento y qué exporta el profiler

The function we use is minimal and pedagogical:

def fn(x, w, b): return torch.add(torch.matmul(x, w), b)

Annotating the region with torch.profiler.record_function('matmul_add') is optional but useful: it makes it easy to jump to that tag in traces. The typical profiling flow in the script is:

definir la región a medir (aquí fn) y envolverla en record_function.
ejecutar bajo torch.profiler.profile(...) con actividades CPU y CUDA.
llamar prof.key_averages().table(...) para la tabla estadística.
exportar la traza con prof.export_chrome_trace(trace_path) para verla en Perfetto.

The profiler produces two complementary artifacts:

La tabla del profiler: resumen estadístico que responde "qué se lleva la mayor parte del tiempo". Ideal para encontrar hotspots.
La traza temporal (.json): vista tiempo-espacio de CPU vs GPU, con los kernels, huecos, y dependencias.

Ejemplo de comandos usados en la guía:

uv run 01_matmul_add.py --size 64

y para runs más grandes o con warmup/compile:

uv run 01_matmul_add.py --size 4096 --warmup --compile

Leer la tabla: Self vs Total y el síntoma de overhead

When you open the table you'll see columns like Self CPU time total and Self CUDA time total. Watch for this common signal:

If Self CPU time total is in milliseconds and Self CUDA time total in microseconds, the typical pattern is overhead-bound: the CPU is doing more setup work than the GPU needs to compute.

In the size 64 example we observe something like:

Self CPU time total: 2.314 ms
Self CUDA time total: 23.104 us

That means the GPU is almost always idle: the small matmul runs fast on the GPU, but most time is spent launching kernels, allocating workspace, or running cuBLAS heuristics. Simple fixes: increase the work per call (bigger matrices or batching) or fuse operations.

With --size 4096 the balance changes:

Self CPU time total: 4.908 ms
Self CUDA time total: 4.495 ms

Both now in ms — we're in a compute-bound region where the GPU dominates: the usual goal when scaling to real models.

Leer la traza: CPU lane, GPU lane y las ventanas muertas

The trace shows two main lanes: CPU and GPU. Key elements to look for and what they mean:

A gap between the record_function('...') entry and the first aten::* is a "dead window". It can be workspace allocation, cuBLAS heuristics, or lazy loading. What to do? Warmup.
The profiler scheduler controls which steps are recorded. A common schedule example:

torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)

wait skips initializations; warmup runs without recording; active is what you see in the trace.

If you see an offset of several ms between when the CPU submits work and when the GPU starts, you may see an entry called "Activity Buffer Request" on the GPU lane: the profiler referring to its event buffer or a memory request.
If one step is much wider than the others (the first ProfileStep, for example) it's cold-start: add warmup before measuring.

Señales en la traza y llamadas CUDA que importan

Look for these calls and keep their meaning in mind:

cudaOccupancyMaxActiveBlocksPerMultiprocessor before a cudaLaunchKernel: it's a planning query. It usually appears for heavy kernels like GEMM or convolutions. It means cuBLAS or the runtime is choosing a heuristic variant.
cudaLaunchKernel without the occupancy query: typically elementwise kernels or reductions with fixed, light footprints.
cudaDeviceSynchronize at the end of the active window: the profiler forces a sync to flush events. If that sync covers microseconds of GPU work but milliseconds of synchronization, it's another sign of overhead-bound behavior.
cudaMemcpyAsync that you didn't write: often a Device-to-Device copy preparing memory before a GEMM with epilogue (for example bias add). Inductor or cuBLAS may perform that memcpy implicitly.

Variabilidad en tiempos de kernel

Even with the same kernel and the same data, you will see variability between invocations. Common causes:

GPU clock boost after idle periods.
Thermal throttling or power management.
Driver housekeeping tasks.

The moral: don't trust just the mean. The trace shows the real distribution.

Qué cambia con `torch.compile` y qué no

torch.compile (for example with the default TorchInductor backend) captures execution fragments and generates code. General pipeline summary:

TorchDynamo captures Python execution into an FX graph.
AOTAutograd prepares forward/backward when gradients exist.
Inductor lowers the graph to optimized CPU/GPU code.

In the trace you'll see rows like "Torch-Compiled Region" and calls to TorchDynamo Cache Lookup on each step: Dynamo checks if shapes and metadata match the cache. That check is paid on every call.

An important detail from the example:

Inductor can fuse at the dispatcher level: your torch.add(torch.matmul(x, w), b) can become aten::addmm(b, x, w).
However, that doesn't guarantee a single new GPU kernel. In the 4096x4096 bf16 case, Inductor chose to use cuBLAS and arranged the bias add as an epilogue of the GEMM, accompanied by a Memcpy DtoD to prepare the output buffer.

Practical result: compile can increase the per-call CPU cost (Dynamo + AOTAutograd + Inductor) for very small ops. For models with dozens or hundreds of ops that overhead is amortized. For an isolated operation, it's a tax that eats time.

Diagnóstico rápido: señales y cómo interpretarlas

Use this as a checklist when you look at a trace:

Self CPU time total much greater than Self CUDA time total (CPU ms vs GPU us): overhead-bound. Make the work larger or fuse.
Self CPU ≈ Self CUDA in ms: compute-bound. GPU is the bottleneck.
One event dominates cuda_time_total: start optimizing there.
An event with many calls (# of Calls large): potential for batching or fusion.
cudaOccupancyMaxActiveBlocksPerMultiprocessor appears: adaptive heavy kernel (GEMM, conv).
Unexpected cudaMemcpyAsync before a GEMM: DtoD copy for addmm epilogue.
Activity Buffer Request on the GPU lane: the profiler or runtime is requesting/filling buffers and can explain offsets between lanes.
Variation in times of the same kernel across steps: read the full trace, not just the mean.

Recomendaciones prácticas rápidas

Start profiling with simple cases and watch how metrics change as you increase size and batching.
Use warmup (both outside and inside the profiler) to remove cold-start costs.
If the CPU dominates, try fusing operators or moving work to the GPU in larger chunks.
Inspect the full trace in Perfetto to find gaps and implicit memcpys; the table alone won't tell the whole story.
When using torch.compile, inspect the CPU tree: Dynamo cache hits, AOT wrapper, and the Call CompiledFxGraph will tell you how much CPU is being paid.

After finishing this first exercise with a matmul + bias you already have several lenses to read more complex traces: distinguish overhead from compute, identify cuBLAS planning, understand implicit copies, and see what torch.compile really does. In upcoming posts we'll raise the complexity: first nn.Linear and an MLP, then large models with transformers. The idea is to build intuition step by step so when you open a trace it's not terror but detective curiosity.

Fuente original

https://huggingface.co/blog/torch-profiler

Stay up to date!

Get AI news, tool launches, and innovative products straight to your inbox. Everything clear and useful.