Asynchrony in continuous batching optimizes LLM inference
The CPU and GPU often end up waiting on each other, and that eats time and money. Can you imagine paying $140 a day for an H200 and seeing the GPU idle a quarter of the time because it’s waiting for the CPU? Here I explain how to separate CPU and GPU work so both run in parallel and you squeeze more performance out of large-model inference.
Why this matters
If you run inference at scale, say on H200 endpoints, every minute counts. Continuous batching already improves utilization because it reduces padding and groups requests efficiently. But the next bottleneck is synchrony: CPU and GPU take turns, and in loops with hundreds of steps per second those gaps add up.
In an experiment with an 8B model, batch size 32 and 8K tokens, the synchronous cycle took 300.6 s and the GPU sat idle 24% of the time. In the asynchronous version the GPU was active 99.4% of the time and total time dropped to 234.5 s. The result? A real 22% improvement in generation time, without touching models or kernels.
Central idea: untangle CPU and GPU
The idea is simple to state and a bit trickier in practice: prepare batch N+1 on the CPU while the GPU computes batch N. Why isn’t this the default? Because there are three key challenges:
Regaining control of the CPU after launching work on the GPU.
Ensuring data is ready when each operation starts.
Building batch N+1 if it depends on predictions produced by batch N.
To solve them we use CUDA streams, events, slot buffers and an operation called carry-over. The next sections walk through each piece step by step.
Streams and events: the technical base
A CUDA stream is an ordered queue of GPU operations. Operations within a stream run sequentially; operations in different streams can run concurrently. The usual problem is PyTorch's default stream: it is a synchronizing stream, so work issued there serializes with the other streams and the CPU ends up looking blocked until the GPU has finished everything. That kills any attempt to parallelize.
The recipe is to use non-default streams: one for host-to-device copies (H2D), one for compute, and one for device-to-host copies (D2H). But because streams don't wait for each other automatically, we introduce CUDA events.
An event is recorded on a stream with event.record(stream) (or stream.record_event(event) in PyTorch), and another stream calls stream.wait_event(event) so it won't run later work until the event is marked. Important: wait_event blocks the stream on the GPU, not the CPU. That lets the CPU keep enqueueing work without waiting.
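As a minimal PyTorch sketch (the tensor shapes and the multiplication standing in for a forward pass are illustrative, not part of any real implementation), the record/wait pattern looks like this:

```python
import torch

# Non-default streams: copies and compute can overlap, and neither blocks the CPU.
h2d_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
h2d_done = torch.cuda.Event()

# Pinned host memory so the H2D copy can be truly asynchronous.
host_batch = torch.randint(0, 32000, (32, 128)).pin_memory()
device_batch = torch.empty_like(host_batch, device="cuda")

with torch.cuda.stream(h2d_stream):
    device_batch.copy_(host_batch, non_blocking=True)  # enqueued, returns immediately
    h2d_done.record(h2d_stream)                        # marks "copy done" on the GPU timeline

with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(h2d_done)  # GPU-side wait: the CPU does not stop here
    result = device_batch * 2            # stand-in for the forward pass

# The CPU reaches this point right away and only blocks when it needs the data.
torch.cuda.synchronize()
```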
Pipeline H2D -> compute -> D2H with events
The per-step pattern is:
CPU prepares inputs on host (no stream).
Enqueue the H2D copy on h2d_stream and record h2d_done.
compute_stream waits on h2d_done (wait_event) and enqueues the forward pass.
Record compute_done on compute_stream.
d2h_stream waits on compute_done (wait_event) and enqueues the D2H copy.
Finally, the CPU calls d2h_done.synchronize() to read the results.
This way the CPU doesn't block for most of the cycle; there's only one final synchronization per batch, when the outputs reach the host.
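Here is a hedged sketch of one step of this pipeline in PyTorch; the model, the buffers and their shapes are placeholders, and only the stream/event wiring reflects the pattern above:

```python
import torch

h2d_stream, compute_stream, d2h_stream = (torch.cuda.Stream() for _ in range(3))
h2d_done, compute_done, d2h_done = (torch.cuda.Event() for _ in range(3))

@torch.no_grad()
def run_step(model, host_inputs, device_inputs, device_outputs, host_outputs):
    # Step 1 (preparing host_inputs on the CPU) happens before this call.

    # Step 2: enqueue the H2D copy and record h2d_done.
    with torch.cuda.stream(h2d_stream):
        device_inputs.copy_(host_inputs, non_blocking=True)
        h2d_done.record(h2d_stream)

    # Steps 3-4: compute waits for the copy (on the GPU), runs the forward,
    # and records compute_done.
    with torch.cuda.stream(compute_stream):
        compute_stream.wait_event(h2d_done)
        device_outputs.copy_(model(device_inputs))
        compute_done.record(compute_stream)

    # Step 5: D2H waits for the forward, then copies results to pinned host memory.
    with torch.cuda.stream(d2h_stream):
        d2h_stream.wait_event(compute_done)
        host_outputs.copy_(device_outputs, non_blocking=True)
        d2h_done.record(d2h_stream)

    # Step 6: the only CPU-side wait, until the D2H copy has landed on the host.
    d2h_done.synchronize()
    return host_outputs
```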
Avoiding data corruption: double slot and memory pool
If you reuse the same device buffer for batch N and N+1 and the CPU starts overwriting it, the GPU can read partial data. The practical solution:
Have two buffer slots (A and B) and alternate: while the GPU processes A, the CPU prepares B.
This duplicates buffers, but prevents races.
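A minimal sketch of what the two slots could look like, assuming a fixed batch size and sequence length (names and shapes are illustrative):

```python
import torch

# Each slot owns its own pinned host buffers and device buffers, so the CPU can
# fill one slot while the GPU is still reading from the other.
def make_slot(batch_size: int, seq_len: int, device: str = "cuda") -> dict:
    return {
        "host_ids": torch.zeros(batch_size, seq_len, dtype=torch.long).pin_memory(),
        "dev_ids": torch.zeros(batch_size, seq_len, dtype=torch.long, device=device),
        "host_out": torch.zeros(batch_size, 1, dtype=torch.long).pin_memory(),
        "dev_out": torch.zeros(batch_size, 1, dtype=torch.long, device=device),
    }

slots = [make_slot(32, 8192), make_slot(32, 8192)]  # slot A, slot B

def slot_for_step(step: int) -> dict:
    # Even steps use slot A, odd steps use slot B: while the GPU consumes one,
    # the CPU prepares the other.
    return slots[step % 2]
```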
Next problem: if you use CUDA graphs (very useful for latency), each capture is tied to fixed memory addresses. With two slots you'd need two graphs, and if each graph allocates its own memory, VRAM usage grows. The workaround is a shared memory pool: both graphs can use the same pool as long as they never execute in parallel. In practice peak VRAM usage is close to the maximum of the two graphs, not their sum.
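A sketch of the shared-pool idea using PyTorch's CUDA graph API; the Linear layer and static tensors stand in for the real model and slot buffers:

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_in = [torch.zeros(32, 4096, device="cuda") for _ in range(2)]
static_out = [torch.zeros(32, 4096, device="cuda") for _ in range(2)]

# Warm-up on a side stream, as recommended before graph capture.
warmup_stream = torch.cuda.Stream()
warmup_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(warmup_stream), torch.no_grad():
    for slot in range(2):
        static_out[slot].copy_(model(static_in[slot]))
torch.cuda.current_stream().wait_stream(warmup_stream)

# One memory pool shared by both captures: peak VRAM tracks the larger graph,
# not the sum of both.
pool = torch.cuda.graph_pool_handle()
graphs = []
for slot in range(2):
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g, pool=pool), torch.no_grad():
        static_out[slot].copy_(model(static_in[slot]))
    graphs.append(g)

# Replay for a given slot: refresh its static input in place, then replay its graph.
static_in[0].normal_()
graphs[0].replay()
```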
Carry-over: passing tokens from N to N+1
If a request appears in both batch N and N+1, the token produced by N must be input for N+1. But when you prepare N+1 you don’t yet have that token. The solution is:
Prepare N+1 with a token placeholder (for example 0).
After N finishes and before N+1’s forward, perform the carry-over operation.
Carry-over uses a mask that indicates, by position, where to copy produced tokens. Essentially:
Select tokens to carry from outputs of N.
Zero the positions that don’t apply.
Truncate/pad to N+1’s size.
Add the resulting tensor to N+1’s input ids (placeholders were zero).
All these ops are cheap and are captured inside the CUDA graph, so they add no overhead on the critical path.
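A hedged sketch of a carry-over implementation, assuming a positional mapping between the two batches and one produced token per row; tensor names are illustrative and the actual code in transformers may differ:

```python
import torch
import torch.nn.functional as F

def carry_over(prev_tokens: torch.Tensor,      # [batch_n] tokens produced by batch N
               carry_mask: torch.Tensor,       # [batch_n] True where the request continues
               next_input_ids: torch.Tensor):  # [batch_n1] inputs of N+1, placeholders are 0
    # 1) select the tokens to carry, 2) zero the positions that don't apply
    carried = torch.where(carry_mask, prev_tokens, torch.zeros_like(prev_tokens))
    # 3) truncate or pad to the size of batch N+1
    n1 = next_input_ids.shape[0]
    if carried.shape[0] >= n1:
        carried = carried[:n1]
    else:
        carried = F.pad(carried, (0, n1 - carried.shape[0]))
    # 4) add onto the placeholders (which are zero), leaving fresh requests untouched
    next_input_ids.add_(carried)
    return next_input_ids
```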
The full asynchronous loop
Typical sequence:
Step 0: cold start. CPU prepares batch 0 in slot A and dispatches it.
Step 1: GPU processes batch 0 in A; CPU prepares batch 1 in B (evict/admit, update KV cache routing, build carry-over mask).
CPU enqueues H2D(B), records and chains events, and continues.
GPU runs the pipeline: D2H of A, H2D of B, compute B when H2D(B) completes.
CPU synchronizes with d2h_done of A, processes outputs, updates state, builds batch 2 in A, and the cycle continues.
As long as N+1's input is ready on the device when N finishes, the GPU never goes idle between batches. Usually the GPU remains the bottleneck, so the CPU finishes its work before the GPU does, which makes the overlap effective.
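Putting the pieces together, here is a self-contained sketch of the loop, with a single matmul standing in for the model and random data standing in for the CPU-side batch preparation; the slot layout and helper names are illustrative:

```python
import torch

device = torch.device("cuda")
h2d_stream, compute_stream, d2h_stream = (torch.cuda.Stream() for _ in range(3))

def make_slot(batch=32, hidden=4096):
    return {
        "host_in": torch.zeros(batch, hidden).pin_memory(),
        "dev_in": torch.zeros(batch, hidden, device=device),
        "dev_out": torch.zeros(batch, hidden, device=device),
        "host_out": torch.zeros(batch, hidden).pin_memory(),
        "h2d_done": torch.cuda.Event(),
        "compute_done": torch.cuda.Event(),
        "d2h_done": torch.cuda.Event(),
    }

slots = [make_slot(), make_slot()]
weight = torch.randn(4096, 4096, device=device)  # stand-in for the model

def dispatch(slot):
    """Enqueue H2D -> compute -> D2H for one slot; returns immediately on the CPU."""
    with torch.cuda.stream(h2d_stream):
        slot["dev_in"].copy_(slot["host_in"], non_blocking=True)
        slot["h2d_done"].record(h2d_stream)
    with torch.cuda.stream(compute_stream):
        compute_stream.wait_event(slot["h2d_done"])
        slot["dev_out"].copy_(slot["dev_in"] @ weight)
        slot["compute_done"].record(compute_stream)
    with torch.cuda.stream(d2h_stream):
        d2h_stream.wait_event(slot["compute_done"])
        slot["host_out"].copy_(slot["dev_out"], non_blocking=True)
        slot["d2h_done"].record(d2h_stream)

num_steps = 8
slots[0]["host_in"].normal_()       # step 0: cold start, prepare batch 0 in slot A
dispatch(slots[0])
for step in range(1, num_steps):
    nxt, cur = slots[step % 2], slots[(step - 1) % 2]
    nxt["host_in"].normal_()        # CPU prepares batch N+1 while the GPU runs batch N
    dispatch(nxt)                   # enqueue batch N+1; its compute waits on its own H2D
    cur["d2h_done"].synchronize()   # block only for batch N's outputs
    _ = cur["host_out"].sum()       # stand-in for processing outputs and updating state
```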
Practical results and numbers
In the reported experiment:
Synchronous: 300.6 s total, GPU active 76.0%, GPU idle 24.0% of the time.
Asynchronous: 234.5 s total, GPU active 99.4%, 22% improvement in time.
No model or kernel changes. Coordinating streams, events, double buffering and carry-over was enough.
Implementation and production recommendations
Check the implementation in transformers (entry point: continuous_batching.py, async code in ContinuousBatchingAsyncIOs).
Main pieces to implement:
Separate H2D, compute, D2H into non-default streams.
Use events for dependencies between streams.
Double buffer slots and a memory pool for CUDA graphs.
Capture carry-over in the graphs.
Watch VRAM consumption and the tradeoff between lower latency and higher complexity (more objects to keep synchronized). For long-generation workloads (16K+ tokens) and RL scenarios the gain is especially relevant.
If your pipeline uses different frameworks or older drivers, validate that stream and event behavior is the same: default stream behavior details can vary.
Reflective conclusion
Decoupling CPU preparation from GPU execution changes the equation: it’s no longer just about grouping requests but about coordinating hardware to work in parallel. With CUDA streams, events, slot buffers and carry-over you can turn GPU idle time into useful work without touching models. It’s a small architectural change with big impact on cost and performance.