Asynchrony in continuous batching optimizes LLM inference
The CPU and GPU often end up waiting on each other, and that eats time and money. Can you imagine paying $140 a day for an H200 and seeing the GPU idle a quarter of the time because it’s waiting for the CPU? Here I explain how to separate CPU and GPU work so both run in parallel and you squeeze more performance out of large-model inference.
Why this matters
If you run inference at scale, say on H200 endpoints, every minute counts. Continuous batching already improves utilization because it reduces padding and groups requests efficiently. But the next bottleneck is synchrony: CPU and GPU take turns, and in loops with hundreds of steps per second those gaps add up.
In an experiment with an 8B model, batch size 32 and 8K tokens, the synchronous cycle took 300.6 s and the GPU sat idle 24% of the time. In the asynchronous version the GPU was active 99.4% of the time and total time dropped to 234.5 s. The result? A real 22% improvement in generation time, without touching models or kernels.
Central idea: untangle CPU and GPU
The idea is simple to state and a bit trickier in practice: prepare batch N+1 on the CPU while the GPU computes batch N. Why isn’t this the default? Because there are three key challenges:
Regaining control of the CPU after launching work on the GPU.
Ensuring data is ready when each operation starts.
Building batch N+1 if it depends on predictions produced by batch N.
To solve them we use CUDA streams, events, slot buffers and an operation called carry-over. The next sections walk through each piece step by step.
Streams and events: the technical base
A CUDA stream is an ordered queue of GPU operations. Operations within a stream run sequentially; operations in different streams can run concurrently. The usual problem is PyTorch's default stream: it is a synchronizing stream, so work issued there serializes with the other streams and the CPU ends up looking blocked until the GPU has finished everything. That kills any attempt to parallelize.
The recipe is to use non-default streams: one for host-to-device copies (H2D), one for compute, and one for device-to-host copies (D2H). But because streams don't wait for each other automatically, we introduce CUDA events.
An event is recorded on a stream with event.record(stream) (or stream.record_event(event) in PyTorch), and another stream calls stream.wait_event(event) so it won't run later work until the event is marked. Important: wait_event blocks the stream on the GPU, not the CPU. That lets the CPU keep enqueueing work without waiting.
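As a minimal PyTorch sketch (the tensor shapes and the multiplication standing in for a forward pass are illustrative, not part of any real implementation), the record/wait pattern looks like this:

```python
import torch

# Non-default streams: copies and compute can overlap, and neither blocks the CPU.
h2d_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
h2d_done = torch.cuda.Event()

# Pinned host memory so the H2D copy can be truly asynchronous.
host_batch = torch.randint(0, 32000, (32, 128)).pin_memory()
device_batch = torch.empty_like(host_batch, device="cuda")

with torch.cuda.stream(h2d_stream):
    device_batch.copy_(host_batch, non_blocking=True)  # enqueued, returns immediately
    h2d_done.record(h2d_stream)                        # marks "copy done" on the GPU timeline

with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(h2d_done)  # GPU-side wait: the CPU does not stop here
    result = device_batch * 2            # stand-in for the forward pass

# The CPU reaches this point right away and only blocks when it needs the data.
torch.cuda.synchronize()
```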
Pipeline H2D -> compute -> D2H with events
The per-step pattern is:
CPU prepares inputs on host (no stream).
Enqueue the H2D copy on h2d_stream and record h2d_done.
compute_stream waits on h2d_done (wait_event) and enqueues the forward pass.
Record compute_done on compute_stream.
d2h_stream waits on compute_done (wait_event) and enqueues the D2H copy.
Finally, the CPU calls d2h_done.synchronize() to read the results.
This way the CPU doesn't block for most of the cycle; there's only one final synchronization per batch, when the outputs reach the host.
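Here is a hedged sketch of one step of this pipeline in PyTorch; the model, the buffers and their shapes are placeholders, and only the stream/event wiring reflects the pattern above:

```python
import torch

h2d_stream, compute_stream, d2h_stream = (torch.cuda.Stream() for _ in range(3))
h2d_done, compute_done, d2h_done = (torch.cuda.Event() for _ in range(3))

@torch.no_grad()
def run_step(model, host_inputs, device_inputs, device_outputs, host_outputs):
    # Step 1 (preparing host_inputs on the CPU) happens before this call.

    # Step 2: enqueue the H2D copy and record h2d_done.
    with torch.cuda.stream(h2d_stream):
        device_inputs.copy_(host_inputs, non_blocking=True)
        h2d_done.record(h2d_stream)

    # Steps 3-4: compute waits for the copy (on the GPU), runs the forward,
    # and records compute_done.
    with torch.cuda.stream(compute_stream):
        compute_stream.wait_event(h2d_done)
        device_outputs.copy_(model(device_inputs))
        compute_done.record(compute_stream)

    # Step 5: D2H waits for the forward, then copies results to pinned host memory.
    with torch.cuda.stream(d2h_stream):
        d2h_stream.wait_event(compute_done)
        host_outputs.copy_(device_outputs, non_blocking=True)
        d2h_done.record(d2h_stream)

    # Step 6: the only CPU-side wait, until the D2H copy has landed on the host.
    d2h_done.synchronize()
    return host_outputs
```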
Avoiding data corruption: double slot and memory pool
If you reuse the same device buffer for batch N and N+1 and the CPU starts overwriting it, the GPU can read partial data. The practical solution:
Have two buffer slots (A and B) and alternate: while the GPU processes A, the CPU prepares B.
This duplicates buffers, but prevents races.
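A minimal sketch of what the two slots could look like, assuming a fixed batch size and sequence length (names and shapes are illustrative):

```python
import torch

# Each slot owns its own pinned host buffers and device buffers, so the CPU can
# fill one slot while the GPU is still reading from the other.
def make_slot(batch_size: int, seq_len: int, device: str = "cuda") -> dict:
    return {
        "host_ids": torch.zeros(batch_size, seq_len, dtype=torch.long).pin_memory(),
        "dev_ids": torch.zeros(batch_size, seq_len, dtype=torch.long, device=device),
        "host_out": torch.zeros(batch_size, 1, dtype=torch.long).pin_memory(),
        "dev_out": torch.zeros(batch_size, 1, dtype=torch.long, device=device),
    }

slots = [make_slot(32, 8192), make_slot(32, 8192)]  # slot A, slot B

def slot_for_step(step: int) -> dict:
    # Even steps use slot A, odd steps use slot B: while the GPU consumes one,
    # the CPU prepares the other.
    return slots[step % 2]
```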
Next problem: if you use CUDA graphs (very useful for latency), each capture is tied to fixed memory addresses. With two slots you'd need two graphs, and if each graph allocates its own memory, VRAM usage grows. The workaround is a shared memory pool: both graphs can use the same pool as long as they never execute in parallel. In practice peak VRAM usage is close to the maximum of the two graphs, not their sum.
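A sketch of the shared-pool idea using PyTorch's CUDA graph API; the Linear layer and static tensors stand in for the real model and slot buffers:

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_in = [torch.zeros(32, 4096, device="cuda") for _ in range(2)]
static_out = [torch.zeros(32, 4096, device="cuda") for _ in range(2)]

# Warm-up on a side stream, as recommended before graph capture.
warmup_stream = torch.cuda.Stream()
warmup_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(warmup_stream), torch.no_grad():
    for slot in range(2):
        static_out[slot].copy_(model(static_in[slot]))
torch.cuda.current_stream().wait_stream(warmup_stream)

# One memory pool shared by both captures: peak VRAM tracks the larger graph,
# not the sum of both.
pool = torch.cuda.graph_pool_handle()
graphs = []
for slot in range(2):
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g, pool=pool), torch.no_grad():
        static_out[slot].copy_(model(static_in[slot]))
    graphs.append(g)

# Replay for a given slot: refresh its static input in place, then replay its graph.
static_in[0].normal_()
graphs[0].replay()
```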
Carry-over: passing tokens from N to N+1
If a request appears in both batch N and N+1, the token produced by N must be input for N+1. But when you prepare N+1 you don’t yet have that token. The solution is:
Prepare N+1 with a token placeholder (for example 0).
After N finishes and before N+1’s forward, perform the carry-over operation.
Carry-over uses a mask that indicates, by position, where to copy produced tokens. Essentially:
Select tokens to carry from outputs of N.
Zero the positions that don’t apply.
Truncate/pad to N+1’s size.
Add the resulting tensor to N+1’s input ids (placeholders were zero).
All these ops are cheap and are captured inside the CUDA graph, so they add no overhead on the critical path.
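A hedged sketch of a carry-over implementation, assuming a positional mapping between the two batches and one produced token per row; tensor names are illustrative and the actual code in transformers may differ:

```python
import torch
import torch.nn.functional as F

def carry_over(prev_tokens: torch.Tensor,      # [batch_n] tokens produced by batch N
               carry_mask: torch.Tensor,       # [batch_n] True where the request continues
               next_input_ids: torch.Tensor):  # [batch_n1] inputs of N+1, placeholders are 0
    # 1) select the tokens to carry, 2) zero the positions that don't apply
    carried = torch.where(carry_mask, prev_tokens, torch.zeros_like(prev_tokens))
    # 3) truncate or pad to the size of batch N+1
    n1 = next_input_ids.shape[0]
    if carried.shape[0] >= n1:
        carried = carried[:n1]
    else:
        carried = F.pad(carried, (0, n1 - carried.shape[0]))
    # 4) add onto the placeholders (which are zero), leaving fresh requests untouched
    next_input_ids.add_(carried)
    return next_input_ids
```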
The full asynchronous loop
Typical sequence:
Step 0: cold start. CPU prepares batch 0 in slot A and dispatches it.
Step 1: GPU processes batch 0 in A; CPU prepares batch 1 in B (evict/admit, update KV cache routing, build carry-over mask).
CPU enqueues H2D(B), records and chains events, and continues.
GPU runs the pipeline: D2H of A, H2D of B, compute B when H2D(B) completes.
CPU synchronizes with d2h_done of A, processes outputs, updates state, builds batch 2 in A, and the cycle continues.
As long as N+1's input is ready on the device when N finishes, the GPU never goes idle between batches. Usually the GPU remains the bottleneck, so the CPU finishes its work before the GPU does, which makes the overlap effective.
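Putting the pieces together, here is a self-contained sketch of the loop, with a single matmul standing in for the model and random data standing in for the CPU-side batch preparation; the slot layout and helper names are illustrative:

```python
import torch

device = torch.device("cuda")
h2d_stream, compute_stream, d2h_stream = (torch.cuda.Stream() for _ in range(3))

def make_slot(batch=32, hidden=4096):
    return {
        "host_in": torch.zeros(batch, hidden).pin_memory(),
        "dev_in": torch.zeros(batch, hidden, device=device),
        "dev_out": torch.zeros(batch, hidden, device=device),
        "host_out": torch.zeros(batch, hidden).pin_memory(),
        "h2d_done": torch.cuda.Event(),
        "compute_done": torch.cuda.Event(),
        "d2h_done": torch.cuda.Event(),
    }

slots = [make_slot(), make_slot()]
weight = torch.randn(4096, 4096, device=device)  # stand-in for the model

def dispatch(slot):
    """Enqueue H2D -> compute -> D2H for one slot; returns immediately on the CPU."""
    with torch.cuda.stream(h2d_stream):
        slot["dev_in"].copy_(slot["host_in"], non_blocking=True)
        slot["h2d_done"].record(h2d_stream)
    with torch.cuda.stream(compute_stream):
        compute_stream.wait_event(slot["h2d_done"])
        slot["dev_out"].copy_(slot["dev_in"] @ weight)
        slot["compute_done"].record(compute_stream)
    with torch.cuda.stream(d2h_stream):
        d2h_stream.wait_event(slot["compute_done"])
        slot["host_out"].copy_(slot["dev_out"], non_blocking=True)
        slot["d2h_done"].record(d2h_stream)

num_steps = 8
slots[0]["host_in"].normal_()       # step 0: cold start, prepare batch 0 in slot A
dispatch(slots[0])
for step in range(1, num_steps):
    nxt, cur = slots[step % 2], slots[(step - 1) % 2]
    nxt["host_in"].normal_()        # CPU prepares batch N+1 while the GPU runs batch N
    dispatch(nxt)                   # enqueue batch N+1; its compute waits on its own H2D
    cur["d2h_done"].synchronize()   # block only for batch N's outputs
    _ = cur["host_out"].sum()       # stand-in for processing outputs and updating state
```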
Practical results and numbers
In the reported experiment:
Synchronous: 300.6 s total, GPU active 76.0%, GPU idle 24.0% of the time.
Asynchronous: 234.5 s total, GPU active 99.4%, 22% improvement in time.
No model or kernel changes. Coordinating streams, events, double buffering and carry-over was enough.
Implementation and production recommendations
Check the implementation in transformers (entry point: continuous_batching.py, async code in ContinuousBatchingAsyncIOs).
Main pieces to implement:
Separate H2D, compute, D2H into non-default streams.
Use events for dependencies between streams.
Double buffer slots and a memory pool for CUDA graphs.
Capture carry-over in the graphs.
Watch VRAM consumption and the tradeoff between lower latency and higher complexity (more objects to keep synchronized). For long-generation workloads (16K+ tokens) and RL scenarios the gain is especially relevant.
If your pipeline uses different frameworks or older drivers, validate that stream and event behavior is the same: default stream behavior details can vary.
Reflective conclusion
Decoupling CPU preparation from GPU execution changes the equation: it’s no longer just about grouping requests but about coordinating hardware to work in parallel. With CUDA streams, events, slot buffers and carry-over you can turn GPU idle time into useful work without touching models. It’s a small architectural change with big impact on cost and performance.