If you ever wondered why your GPUs sit idle for large stretches when you train reasoning models, this note is for you. Here’s the technical but digestible version of a wide survey: 16 open-source libraries that have already solved (in different ways) the problem of asynchronous training in RL for long-reasoning models.
The problem in a nutshell
In a traditional RL loop, autoregressive generation (inference) eats most of the wall time. A single batch of rollouts of 32K tokens on a 32B model can take hours, while the training GPUs sit idle.
What’s the consequence? Low GPU utilization, huge latencies, and bottlenecks from the so-called straggler problem: a few slow samples block whole batches.
Quick numerical example (vLLM benchmarks, H100 80GB, bf16):
- 7B: ≈ 6,300 tokens/s aggregate.
- 32B: ≈ 1,200 tokens/s aggregate.
That means, for example, 512 rollouts of 8K tokens on a 32B model can take almost an hour on a single GPU. Can you imagine waiting that long for every training step?
The emerging solution: decouple and stream
The dominant pattern is simple and powerful:
- Separate inference and training into distinct GPU pools.
- Connect them with a rollout buffer that acts as a shock absorber.
- Synchronize weights asynchronously so neither side waits on the other.
That way inference produces rollouts for batch N+K while the optimizer is still updating on batch N, keeping both hardware pools busy in parallel.
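The three-step pattern above can be sketched in a few lines. This is a minimal single-process illustration, not any library's implementation: `generate_rollout` and `train_step` are hypothetical placeholders for real engine calls, and the bounded queue plays the role of the rollout buffer.

```python
import queue
import threading

BUFFER_SIZE = 8
rollout_buffer = queue.Queue(maxsize=BUFFER_SIZE)  # the "shock absorber"

def generate_rollout(step):
    # placeholder for an inference-engine call (vLLM, SGLang, ...)
    return {"step": step, "tokens": [1, 2, 3]}

def train_step(rollout):
    # placeholder for one optimizer update
    return rollout["step"]

def inference_loop(num_steps):
    for step in range(num_steps):
        rollout_buffer.put(generate_rollout(step))  # blocks when buffer is full

def training_loop(num_steps, trained):
    for _ in range(num_steps):
        rollout = rollout_buffer.get()              # blocks when buffer is empty
        trained.append(train_step(rollout))

trained = []
producer = threading.Thread(target=inference_loop, args=(32,))
consumer = threading.Thread(target=training_loop, args=(32, trained))
producer.start(); consumer.start()
producer.join(); consumer.join()
```

The bounded `maxsize` is what makes the buffer a shock absorber: inference can run ahead of training, but only by a fixed amount, which bounds staleness.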
The seven dimensions that shape the design
To compare architectures, the study proposes seven key axes. I leave them as a checklist for when you evaluate or design your own infra:
- Orchestration and concurrency primitive: Ray, asyncio, pub/sub, HTTP, Monarch, etc.
- Buffer design: no buffer, double buffer, bounded queue, unlimited stream.
- Weight sync protocol: NCCL broadcast, filesystem, HTTP, CUDA IPC, bucketing.
- Staleness management: version-based drop, depth bounding, importance sampling (IS) correction.
- Partial rollout handling: continue, abort-and-retry, save/summarize, drain before sync.
- LoRA support: adapter-only sync vs full-parameter sync.
- Training backend and parallelism: FSDP, DeepSpeed, Megatron, JAX/XLA, MoE and EP support.
Each decision changes performance, complexity and the numerical corrections required.
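As a rough illustration, the seven-axis checklist can be captured as a configuration object. Everything below (class names, enum values) is a hypothetical sketch for evaluating your own infra, not any library's API:

```python
from dataclasses import dataclass
from enum import Enum

class Staleness(Enum):
    VERSION_DROP = "version_drop"     # drop rollouts from old model versions
    DEPTH_BOUND = "depth_bound"       # bounded queue limits how far behind we get
    IS_CORRECTION = "is_correction"   # importance-sampling correction in the loss

@dataclass
class AsyncRLDesign:
    orchestration: str       # e.g. "ray", "asyncio", "pubsub", "monarch"
    buffer: str              # "none" | "double" | "bounded_queue" | "stream"
    weight_sync: str         # "nccl_broadcast" | "filesystem" | "http" | "cuda_ipc"
    staleness: Staleness
    partial_rollouts: str    # "continue" | "abort_retry" | "save" | "drain"
    lora_adapter_sync: bool  # adapter-only sync supported?
    backend: str             # "fsdp" | "deepspeed" | "megatron" | "jax"

cfg = AsyncRLDesign("ray", "bounded_queue", "nccl_broadcast",
                    Staleness.IS_CORRECTION, "abort_retry", True, "fsdp")
```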
Key findings from the survey (summarized)
- Ray dominates orchestration: 8 of 16 libraries. The actor model fits naturally for heterogeneous components.
- NCCL broadcast is the default route for weight transfer; several implementations use bucketed NCCL to reduce latency.
- Staleness is handled with three strategies: version-drop, depth bounding (bounded queue) or IS-weighted loss. Real solutions often combine approaches.
- LoRA is supported in many libraries, but adapter-only sync is not yet universal. When present, it transforms the sync problem (from GB to MB).
- MoE and EP become the emerging differentiator; not all libraries support Expert Parallelism correctly. That complicates routing, sync and LoRA per-expert.
- Generation interruption has multiple granularities: from never-stop per forward pass (e.g. PipelineRL) to blocking by batch or step.
Interruption models and weight transfer
How updates behave determines how much in-flight work you waste:
- Never-stop (swap between forward passes) allows millisecond-scale interruptions.
- Per-request abort + resume recycles partial work but adds complexity.
- Soft pause (drain in-flight) avoids aborts but introduces synchronization bubbles.
- Full-step/blocking is simple but expensive in idle time.
On transport, NCCL bucketing can reduce latencies from hundreds of ms to tens of ms in optimized implementations.
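The bucketing idea is simple: instead of one collective call per parameter tensor, pack parameters into fixed-size buckets and broadcast each bucket once. The sketch below shows only the packing step; the actual collective call (e.g. `torch.distributed.broadcast` on a flattened buffer) is not shown, and the 512 MB bucket size is an assumed value, not a recommendation from the survey:

```python
BUCKET_BYTES = 512 * 1024 * 1024  # assumed bucket cap: 512 MB

def pack_into_buckets(params, bucket_bytes=BUCKET_BYTES):
    """Greedily group (name, num_bytes) pairs into buckets under the cap."""
    buckets, current, current_bytes = [], [], 0
    for name, nbytes in params:
        if current and current_bytes + nbytes > bucket_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets

# Toy model: 140 tensors of 100 MB each (~14 GB, roughly a 7B model in bf16)
params = [(f"layer{i}.weight", 100 * 1024 * 1024) for i in range(140)]
buckets = pack_into_buckets(params)
# 5 tensors of 100 MB fit under the 512 MB cap -> 28 collective calls
# instead of 140, which is where the latency reduction comes from
```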
Concrete numbers for sizing (summary table)
| Tokens per rollout | Total tokens (512 rollouts) | 7B (6.3K tok/s) | 32B (1.2K tok/s) |
|---|---|---|---|
| 2K | ~1M | ~3 min | ~14 min |
| 8K | ~4M | ~11 min | ~56 min |
| 32K | ~16M | ~45 min | ~3.7 hours |
These numbers explain why the community moved to asynchronous architectures.
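The table's arithmetic can be reproduced directly from the throughput figures (using 8K ≈ 8,000 tokens as a round approximation):

```python
def rollout_minutes(num_rollouts, tokens_each, tokens_per_sec):
    """Wall-clock minutes to generate all rollouts at a given aggregate throughput."""
    total_tokens = num_rollouts * tokens_each
    return total_tokens / tokens_per_sec / 60

# 512 rollouts of 8K tokens on a 32B model at ~1,200 tok/s aggregate:
minutes_32b = rollout_minutes(512, 8_000, 1_200)  # ≈ 57 minutes
```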
Complex cases and emerging problems
A queue and NCCL don’t solve everything. Some challenges that show up in production:
- Critic-free algorithms reduce memory but increase sync pressure because they need more rollouts per step.
- Process reward models (PRM) make scoring costly and require asynchronous scoring pipelines.
- Multi-agent and co-evolution multiply the straggler effect; the unit of work becomes episodes, not isolated rollouts.
- Training-inference mismatch in MoE: expert routing can differ between inference and training. The fix is "Keep Routing": log the routing decisions at inference time and reuse them during the training forward pass.
- Sampling mask mismatch: top-k/top-p changes the action space between sampling and evaluation; you must record and reapply the mask so IS stays valid.
If you work with MoE, these are correctness problems, not just performance.
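The sampling-mask point deserves a concrete illustration. If the engine sampled with top-k, the behavior policy's probabilities were renormalized over only the allowed tokens; computing the training-side log-prob over the full vocabulary would bias the IS ratio. A minimal sketch, assuming the mask and behavior logits were recorded at sampling time (the numbers are made up):

```python
import math

def masked_logprob(logits, mask, token):
    """Log-prob of `token` under a softmax restricted to the recorded mask."""
    z = math.log(sum(math.exp(l) for l, allowed in zip(logits, mask) if allowed))
    return logits[token] - z

# Behavior policy (recorded at sampling time) vs current policy (training forward)
logits_behavior = [2.0, 1.0, 0.5, -1.0]
logits_current  = [1.8, 1.2, 0.4, -0.5]
mask = [True, True, True, False]   # top-k=3 mask recorded during sampling
token = 0                          # the token that was actually sampled

lp_b = masked_logprob(logits_behavior, mask, token)
lp_t = masked_logprob(logits_current, mask, token)
is_weight = math.exp(lp_t - lp_b)  # token-level IS ratio pi_train / pi_behavior
```

Reapplying the same mask on both sides is what keeps the ratio a valid importance weight; dropping it silently changes the action space between numerator and denominator.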
Why LoRA matters, and LoRA for MoE
LoRA reduces the transfer size to a few MB when the inference server supports hot-swapping adapters, making sync overhead practically imperceptible for most production models. But in MoE:
- Each expert may need its own adapters. With 64 experts, adapters scale up and are distributed across EP ranks.
- LoRA sync in MoE implies gathering adapters from multiple ranks before pushing them to the inference server.
In practice, few libraries (e.g. ART with Megatron EP) implement MoE-LoRA end-to-end.
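The gather step can be illustrated with a toy simulation. Here `all_gather_object` is a stand-in for a real collective (e.g. `torch.distributed.all_gather_object`), and the rank layout is invented for the example:

```python
def gather_expert_adapters(local_adapters, all_gather_object):
    """Merge {expert_id: adapter} dicts from every EP rank into one mapping."""
    gathered = all_gather_object(local_adapters)
    merged = {}
    for rank_adapters in gathered:
        merged.update(rank_adapters)
    return merged

# Simulated layout: 4 EP ranks, 8 experts, 2 experts per rank
ranks = [{e: f"adapter_{e}" for e in range(r * 2, r * 2 + 2)} for r in range(4)]
fake_all_gather = lambda local: ranks  # in practice this is a collective op
adapters = gather_expert_adapters(ranks[0], fake_all_gather)
# `adapters` now holds all 8 experts' LoRA weights, ready to push to inference
```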
Proposed design for TRL's async trainer (concrete and technical)
If you were to design it from scratch, these are the study’s suggested choices:
- Keep the stack simple: avoid heavy runtime as a mandatory dependency, but design the API so users can plug it in if they need it.
- Go for a bounded queue where each token carries a `model_version` tag. That enables IS correction at token granularity and staleness gating without technical debt.
- Use NCCL process groups with bucketing to pack parameters and reduce calls. vLLM and similar engines already support packed broadcasts.
- Explore advanced sync engines (Awex, Mooncake) to convert between training layouts (Megatron, FSDP) and inference layouts (vLLM, SGLang) without a full checkpoint.
- Support LoRA from day one, with an optimized path for adapter-only sync when the inference server accepts it.
- Try two strategies for in-flight rollouts when an update lands: prefix-resume (save the KV cache and pick up where generation left off) and abort-and-retry. Each has trade-offs; retaining the prefix's compute requires support in the inference engine.
- Record extra metadata during generation: expert routing, sampling masks, token-level logprobs. Without that you can’t guarantee IS correction or reproducibility in MoE.
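The bounded queue with `model_version` tags and a staleness gate from the first two bullets can be sketched as follows; the class name, version-drop policy, and `MAX_STALENESS` value are all assumptions for illustration:

```python
import queue

MAX_STALENESS = 2  # assumed: drop rollouts more than 2 versions behind

class VersionedBuffer:
    def __init__(self, maxsize=64):
        self.q = queue.Queue(maxsize=maxsize)

    def put(self, tokens, model_version):
        # every token carries the version of the weights that produced it
        self.q.put([(tok, model_version) for tok in tokens])

    def get_fresh(self, current_version, max_staleness=MAX_STALENESS):
        """Return the next rollout whose oldest token is fresh enough."""
        while True:
            rollout = self.q.get()
            oldest = min(version for _, version in rollout)
            if current_version - oldest <= max_staleness:
                return rollout
            # too stale: drop it and keep looking

buf = VersionedBuffer()
buf.put([101, 102], model_version=0)   # generated 3 updates ago -> stale
buf.put([103, 104], model_version=3)   # fresh
rollout = buf.get_fresh(current_version=3)
```

Because the version travels with each token, the same tag can later drive token-level IS correction instead of (or in addition to) dropping.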
Practical recommendations for teams
- If your scale is < 16 GPUs and you control the whole stack, an asyncio + Redis streams solution may be enough.
- For production at 64+ GPUs, Ray (or Monarch) eases scheduling, autoscaling and fault tolerance.
- Implement LoRA adapter-only sync if your inference engine supports it; it changes the sync problem forever.
- For MoE, require EP-aware training and a weight-sync plan that considers per-expert AllGather.
- Add per-token version telemetry from the start; fixing staleness without that data is costly.
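The last recommendation is cheap to implement from day one. A minimal sketch of per-token staleness telemetry, assuming each token already carries its producing version (the numbers below are illustrative):

```python
from collections import Counter

def staleness_histogram(token_versions, current_version):
    """Histogram of how many versions each token lags behind the trainer."""
    return Counter(current_version - v for v in token_versions)

# Tokens produced across three weight versions; trainer is now at version 5
hist = staleness_histogram([5, 5, 4, 4, 4, 3], current_version=5)
# lag 0: 2 tokens, lag 1: 3 tokens, lag 2: 1 token
```

Exporting this histogram per training step is usually enough to spot a growing staleness tail before it corrupts training.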
Final thoughts
The practical lesson is clear: asynchronous architecture is no longer experimental. Separating inference and training, defining a well-thought-out buffer, and designing sync and staleness protocols are requirements for training reasoning models at scale. Scared of the complexity? Good, that's a healthy sign: the complexity forces you to design health metrics, staleness gates and automated tests. If you solve it well, your GPUs will stop sleeping and your research will move much faster.
