If you ever wondered why your GPUs sit idle for large stretches when you train reasoning models, this note is for you. Here’s the technical but digestible version of a wide survey: 16 open-source libraries that have already solved (in different ways) the problem of asynchronous training in RL for long-reasoning models.
The problem in a nutshell
In a traditional RL loop, autoregressive generation (inference) eats most of the wall time. A single batch of rollouts of 32K tokens on a 32B model can take hours, while the training GPUs sit idle.
What’s the consequence? Low GPU utilization, huge latencies, and bottlenecks from the so-called straggler problem: a few slow samples block whole batches.
Quick numerical example (vLLM benchmarks, H100 80GB, bf16):
- 7B: ≈ 6,300 tokens/s aggregate.
- 32B: ≈ 1,200 tokens/s aggregate.
That means, for example, 512 rollouts of 8K tokens on a 32B model can take almost an hour on a single GPU. Can you imagine waiting that long for every training step?
The emerging solution: decouple and stream
The dominant pattern is simple and powerful:
- Separate inference and training into distinct GPU pools.
- Connect them with a rollout buffer that acts as a shock absorber.
- Synchronize weights asynchronously so neither side waits on the other.
That way inference produces rollouts for batch N+K while the optimizer is still updating on batch N, keeping both hardware pools busy in parallel.
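The three-step pattern above can be sketched in a few lines. This is a minimal single-process illustration, not any library's implementation: `generate_rollout` and `train_step` are hypothetical placeholders for real engine calls, and the bounded queue plays the role of the rollout buffer.

```python
import queue
import threading

BUFFER_SIZE = 8
rollout_buffer = queue.Queue(maxsize=BUFFER_SIZE)  # the "shock absorber"

def generate_rollout(step):
    # placeholder for an inference-engine call (vLLM, SGLang, ...)
    return {"step": step, "tokens": [1, 2, 3]}

def train_step(rollout):
    # placeholder for one optimizer update
    return rollout["step"]

def inference_loop(num_steps):
    for step in range(num_steps):
        rollout_buffer.put(generate_rollout(step))  # blocks when buffer is full

def training_loop(num_steps, trained):
    for _ in range(num_steps):
        rollout = rollout_buffer.get()              # blocks when buffer is empty
        trained.append(train_step(rollout))

trained = []
producer = threading.Thread(target=inference_loop, args=(32,))
consumer = threading.Thread(target=training_loop, args=(32, trained))
producer.start(); consumer.start()
producer.join(); consumer.join()
```

The bounded `maxsize` is what makes the buffer a shock absorber: inference can run ahead of training, but only by a fixed amount, which bounds staleness.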
The seven dimensions that shape the design
To compare architectures, the study proposes seven key axes. I leave them as a checklist for when you evaluate or design your own infra:
- Orchestration and concurrency primitive: Ray, asyncio, pub/sub, HTTP, Monarch, etc.
- Buffer design: no buffer, double buffer, bounded queue, unlimited stream.
- Weight sync protocol: NCCL broadcast, filesystem, HTTP, CUDA IPC, bucketing.
- Staleness management: version-based drop, depth bounding, importance sampling (IS) correction.
- Partial rollout handling: continue, abort-and-retry, save/summarize, drain before sync.
- LoRA support: adapter-only sync vs full-parameter sync.
- Training backend and parallelism: FSDP, DeepSpeed, Megatron, JAX/XLA, MoE and EP support.
Each decision changes performance, complexity and the numerical corrections required.
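As a rough illustration, the seven-axis checklist can be captured as a configuration object. Everything below (class names, enum values) is a hypothetical sketch for evaluating your own infra, not any library's API:

```python
from dataclasses import dataclass
from enum import Enum

class Staleness(Enum):
    VERSION_DROP = "version_drop"     # drop rollouts from old model versions
    DEPTH_BOUND = "depth_bound"       # bounded queue limits how far behind we get
    IS_CORRECTION = "is_correction"   # importance-sampling correction in the loss

@dataclass
class AsyncRLDesign:
    orchestration: str       # e.g. "ray", "asyncio", "pubsub", "monarch"
    buffer: str              # "none" | "double" | "bounded_queue" | "stream"
    weight_sync: str         # "nccl_broadcast" | "filesystem" | "http" | "cuda_ipc"
    staleness: Staleness
    partial_rollouts: str    # "continue" | "abort_retry" | "save" | "drain"
    lora_adapter_sync: bool  # adapter-only sync supported?
    backend: str             # "fsdp" | "deepspeed" | "megatron" | "jax"

cfg = AsyncRLDesign("ray", "bounded_queue", "nccl_broadcast",
                    Staleness.IS_CORRECTION, "abort_retry", True, "fsdp")
```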
Key findings from the survey (summarized)
- Ray dominates orchestration: 8 of 16 libraries. The actor model fits naturally for heterogeneous components.
- NCCL broadcast is the default route for weight transfer; several implementations use bucketed NCCL to reduce latency.
- Staleness is handled with three strategies: version-drop, depth bounding (bounded queue) or IS-weighted loss. Real solutions often combine approaches.
- LoRA is supported in many libraries, but adapter-only sync is not yet universal. When present, it transforms the sync problem (from GB to MB).
- MoE and EP become the emerging differentiator; not all libraries support Expert Parallelism correctly. That complicates routing, sync and LoRA per-expert.
- Generation interruption has multiple granularities: from never-stop per forward pass (e.g. PipelineRL) to blocking by batch or step.
Interruption models and weight transfer
How updates behave determines how much in-flight work you waste:
- Never-stop (swap between forward passes) allows millisecond-scale interruptions.
- Per-request abort + resume recycles partial work but adds complexity.
- Soft pause (drain in-flight) avoids aborts but introduces synchronization bubbles.
- Full-step/blocking is simple but expensive in idle time.
On transport, NCCL bucketing can reduce latencies from hundreds of ms to tens of ms in optimized implementations.
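The bucketing idea is simple: instead of one collective call per parameter tensor, pack parameters into fixed-size buckets and broadcast each bucket once. The sketch below shows only the packing step; the actual collective call (e.g. `torch.distributed.broadcast` on a flattened buffer) is not shown, and the 512 MB bucket size is an assumed value, not a recommendation from the survey:

```python
BUCKET_BYTES = 512 * 1024 * 1024  # assumed bucket cap: 512 MB

def pack_into_buckets(params, bucket_bytes=BUCKET_BYTES):
    """Greedily group (name, num_bytes) pairs into buckets under the cap."""
    buckets, current, current_bytes = [], [], 0
    for name, nbytes in params:
        if current and current_bytes + nbytes > bucket_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets

# Toy model: 140 tensors of 100 MB each (~14 GB, roughly a 7B model in bf16)
params = [(f"layer{i}.weight", 100 * 1024 * 1024) for i in range(140)]
buckets = pack_into_buckets(params)
# 5 tensors of 100 MB fit under the 512 MB cap -> 28 collective calls
# instead of 140, which is where the latency reduction comes from
```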
Concrete numbers for sizing (summary table)
| Tokens per rollout | Total tokens (512 rollouts) | 7B (6.3K tok/s) | 32B (1.2K tok/s) |
|---|---|---|---|
| 2K | ~1M | ~3 min | ~14 min |
| 8K | ~4M | ~11 min | ~56 min |
| 32K | ~16M | ~45 min | ~3.7 hours |
These numbers explain why the community moved to asynchronous architectures.
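The table's arithmetic can be reproduced directly from the throughput figures (using 8K ≈ 8,000 tokens as a round approximation):

```python
def rollout_minutes(num_rollouts, tokens_each, tokens_per_sec):
    """Wall-clock minutes to generate all rollouts at a given aggregate throughput."""
    total_tokens = num_rollouts * tokens_each
    return total_tokens / tokens_per_sec / 60

# 512 rollouts of 8K tokens on a 32B model at ~1,200 tok/s aggregate:
minutes_32b = rollout_minutes(512, 8_000, 1_200)  # ≈ 57 minutes
```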
Complex cases and emerging problems
A queue and NCCL don’t solve everything. Some challenges that show up in production:
- Critic-free algorithms reduce memory but increase sync pressure because they need more rollouts per step.
- Process reward models (PRM) make scoring costly and require asynchronous scoring pipelines.
- Multi-agent and co-evolution multiply the straggler effect; the unit of work becomes episodes, not isolated rollouts.
- Training-inference mismatch in MoE: expert routing can differ between inference and training. The fix is "Keep Routing": log the routing decisions at inference time and reuse them during the training forward pass.
- Sampling mask mismatch: top-k/top-p changes the action space between sampling and evaluation; you must record and reapply the mask so IS stays valid.
If you work with MoE, these are correctness problems, not just performance.
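The sampling-mask point deserves a concrete illustration. If the engine sampled with top-k, the behavior policy's probabilities were renormalized over only the allowed tokens; computing the training-side log-prob over the full vocabulary would bias the IS ratio. A minimal sketch, assuming the mask and behavior logits were recorded at sampling time (the numbers are made up):

```python
import math

def masked_logprob(logits, mask, token):
    """Log-prob of `token` under a softmax restricted to the recorded mask."""
    z = math.log(sum(math.exp(l) for l, allowed in zip(logits, mask) if allowed))
    return logits[token] - z

# Behavior policy (recorded at sampling time) vs current policy (training forward)
logits_behavior = [2.0, 1.0, 0.5, -1.0]
logits_current  = [1.8, 1.2, 0.4, -0.5]
mask = [True, True, True, False]   # top-k=3 mask recorded during sampling
token = 0                          # the token that was actually sampled

lp_b = masked_logprob(logits_behavior, mask, token)
lp_t = masked_logprob(logits_current, mask, token)
is_weight = math.exp(lp_t - lp_b)  # token-level IS ratio pi_train / pi_behavior
```

Reapplying the same mask on both sides is what keeps the ratio a valid importance weight; dropping it silently changes the action space between numerator and denominator.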
Why LoRA matters, and LoRA for MoE
LoRA reduces the transfer size to a few MB when the inference server supports hot-swapping adapters, making sync overhead practically imperceptible for most production models. But in MoE:
- Each expert may need its own adapters. With 64 experts, adapters scale up and are distributed across EP ranks.
- LoRA sync in MoE implies gathering adapters from multiple ranks before pushing them to the inference server.
In practice, few libraries (e.g. ART with Megatron EP) implement MoE-LoRA end-to-end.
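The gather step can be illustrated with a toy simulation. Here `all_gather_object` is a stand-in for a real collective (e.g. `torch.distributed.all_gather_object`), and the rank layout is invented for the example:

```python
def gather_expert_adapters(local_adapters, all_gather_object):
    """Merge {expert_id: adapter} dicts from every EP rank into one mapping."""
    gathered = all_gather_object(local_adapters)
    merged = {}
    for rank_adapters in gathered:
        merged.update(rank_adapters)
    return merged

# Simulated layout: 4 EP ranks, 8 experts, 2 experts per rank
ranks = [{e: f"adapter_{e}" for e in range(r * 2, r * 2 + 2)} for r in range(4)]
fake_all_gather = lambda local: ranks  # in practice this is a collective op
adapters = gather_expert_adapters(ranks[0], fake_all_gather)
# `adapters` now holds all 8 experts' LoRA weights, ready to push to inference
```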
Proposed design for TRL's async trainer (concrete and technical)
If you were to design it from scratch, these are the study’s suggested choices:
- Keep the stack simple: avoid heavy runtime as a mandatory dependency, but design the API so users can plug it in if they need it.
- Go for a bounded queue where each token carries a `model_version` tag. That enables IS correction at token granularity and staleness gating without technical debt.
- Use NCCL process groups with bucketing to pack parameters and reduce calls. vLLM and similar engines already support packed broadcasts.
- Explore advanced sync engines (Awex, Mooncake) to convert between training layouts (Megatron, FSDP) and inference layouts (vLLM, SGLang) without a full checkpoint.
- Support LoRA from day one, with an optimized path for adapter-only sync when the inference server accepts it.
- Try two strategies for in-flight rollouts when an update lands: prefix-resume (save the KV cache and pick up where generation left off) and abort-and-retry. Each has trade-offs; retaining the prefix's compute requires support in the inference engine.
- Record extra metadata during generation: expert routing, sampling masks, token-level logprobs. Without that you can’t guarantee IS correction or reproducibility in MoE.
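The bounded queue with `model_version` tags and a staleness gate from the first two bullets can be sketched as follows; the class name, version-drop policy, and `MAX_STALENESS` value are all assumptions for illustration:

```python
import queue

MAX_STALENESS = 2  # assumed: drop rollouts more than 2 versions behind

class VersionedBuffer:
    def __init__(self, maxsize=64):
        self.q = queue.Queue(maxsize=maxsize)

    def put(self, tokens, model_version):
        # every token carries the version of the weights that produced it
        self.q.put([(tok, model_version) for tok in tokens])

    def get_fresh(self, current_version, max_staleness=MAX_STALENESS):
        """Return the next rollout whose oldest token is fresh enough."""
        while True:
            rollout = self.q.get()
            oldest = min(version for _, version in rollout)
            if current_version - oldest <= max_staleness:
                return rollout
            # too stale: drop it and keep looking

buf = VersionedBuffer()
buf.put([101, 102], model_version=0)   # generated 3 updates ago -> stale
buf.put([103, 104], model_version=3)   # fresh
rollout = buf.get_fresh(current_version=3)
```

Because the version travels with each token, the same tag can later drive token-level IS correction instead of (or in addition to) dropping.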
Practical recommendations for teams
- If your scale is < 16 GPUs and you control the whole stack, an asyncio + Redis streams solution may be enough.
- For production at 64+ GPUs, Ray (or Monarch) eases scheduling, autoscaling and fault tolerance.
- Implement LoRA adapter-only sync if your inference engine supports it; it changes the sync problem forever.
- For MoE, require EP-aware training and a weight-sync plan that considers per-expert AllGather.
- Add per-token version telemetry from the start; fixing staleness without that data is costly.
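The last recommendation is cheap to implement from day one. A minimal sketch of per-token staleness telemetry, assuming each token already carries its producing version (the numbers below are illustrative):

```python
from collections import Counter

def staleness_histogram(token_versions, current_version):
    """Histogram of how many versions each token lags behind the trainer."""
    return Counter(current_version - v for v in token_versions)

# Tokens produced across three weight versions; trainer is now at version 5
hist = staleness_histogram([5, 5, 4, 4, 4, 3], current_version=5)
# lag 0: 2 tokens, lag 1: 3 tokens, lag 2: 1 token
```

Exporting this histogram per training step is usually enough to spot a growing staleness tail before it corrupts training.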
Final thoughts
The practical lesson is clear: asynchronous architecture is no longer experimental. Separating inference and training, defining a well-thought-out buffer, and designing sync and staleness protocols are requirements for training reasoning models at scale. Scared of the complexity? Good, that's a healthy sign: the complexity forces you to design health metrics, staleness gates and automated tests. If you solve it well, your GPUs will stop sleeping and your research will move much faster.
