LinkedIn shares a hands-on walkthrough to make GPT-OSS work as the backbone for agentic RL training: it's not just about tuning weights, but about solving a stack of incompatibilities between MoE, attention kernels, parallelism, and the inference infrastructure.
What is agentic RL and why does it matter?
Agentic reinforcement learning extends classic single-turn RL into policies that plan and act across multi-step trajectories. Instead of optimizing a single reply, you train the model to choose queries, call tools, observe results, and adjust behavior based on rewards that depend on long-term decisions.
Why should you care? Because real-world applications (recruiting, tool-enabled assistants, multi-step workflows) need this adaptive ability: fetching information, refining prompts, and coordinating tools in sequence, not just producing one polished reply.
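To make the loop concrete, here is a minimal sketch of a multi-step agentic rollout. The `policy`, `tools`, `reward_fn`, and `task` interfaces are hypothetical placeholders for illustration, not verl or GPT-OSS APIs.

```python
# Minimal sketch of a multi-step agentic rollout. `policy`, `tools`,
# `reward_fn`, and `task` are hypothetical interfaces, not verl APIs.
def rollout(policy, tools, reward_fn, task, max_turns=8):
    trajectory, observation = [], task.prompt
    for _ in range(max_turns):
        action = policy.generate(observation)            # tool call or final answer
        trajectory.append((observation, action))
        if action.tool_name is None:                     # final answer ends the episode
            break
        observation = tools[action.tool_name](action.arguments)
    reward = reward_fn(task, trajectory)                 # credit depends on the whole trajectory
    return trajectory, reward
```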
Problems that surfaced when trying agentic RL with GPT-OSS
During experiments with verl and GPT-OSS-20B, several serious failures appeared:
- KL and entropy exploded, and rewards didn’t rise.
- Non-zero importance-sampling clipping even for on-policy minibatches, where PPO expects a ratio of exactly 1.
- Exploding gradients and a training–inference mismatch between FSDP and inference engines (vLLM / SGLang).
- Instability tied to handling of attention sinks and FlashAttention kernels.
- Unexpected OOMs caused by the materialization of hidden states in the MoE path while computing log-probs.
These symptoms were not noise: they pointed to subtle interactions between the MoE architecture, the attention sink implementation, and differences between inference and training code paths.
Technical diagnosis (essential summary)
1) Dual forward pass in MoE and importance ratio
The computation of old_log_prob was done in a separate pass from the current log_prob. In MoE models the gating network can route to different experts between the two passes because of tiny numerical differences or stochasticity. That changes log_prob and produces a ratio different from 1, triggering PPO clipping and breaking the on-policy assumption.
The fix applied:
```python
if on_policy:
    old_log_prob = log_prob.detach()
else:
    old_log_prob = model_inputs["old_log_probs"]
```
By forcing old_log_prob to equal the freshly computed log_prob (using detach() to avoid gradient flow) you restore on-policy integrity.
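For context, a minimal sketch of the PPO clipped surrogate shows why the routing drift matters: if `log_prob` and `old_log_prob` come from two passes that route tokens to different experts, the ratio deviates from 1 and the clip term silently distorts what should be an on-policy update. Tensor names here are illustrative, not verl's API.

```python
import torch

def ppo_clipped_loss(log_prob, old_log_prob, advantage, clip_eps=0.2):
    # On-policy, the ratio should be exactly 1; MoE routing drift between the
    # two forward passes pushes it away from 1 and activates the clip term.
    ratio = torch.exp(log_prob - old_log_prob)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```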
2) Attention sinks: forward/backward and training–inference mismatch
Attention sinks are scalar parameters per head that absorb part of the softmax mass as a “virtual token”. In the implementation used:
- The sink participated in softmax normalization but didn’t contribute to the output (it didn’t multiply V).
- The backward wasn’t supported in upstream FlashAttention v2/v3, so the sink’s gradient was undefined and training diverged.
Implementing the sink’s backward (reusing the adjusted forward from the vLLM fork and adding the sink gradient) fixed a large token-level mismatch between inference and training, improving stability.
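A naive PyTorch reference of the mechanism described above, assuming a per-head scalar `sink_logit` (names are illustrative): the sink logit enters the softmax normalization but has no value vector, so it never contributes to the output. Because this is plain PyTorch, autograd handles the sink's backward automatically; the team's work was implementing the equivalent gradient inside the fused FlashAttention kernel.

```python
import torch
import torch.nn.functional as F

def attention_with_sink(q, k, v, sink_logit):
    # q, k, v: [batch, heads, seq, head_dim]; sink_logit: [heads]
    # (causal masking omitted for brevity)
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5           # [b, h, s, s]
    sink = sink_logit.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    weights = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)  # sink absorbs softmax mass
    return weights[..., :-1] @ v                                    # sink column never multiplies V
```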
3) MoE materialization and OOM
When computing log-probs under FSDP, execution fell through to the inference path, which duplicates `hidden_states` with `repeat(num_experts, 1)` and runs batched expert ops. That materializes huge tensors and causes OOMs, even for GPT-OSS-20B with long context windows.
The training path processed experts in a for-loop (slower but far more memory-efficient). The fix was to patch out the materialization and force the path that doesn't duplicate hidden states unnecessarily. The issue is tracked in the Hugging Face Transformers repository: https://github.com/huggingface/transformers/issues/40073
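The two execution paths can be contrasted schematically. The sketch below is not the Transformers code; it assumes dense routing weights and a plain ReLU expert MLP purely to show where the memory goes.

```python
import torch

def moe_batched(hidden_states, w_up, w_down, routing_weights):
    # Inference-style path: replicate the hidden states once per expert and run
    # batched matmuls. Fast, but repeat() materializes a
    # [num_experts * tokens, hidden] tensor, which is what triggered the OOMs.
    num_experts = w_up.shape[0]
    x = hidden_states.repeat(num_experts, 1).view(num_experts, -1, hidden_states.shape[-1])
    y = torch.bmm(torch.relu(torch.bmm(x, w_up)), w_down)           # [experts, tokens, hidden]
    return (routing_weights.T.unsqueeze(-1) * y).sum(dim=0)

def moe_looped(hidden_states, w_up, w_down, routing_weights):
    # Training-style path: one expert at a time; slower, but peak memory stays
    # proportional to a single copy of the hidden states.
    out = torch.zeros_like(hidden_states)
    for i in range(w_up.shape[0]):
        out += routing_weights[:, i:i + 1] * (torch.relu(hidden_states @ w_up[i]) @ w_down[i])
    return out
```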
4) Rollout correction and training–inference mismatch
Aggressive optimizations in inference engines (e.g., SGLang with Triton) produce numerical differences versus the training stack (FSDP + FlashAttention-v2). That can turn an apparently on-policy update into off-policy and destabilize learning. Applying rollout corrections (sequence-level importance sampling) and aligning routing reduces these discrepancies and stabilizes gradients.
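A minimal sketch of sequence-level importance sampling for rollout correction, assuming you have per-token log-probs from the rollout engine and from the training forward pass plus a padding mask; the function name and the clip value are illustrative, not verl's implementation.

```python
import torch

def sequence_is_weights(train_log_probs, rollout_log_probs, mask, clip=2.0):
    # Sum per-token log-prob differences over each sequence, exponentiate to
    # get one importance weight per trajectory, and clip so a large
    # training-inference gap cannot blow up the update.
    delta = ((train_log_probs - rollout_log_probs) * mask).sum(dim=-1)
    return torch.clamp(torch.exp(delta), max=clip).detach()
```

Each weight then scales that trajectory's contribution to the policy-gradient loss.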
Solutions implemented by the team
- Restore on-policy integrity in PPO: replace `old_log_prob` with `log_prob.detach()` when the minibatch is on-policy.
- Implement the backward pass for the attention sink in FlashAttention v3 (based on the vLLM fork), fixing gradient computation and the token-level mismatch.
- Patch the MoE path in Transformers to avoid materializing all experts and reduce memory during `compute_log_prob`.
- Integrate sequence parallelism compatible with attention sinks and FlashAttention v3 to scale long contexts (max response 16k, prompt 8k) without blowing up memory.
- Keep rollout correction (sequence-level IS) and isolation tests (e.g., freezing attention layers) to locate bottlenecks.
Experimental results summarized
After the fixes, training with GPT-OSS-20B showed:
- Stable gradient norms with no explosions.
- Faster and better reward improvement on GSM8K (single-turn) and on agentic tasks like ReTool.
- Improvements on verified instruction-following tasks (VerifyIf) and on out-of-domain evaluations.
- Validation that the sink fixes and memory optimizations are necessary for GPT-OSS to match the RL convergence trends of dense variants and reference models such as Qwen-2.5-32B.
In short: after the fixes, agentic RL training on GPT-OSS is stable and usable as a backbone for multi-step agents.
Practical tips for teams who want to reproduce or extend this
- If you work with MoE and PPO, check that `old_log_prob` and `log_prob` come from the exact same forward pass, or replace `old_log_prob` with the detached fresh log-probs in on-policy mode.
- Verify compatibility between attention kernels in training and inference: mismatches in log-ppl are a red flag (see the check sketched after this list).
- Enable rollout correction (sequence-level IS) to mitigate training–inference mismatch.
- Implement sequence parallelism and avoid paths that materialize all experts if you have long windows.
- Add unit tests for attention sinks (forward+backward) and log routing paths in MoE to detect nondeterminism.
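As a starting point for the kernel-compatibility check mentioned above, here is a rough sketch that compares per-token log-probs from the rollout engine against those recomputed by the training stack on the same tokens; the inputs and the tolerance are placeholders for whatever your stack exposes.

```python
import torch

def check_logprob_mismatch(rollout_log_probs, train_log_probs, mask, tol=1e-2):
    # Compare per-token log-probs from the rollout engine with those recomputed
    # by the training stack (FSDP + FlashAttention) on the same token ids.
    diff = (rollout_log_probs - train_log_probs).abs() * mask
    max_diff = diff.max().item()
    mean_diff = (diff.sum() / mask.sum()).item()
    print(f"max |dlogp| = {max_diff:.4f}, mean |dlogp| = {mean_diff:.4f}")
    # Large gaps usually point at kernel or routing differences (attention-sink
    # handling, MoE expert selection), not numerical noise.
    return max_diff < tol
```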
Final reflection
The lesson is clear: bringing an open-source LLM into the agentic world isn’t just “more data and more GPUs”. You need to inspect how architecture (MoE and sinks), attention kernels, and parallelism topology interact. A small change in the inference path or in a sink implementation can turn stable learning into a quick collapse.
If you’re building agents with open LLMs, plan engineering time: ensuring deterministic behavior in MoE, full backward support for special mechanisms like sinks, and memory-efficient execution paths are investments that pay off in stability and scalability.
