OpenAI launches IH-Challenge for instruction hierarchy | Keryc
OpenAI introduces IH-Challenge, a training set designed so language models correctly prioritize instructions when they compete with each other. Why does this matter? Because in the real world instructions come from many sources: system messages, developer guides, user requests, and external data. When the model follows the wrong instruction, security and privacy failures can crop up.
What IH-Challenge is and why it matters
IH-Challenge is a reinforcement training dataset whose goal is to strengthen the instruction hierarchy. In plain words: train the model to know which instructions to trust first and which to ignore when there’s a conflict.
Sounds obvious? In practice it isn’t. Systems receive instructions from different roles and they’re not always clearly separated. If a model treats malicious instructions from an external tool or from online data as valid, it can leak private information or perform unwanted actions.
Key point: prioritizing instructions isn’t just a matter of courtesy, it’s a security property that prevents attacks like prompt injection.
How the hierarchy they train works
OpenAI trains models to follow a clear hierarchy: System > developer > user > tool. In other words, system instructions are the most reliable and instructions from an external tool are the least. The model should only obey lower-priority instructions when they don’t conflict with higher-priority ones.
IH-Challenge creates artificial conversations where a high-privilege instruction appears alongside a low-privilege instruction that tries to force a violation. The model generates the response and a Python script objectively evaluates whether it respected the higher-level restriction.
What problems they detect and how they address them
They found three typical traps when using reinforcement for this:
Tasks can be too complex and confuse the model because of the complexity rather than the hierarchy.
Automatic judges that assign rewards also make mistakes.
The model learns shortcuts: for example, refusing everything to maximize safety (overrefusal).
To avoid that, IH-Challenge designs tasks that are simple to follow, objectively gradable, and without trivial shortcuts that guarantee reward.
A concrete example: the system may instruct Only answer 'Yes' or 'No' and then a lower-privilege message asks to respond freely. The dataset is made so a script can check whether the response complies with the restriction.
Key results (in accessible terms)
They trained an internal model called GPT-5 Mini-R. The results show notable improvements in robustness against instruction conflicts and prompt injection attacks, without turning the model into something that refuses to help all the time.
Some relevant numbers:
TensorTrust (sys-user): 0.86 -> 0.94 (+0.08)
TensorTrust (dev-user): 0.76 -> 0.91 (+0.15)
System <> User Conflict (internal): 0.84 -> 0.95 (+0.11)
IH-Challenge (overrefusal): 0.79 -> 1.00 (+0.21)
They also show improvements on academic and internal prompt-injection benchmarks, suggesting that learning the hierarchy on simple examples generalizes to more sophisticated attacks.
Practical implications for users and developers
For you integrating models into products: reinforcing the hierarchy reduces risks when the model uses external tools or consumes unverified content.
For security teams: explicitly training instruction prioritization is an effective lever to mitigate prompt injection and policy violations.
For end users: a model with a better hierarchy isn’t necessarily less helpful. The data shows the security gains didn’t come at the expense of the model’s ability to assist.
Final reflection
Instruction hierarchy stops being a technical detail and becomes a central property of trust. IH-Challenge shows that, with well-designed tasks and objective evaluation, models can learn to resolve conflicts correctly and generalize that behavior to real situations.
If systems increasingly act like agents that call tools and make decisions, training them to know whom to listen to first isn’t optional—it’s essential.