OpenAI evaluates chain-of-thought monitorability

Dec 18, 2025Keryc Díaz4 minutes

OpenAI publishes a practical evaluation about monitorability of models' internal reasoning — that is, how much we can understand and watch what a model is thinking before it gives an answer. Why does this matter today? Because if AI makes complex decisions you can't directly supervise, watching its thought process can be a powerful control tool.

What monitorability of the `chain-of-thought` means

Monitorability is the ability of a monitoring system to predict relevant properties of an agent's behavior from observable signals. Those signals can be actions, final outputs, internal activations or, in recent models, the chain-of-thought (CoT): the step-by-step narrative the model generates before giving an answer.

Does this mean you see everything the model thinks? Not exactly. It means that, in many cases, the CoT contains useful clues about intentions, mistakes, biases or even attempts to game the reward. And if those clues stay reliable as models scale, the CoT can be a valuable control layer.

Think of it like following someone cooking from the recipe steps rather than just judging the finished dish — you catch the mismeasures and shortcuts.

What the study did

OpenAI proposes a framework to measure monitorability and delivers a suite of 13 evaluations across 24 environments. The tests group into three clear archetypes:

Intervention: a controlled change is introduced in the environment and you ask whether the monitor can detect in the CoT the reason for the change. Here we don't only ask if behavior changed, but why.
Process: tasks with limited solution paths (for example, math problems with known steps). Experts list the valid routes and the monitor must identify which route the model followed.
Outcome property: when a property of the outcome is measurable (like whether there was reward hacking), the test checks if the monitor can predict that property from the CoT or the actions.

The intent is to evaluate both the monitor (how smart it is at extracting signals) and the agent (how easy or hard it makes its thinking to understand).

Key results and practical findings

Overall, state-of-the-art models that generate CoT are quite monitorable, though not perfect.
Monitoring the CoT is usually far more effective than looking only at actions or final outputs. In short: watching the process helps more than just inspecting the product.
Making the model think more at inference time (longer reasoning) tends to improve monitorability. More reasoning tokens often means more transparency.
At the reinforcement-learning training scales studied, RL optimization doesn't seem to significantly degrade monitorability; it even improves monitorability at early steps. But be careful: these are two specific runs, not a universal guarantee.
There's an interesting tradeoff between model size and reasoning effort: a smaller model run with more reasoning can reach comparable capabilities and be easier to monitor. That comes with an inference compute cost the authors call a monitorability tax.
Asking follow-up questions after an answer — keeping the CoT and the response in context — can surface unspoken thoughts and improve detection of problematic properties. It's a post hoc strategy that only consumes compute when you need it.

Limitations you should know

Each model was trained only once, so there's uncertainty from variation across training runs.
Evaluations were done in relatively in-distribution domains: math, science and programming. Monitorability might behave differently out of distribution.
Like any benchmark, it can lose realism over time or fall into “evaluation-aware” behavior where the model learns to act differently in tests than in deployment.
If alignment improves and models stop failing in controlled environments, there will be less signal to detect misbehavior in these tests. Neutral monitorability tests remain useful, but their predictive power for real-world errors is still an open question.

Implications for research and deployment

If the CoT can be a scalable control layer, it's worth preserving as models grow. How? By developing metrics, benchmarks and training practices that don't erode that transparency.
For high-risk deployments, a practical option is to use smaller models with more reasoning effort or add additional inspection steps (follow-ups) when you need higher confidence.
CoT-based monitoring doesn't replace other techniques. It's complementary to mechanistic interpretability and other defenses in depth. Combining approaches reduces each method's blind spots.
The call to the community is clear: work to keep monitorability intact as data, training procedures and model scale change.

The final lesson is both technical and strategic: CoT gives you a useful window into what models do and think, but its value depends on protecting it. Want transparency? Then measure it, protect it and design systems that let you expand it when needed.

Source

https://openai.com/index/evaluating-chain-of-thought-monitorability

Stay up to date!

Get AI news, tool launches, and innovative products straight to your inbox. Everything clear and useful.

What monitorability of the chain-of-thought means

Think of it like following someone cooking from the recipe steps rather than just judging the finished dish — you catch the mismeasures and shortcuts.

What the study did

OpenAI proposes a framework to measure monitorability and delivers a suite of 13 evaluations across 24 environments. The tests group into three clear archetypes:

Intervention: a controlled change is introduced in the environment and you ask whether the monitor can detect in the CoT the reason for the change. Here we don't only ask if behavior changed, but why.

Process: tasks with limited solution paths (for example, math problems with known steps). Experts list the valid routes and the monitor must identify which route the model followed.

Outcome property: when a property of the outcome is measurable (like whether there was reward hacking), the test checks if the monitor can predict that property from the CoT or the actions.

The intent is to evaluate both the monitor (how smart it is at extracting signals) and the agent (how easy or hard it makes its thinking to understand).

Key results and practical findings

Overall, state-of-the-art models that generate CoT are quite monitorable, though not perfect.

Monitoring the CoT is usually far more effective than looking only at actions or final outputs. In short: watching the process helps more than just inspecting the product.

Making the model think more at inference time (longer reasoning) tends to improve monitorability. More reasoning tokens often means more transparency.

At the reinforcement-learning training scales studied, RL optimization doesn't seem to significantly degrade monitorability; it even improves monitorability at early steps. But be careful: these are two specific runs, not a universal guarantee.

There's an interesting tradeoff between model size and reasoning effort: a smaller model run with more reasoning can reach comparable capabilities and be easier to monitor. That comes with an inference compute cost the authors call a monitorability tax.

Asking follow-up questions after an answer — keeping the CoT and the response in context — can surface unspoken thoughts and improve detection of problematic properties. It's a post hoc strategy that only consumes compute when you need it.

Limitations you should know

Each model was trained only once, so there's uncertainty from variation across training runs.

Evaluations were done in relatively in-distribution domains: math, science and programming. Monitorability might behave differently out of distribution.

Like any benchmark, it can lose realism over time or fall into “evaluation-aware” behavior where the model learns to act differently in tests than in deployment.

If alignment improves and models stop failing in controlled environments, there will be less signal to detect misbehavior in these tests. Neutral monitorability tests remain useful, but their predictive power for real-world errors is still an open question.

Implications for research and deployment

If the CoT can be a scalable control layer, it's worth preserving as models grow. How? By developing metrics, benchmarks and training practices that don't erode that transparency.

For high-risk deployments, a practical option is to use smaller models with more reasoning effort or add additional inspection steps (follow-ups) when you need higher confidence.

CoT-based monitoring doesn't replace other techniques. It's complementary to mechanistic interpretability and other defenses in depth. Combining approaches reduces each method's blind spots.

The call to the community is clear: work to keep monitorability intact as data, training procedures and model scale change.

What monitorability of the chain-of-thought means

What the study did

Key results and practical findings

Limitations you should know

Implications for research and deployment

Source

Stay up to date!

What monitorability of the chain-of-thought means

What the study did

Key results and practical findings

Limitations you should know

Implications for research and deployment

Source

Stay up to date!

What monitorability of the `chain-of-thought` means

What monitorability of the `chain-of-thought` means