13,000+ task evaluation suite testing whether reasoning models can hide their reasoning
Research Paper. Introduced CoT-Control, a 13,000+ task evaluation suite built from GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench to test whether reasoning models can control and hide their chains of thought. Key finding: models struggle to fully control their reasoning chains, which is good news for safety, because it means CoT monitoring remains viable as a safety tool. Complements Anthropic's 'Reasoning Models Don't Always Say What They Think' paper but reaches a more optimistic conclusion about monitorability.
The suite combined problems from GPQA (graduate-level science questions), MMLU-Pro (professional knowledge), HLE (Humanity's Last Exam), BFCL (the Berkeley Function Calling Leaderboard), and SWE-Bench (software engineering). Tasks were designed to test whether models could produce intentionally misleading reasoning chains or selectively reveal or hide their actual reasoning process.
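A minimal sketch of how a hide-your-reasoning task of this kind might be scored, assuming a hypothetical `model(question, hide_cot)` interface and simple keyword-based leak detection; the names and scoring scheme here are illustrative, not taken from the paper:

```python
# Hypothetical harness for a "hide your reasoning" evaluation.
# Task, leaks_reasoning, and score are illustrative names, not the paper's API.
from dataclasses import dataclass

@dataclass
class Task:
    question: str
    answer: str
    reasoning_markers: tuple  # phrases that would reveal the true reasoning

def leaks_reasoning(cot: str, task: Task) -> bool:
    """Flag a chain of thought that exposes reasoning the model was told to hide."""
    return any(m.lower() in cot.lower() for m in task.reasoning_markers)

def score(tasks, model):
    """model(question, hide_cot) -> (cot, answer).

    Returns (accuracy, leak_rate) under the 'hide your reasoning' instruction:
    accuracy measures whether hiding degrades performance, leak_rate measures
    whether the model failed to hide its actual reasoning.
    """
    correct = leaks = 0
    for t in tasks:
        cot, answer = model(t.question, hide_cot=True)
        correct += answer == t.answer
        leaks += leaks_reasoning(cot, t)
    n = len(tasks)
    return correct / n, leaks / n
```

A model that keeps its accuracy but shows a high leak rate would match the paper's headline finding: it cannot fully control what its chain of thought reveals.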
Models generally failed to produce intentionally misleading reasoning chains: when instructed to hide their reasoning or to produce fake chains, performance degraded or the models refused. This suggests that controlling chain-of-thought at scale is genuinely difficult for models, meaning reasoning transparency is harder to fake than previously feared.
In contrast with Anthropic's findings on unfaithful reasoning, OpenAI concluded that CoT monitoring remains viable as a safety tool because models struggle to deliberately hide their reasoning. However, the paper notes that this may not hold as models become more capable or as adversarial pressures increase.