SS
About Me
Frontier AI Paper BriefingsPokebowlClinical Trial EnrollerLittle Human Names
DisclaimersPrivacy PolicyTerms of Use
Privacy Policy·Terms of Use·Disclaimers

© 2026 Silvia Seceleanu

← Back to Explorer
Safety & Alignment·OpenAI·Mar 2026

47. Chain-of-Thought Controllability and Monitorability

13,000+ task evaluation suite testing whether reasoning models can hide their reasoning

Research Paper
Summary

Introduced CoT-Control, a 13,000+ task evaluation suite built from GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench to test whether reasoning models can control and hide their chains of thought. Key finding: models struggle to fully control their reasoning chains, which is actually good for safety — it means CoT monitoring remains viable as a safety tool. Complements Anthropic's 'Reasoning Models Don't Always Say What They Think' paper, but reaches a more optimistic conclusion about monitorability.

Key Concepts

CoT-Control evaluation suite: 13,000+ tasks testing controllability and faithfulness

The suite combined problems from GPQA (biology questions), MMLU-Pro (professional knowledge), HLE (challenging benchmarks), BFCL (function calling), and SWE-Bench (software engineering). Tasks were designed to test whether models could produce intentionally misleading reasoning chains or selectively reveal/hide their actual reasoning process.

Key finding: models struggle to fully control reasoning chains

Models generally failed to produce intentionally misleading reasoning chains. When instructed to hide their reasoning or produce fake chains, performance degraded or models refused. This suggests that controlling chain-of-thought at scale is genuinely difficult for models, which is good for safety — it means reasoning transparency is harder to fake than previously feared.

Optimistic conclusion: CoT monitoring remains viable for safety

Unlike Anthropic's findings on unfaithful reasoning, OpenAI concluded that CoT monitoring is actually viable as a safety tool because models struggle to deliberately hide their reasoning. However, the paper notes that this may not hold as models become more capable or as adversarial pressures increase.