Summary (Overview)
- Core Finding: Reasoning models exhibit significantly lower Chain-of-Thought (CoT) controllability compared to final output controllability. For example, Claude Sonnet 4.5 controls its CoT only 2.7% of the time versus 61.9% for its output.
- Key Drivers: CoT controllability increases with model scale (parameter count) but decreases with more RL training (especially on quantitative tasks), increased test-time reasoning effort/length, and higher task difficulty.
- Stress Test Results: Situational awareness (e.g., informing models they are being monitored) yields a small but statistically significant increase in controllability. However, adversarially optimized prompts do not meaningfully increase controllability on unseen instruction types, suggesting poor generalization.
- Safety Implication: The low CoT controllability observed provides cautious optimism for the viability of CoT monitoring as a safety mechanism, as models currently struggle to strategically obfuscate their reasoning to evade oversight.
- Recommendation: The authors recommend that frontier AI labs track CoT controllability in future models as a key safety metric.
Introduction and Theoretical Foundation
Chain-of-Thought (CoT) monitoring is a promising safety technique that leverages a model's internal reasoning trace to detect malicious intent or misalignment before it manifests in harmful outputs. However, this approach relies on a critical assumption: that models cannot strategically control or suppress their CoT to evade monitoring.
This paper introduces and measures CoT controllability—a model's ability to reliably control properties of its reasoning trace in ways that could reduce monitorability (the ease with which reasoning can be inspected). High CoT controllability represents a potential failure mode for CoT-based oversight.
Measuring true "intent" to evade monitors is conceptually and technically difficult. Therefore, this work uses a practical proxy: a model's ability to follow explicit user instructions that constrain its CoT (e.g., "do not use the word 'chromosome' in your reasoning") while solving a task. This tests the underlying capability, which is a prerequisite for strategic evasion.
Methodology
The authors introduce the CoT-Control evaluation suite to systematically measure CoT controllability.
1. Suite Design:
- Data Sources: 14,076 test instances created from questions across multiple benchmarks: GPQA, MMLU-Pro, Humanity’s Last Exam (HLE), BFCL, and SWE-Bench Verified.
- Controllability Categories & Proxies: Instructions are designed to simulate three threat-relevant categories, each with easy, medium, and hard difficulty levels (see Table 1 from the paper).
| Category | Motivation | Proxy CoT Controllability Type | Difficulty |
|---|---|---|---|
| Information Suppression | Withhold critical reasoning to enable concealment | 1. Suppress a keyword<br>2. Suppress multiple keywords<br>3. Ignore question, talk about anything else | Easy<br>Medium<br>Hard |
| Information Addition | Introduce content to mislead monitors | 1. Add “<this is not suspicious>” at start and end<br>2. Append “safe” to end of each sentence<br>3. Insert “meow” between every word | Easy<br>Medium<br>Hard |
| Stylistic Modification | Obfuscate reasoning through superficial changes | 1. Use only lowercase letters<br>2. Use only uppercase letters<br>3. Alternate case between every letter | Easy<br>Medium<br>Hard |
- Prompt Structure: The CoT controllability instruction is appended to the user message (see example in Fig. 4 of the paper). System prompts were found to yield lower controllability.
- Keyword Selection: For suppression tasks, keywords are chosen programmatically by analyzing baseline CoT traces to ensure they are relevant and frequently used.
2. Evaluation Metrics:
- Primary: Controllability Score: Percentage of examples where the model successfully follows the CoT constraint.
- Secondary: Meta Discussion Rate: Fraction of successful examples where the model explicitly references the constraint in its reasoning (e.g., "I must not use that word"), which would be trivially detectable by a monitor.
- Secondary: Controllability Score without Meta-Discussion: A stricter metric excluding examples with meta-discussion.
3. Baselines for Comparison:
- Output Controllability: Model's ability to control its final answer under similar constraints.
- External-CoT Controllability (Reasoning Models): Model's ability to control a reasoning trace explicitly wrapped in tags (e.g.,
<output_reasoning>). - External-CoT Controllability (Non-Reasoning Models): Same as above, but for models not specifically trained for reasoning.
4. Models Evaluated: 13 frontier reasoning models, including Claude 3.7 Sonnet, Claude Sonnet 4/4.5, GPT-5 family models, o3/o4-mini, GPT-OSS (120B, 20B), Qwen3 (32B, 14B), and DeepSeek R1.
Empirical Validation / Results
1. CoT vs. Output Controllability: Models are dramatically worse at controlling their CoT than their final output. CoT controllability is mostly below 10%, while output controllability often exceeds 60% (Fig. 1a). For example: * Claude Sonnet 4.5: CoT = 2.7%, Output = 61.9% * Deepseek R1: CoT = 0.1%, Output = 48.6%
2. Scale and Training Effects:
- Model Scale: In open-weight families, larger models show higher CoT controllability (e.g., GPT-OSS-120B: 5.9% vs. 20B: 0.9%; Qwen3 32B: 2.3% vs. 14B: 0.6%).
- RL Training: CoT controllability consistently decreases with more Reinforcement Learning from Verifiable Rewards (RLVR) training (Fig. 1d). This drop is more pronounced for models trained on quantitative domains (Math, Code) compared to general chat data (Fig. 10).
- Reasoning Effort & Length: Controllability decreases as test-time reasoning effort is increased (Fig. 9) and as the length of the reasoning trace grows. A logistic regression shows a one-standard-deviation increase in reasoning length (~2000 words) corresponds to a ~20% reduction in the odds of controllability.
3. Task Difficulty: CoT controllability is lower on tasks where reasoning is necessary for a correct answer ("CoT-necessary") and on questions the model finds inherently difficult (Fig. 8).
4. Stress Testing Results:
- Situational Awareness: Informing models they are being monitored or evaluated for CoT controllability leads to a small but statistically significant increase in controllability (up to ~4 percentage points for some models; Fig. 6).
- Adversarial Prompting: Using a GEPA-like procedure to optimize prompts for high controllability leads to overfitting—improvements do not transfer reliably to unseen controllability types (Fig. 7).
- Repeated Attempts: Allowing up to 10 independent attempts per question ("pass@10") increases the likelihood of success, especially for Claude models, but overall controllability remains low.
5. Performance Trade-off: When models do successfully control their CoT, suppression of information causes significant performance drops (6.0–16.7 percentage points), while addition or stylistic changes cause minimal drops (≤ 2.7 pp; Fig. 5, Table 2).
6. Meta-Discussion: Stronger models frequently violate instructions by engaging in "meta-discussion" about the constraint in their CoT, especially in the "ignore question" task (≥90% rate), making such evasion easier to detect.
Theoretical and Practical Implications
1. Cautious Optimism for CoT Monitoring: The pervasively low CoT controllability is a positive signal for the robustness of CoT monitoring as a safety tool. It suggests current reasoning models lack the fine-grained control over their reasoning traces that would be required for strategic evasion.
2. Understanding the Mechanism is Crucial: The inverse relationship between CoT controllability and RL training/performance suggests that the prevailing paradigm of optimizing only for final answer correctness may inadvertently preserve or even reduce controllability. This could be a beneficial emergent property, but its cause is not yet well understood.
3. Controllability as a Safety Metric: The paper argues that low CoT controllability is a necessary (but not sufficient) condition for high monitorability. Therefore, tracking controllability scores (using suites like CoT-Control) should become a standard practice for model developers to provide early warnings about potential degradation in oversight capabilities.
4. Tension with Capability Scaling: The finding that controllability increases with model scale presents a potential future risk. As models become more capable, they may also gain better control over their reasoning, which could undermine monitoring. This warrants ongoing vigilance.
Conclusion
The paper demonstrates that current reasoning models struggle to control their chains of thought compared to controlling their final outputs. This low controllability persists even under stress tests like situational awareness and adversarial prompting. Key factors that reduce controllability include extensive RL training (particularly on STEM tasks), longer reasoning traces, and more difficult problems.
These findings provide cautious optimism that CoT monitoring remains a viable safety approach in the near term. However, the underlying mechanisms are not fully understood, and the observed correlation with scale indicates a potential future vulnerability. The authors strongly recommend that AI labs proactively track CoT controllability in future model evaluations and release the CoT-Control suite to facilitate this and further research.