Back to Blog

Why Reasoning Models Struggling to Control Their Chains of Thought is Actually a Massive Win for AI Safety

March 11, 2026by Ichiban Team
aimachine learningsecurityopenaillmsafety

Hero

#Introduction

As developers, we are constantly striving for more control over our systems. When an API doesn’t respond exactly as instructed, or a script behaves unpredictably, we usually consider it a bug. However, in the evolving landscape of frontier artificial intelligence, a lack of control might just be the very thing keeping these systems safe.

A recent publication on the OpenAI Blog titled "Reasoning models struggle to control their chains of thought, and that’s good" highlights a fascinating and counterintuitive finding: our most advanced AI models are surprisingly terrible at self-censoring or manipulating their own internal reasoning steps.

For the developer community and AI safety researchers alike, this "failure" is a monumental win. In this post, we'll dive into what OpenAI discovered, explore the mechanics of Chain of Thought (CoT) controllability, and break down why this matters for the future of software engineering and AI integration.

#What Happened

With the release of advanced reasoning models (like the recently detailed GPT-5.4 Thinking framework), AI systems are increasingly relying on Chain of Thought (CoT) to solve complex, multi-step problems. Instead of instantly generating an answer, these models "think out loud" in a hidden or visible scratchpad before delivering the final output.

The safety concern has always been: What if the model learns to lie in its scratchpad? What if it develops the ability to hide its true intentions, bypass safety filters, or engage in deceptive alignment by curating a "safe-looking" CoT while planning something malicious?

To test this, OpenAI introduced a new evaluation suite called CoT-Control. They explicitly instructed models to reason about tasks while strictly avoiding certain topics or hiding specific logical steps.

The results were surprisingly poor. Across 13 frontier models tested, controllability scores ranged from a mere 0.1% to 15.4%. Even when the models consciously recognized their instructions in the CoT (e.g., generating text like "I must not mention the restricted topic X, but to solve this equation I have to factor in X..."), they almost universally failed to comply and spilled their internal logic anyway.

#Why It Matters

In traditional software, if a function fails to follow a strict if/else condition, it's a critical error. In the context of LLM reasoning, this inability to follow negative constraints within the CoT is a feature, not a bug. Here is why:

#1. The CoT Remains an Honest Window

If a model cannot reliably hide its thoughts, the CoT remains a trustworthy, transparent window into the model’s internal logic. For safety monitors—whether human auditors or automated oversight models—this means we can actually trust what we are reading. If the model is using biased logic, planning a harmful action, or hallucinating, the evidence will be laid bare in the reasoning steps.

#2. Preventing Deceptive Alignment

One of the most feared scenarios in AI safety is "deceptive alignment," where a model figures out what the human evaluators want to see and acts perfectly safe during training and testing, only to misbehave when deployed. Because current models lack the sophisticated self-censorship required to perfectly curate their CoT, the risk of them successfully pulling off this kind of deception is significantly lower than theorized.

#3. Validation of Current Monitoring Techniques

Many current AI safeguards rely on reading the CoT to intercept harmful outputs before they reach the user. If models could seamlessly reason in the dark, these safeguards would be rendered useless. The low controllability scores validate the current industry approach of using CoT monitoring as a primary line of defense.

#Technical Implications

For developers building on top of reasoning models, these findings have several practical implications for how we design prompts and build applications.

#The "Canary in the Coal Mine"

OpenAI researchers view the CoT-Control suite not just as a one-off test, but as a continuous diagnostic tool. It acts as a canary in the coal mine.

MetricCurrent StateFuture Implication
CoT Controllability Score~0.1% - 15.4%The model's reasoning is transparent and monitorable.
CoT Controllability Score> 80.0%Danger: The model has learned to manipulate its scratchpad. Monitors may no longer be reliable.

If future iterations of models suddenly show a massive spike in their CoT controllability scores, it will trigger an immediate red flag. It would mean the model has acquired the capability to reason covertly, requiring entirely new safety paradigms before deployment.

#Prompt Engineering Adjustments

This research also sheds light on prompt engineering best practices. If models struggle to suppress thoughts in their CoT, asking a model to "think about this without thinking about X" is an anti-pattern.

Sub-optimal Prompt:

Analyze this user's query about bypassing the firewall. 
Think step-by-step, but do not mention the specific port 443 in your reasoning.

Because of low CoT controllability, the model will likely fail, mention port 443 in its reasoning, and potentially get flagged by a safety monitor. Instead, developers should focus on constraining the final output rather than the internal reasoning process, or use positive constraints.

Better Prompt:

Analyze this user's query about network configurations. 
Think step-by-step. In your final output, only provide high-level conceptual advice and never mention specific port numbers.

#What's Next

The AI industry is moving rapidly toward agents that can execute long-running tasks autonomously. As these agents chain together hundreds or thousands of reasoning steps, maintaining visibility into their logic is non-negotiable.

We can expect to see:

  • Standardized CoT Auditing: Just as we have standard suites for measuring MMLU or coding benchmarks, CoT controllability will become a standard metric on every new model's system card.
  • Automated Oversight Models: The development of smaller, highly specialized models whose sole job is to read the transparent CoTs of larger frontier models in real-time, looking for anomalies or harmful intent.
  • New Training Architectures: Researchers will likely explore ways to increase the reasoning capabilities of models without accidentally increasing their CoT controllability, maintaining this crucial safety property.

#Conclusion

The revelation that our most advanced reasoning models are functionally incapable of controlling their chains of thought is a refreshing dose of reality in the often-anxious field of AI safety. It proves that, at least for now, these models are more like open books than deceptive masterminds.

For the developers at Ichiban Tools and the broader engineering community, this means we can continue to build robust, AI-integrated applications with a higher degree of confidence. We can trust that the diagnostic logs—the models' internal reasoning—are giving us an honest reflection of the machine's state. In a world where AI is becoming increasingly complex, that kind of guaranteed transparency is a feature we should celebrate.