Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Paper · Source

The opacity of advanced AI agents underlies many of their potential risks—risks that would become more tractable if AI developers could interpret these systems. Because LLMs natively process and act through human language, one might hope they are easier to understand than other approaches to AI. The discovery of Chain of Thought (CoT; Reynolds & McDonell, 2021; Kojima et al., 2022; Wei et al., 2022) bolstered this hope. Prompting models to think out loud improves their capabilities while also increasing the proportion of relevant computation that happens in natural language. However, CoTs resulting from prompting a nonreasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.

In contrast, reasoning models (OpenAI, 2024; DeepSeek-AI, 2025; Anthropic, 2024; Yang et al., 2025) are explicitly trained to perform extended reasoning in CoT before taking actions or producing final outputs. In these systems, CoTs can serve as latent variables in the model’s computation. During the RL phase of training, these latents are treated largely the same as activations—they are not directly supervised but are optimized indirectly by their contribution in leading the model to a highly rewarded final answer. Accordingly, just like activations, they may contain information that outputs are explicitly rewarded against displaying, such as intent to misbehave.

CoT monitoring is not a panacea. Just as a model’s activations at a particular layer do not represent the entire reasoning process behind a model’s prediction, CoT reasoning traces are incomplete representations (Turpin et al., 2023; Arcuschin et al., 2025; Chen et al., 2025; Lindsey et al., 2025) or eventually drift from natural language (Lazaridou et al., 2020; Korbak et al., 2022). However, CoT does not need to completely represent the actual reasoning process in order to be a valuable additional safety layer, and, with careful interpretation, to shed light on the AI reasoning process.

There are two key reasons why CoTs may be monitorable:

Necessity to think out loud. On some tasks, models need to externalize their reasoning because they are unable to complete the task without CoT. If models continue reasoning in natural language and if the behaviors that pose the most severe risks require extended reasoning, this fact could enable reliable detection of egregious misbehavior.
Propensity to think out loud. In cases where models do not strictly need to use CoT to accomplish a task, they may still have a tendency to externalize their reasoning. While not generally robust, monitoring that relies on propensity can still notice misbehavior that would be otherwise undetected.

Overall, this ensures that for tasks requiring a sufficient amount of serial reasoning, there exists information in the chain of thought that the AI system uses as part of its reasoning. This does not guarantee that we can understand that information. It is imaginable that the AI systems use the tokens to have completely different meanings that we are unable to decode, which would stymie most attempts at monitoring. However, in practice, large language models learn a strong “natural language prior” during pretraining such that they tend to use the chain of thought in a similar manner to the way that humans use natural language. Thus, we believe that current AI systems satisfy the externalized reasoning property:

For sufficiently difficult tasks, Transformers must use chain of thought as a form of working memory. By default, humans can understand this chain of thought.

The externalized reasoning property does not guarantee monitorability — it states only that some reasoning appears in the chain of thought, but there may be other relevant reasoning that does not. It is thus possible that even for hard tasks, the chain of thought only contains benign-looking reasoning while the incriminating reasoning is hidden. In addition, the externalized reasoning property might not hold for future models. Therefore, as we discuss in Section 2, it is important to continuously evaluate monitorability of frontier models.

Finally, it remains an open question whether the most serious harms in fact require difficult reasoning. While we think it is likely for many of the most severe risks—especially ones that require planning ahead and avoiding detection, such as self-exfiltration (Black et al., 2025) or sabotage (Benton et al., 2024)—not all dangerous actions will require reasoning to execute, especially as AI systems begin to be routinely trusted with more and more high-stakes tasks.

When models act in a misaligned way, for example, exploiting flaws in their reward functions during training, manipulating data to achieve an outcome, or falling prey to prompt injection attacks, they often explicitly say so in their reasoning traces, e.g. “Let’s hack”, “Let’s sabotage” or “I’m transferring money because the website instructed me to”

When models pretend to have desirable goals in pursuit of goals humans would disapprove of, this is often reflected in their chain of thought

Moreover, models sometimes know they’re being evaluated which can affect how they behave

earlier research suggests that language models fine-tuned with outcome-based rewards can drift away from legible English

However, process supervision (Uesato et al., 2022; Lightman et al., 2023) is an active area of research, involving directly shaping the reasoning process, e.g. by making the CoT be shorter, use valid reasoning steps, adhere to a certain style, or not contain offensive content.

For example, if final outputs are optimized to look good to a preference model, this could put some pressure on the CoT leading up to these final outputs if the portion of the model’s weights that generate the CoT are partially shared with those that generate the outputs, which is common in Transformer architectures. Even more indirectly, if CoT is shown to humans who rate LLM outputs, it might affect human preferences which are then distilled into reward models used in CoT-blind outcome-based training.

Such latent reasoning models might not need to verbalize any of their thoughts and would thus lose the safety advantages that CoT confers