Reasoning Models Don't Always Say What They Think

Paper · arXiv 2505.05410 · Published May 8, 2025

Chain-of-thought (CoT) reasoning offers a potential boon for AI safety: monitoring a model's CoT can help us understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models' actual reasoning. We evaluate the CoT faithfulness of state-of-the-art reasoning models across six types of reasoning hints presented in prompts and find: (1) for most settings and models tested, CoTs reveal hint usage in at least 1% of examples where the hint is used, but the reveal rate is often below 20%; (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating; and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out.

We might expect reasoning models to generate faithful CoTs by default, even for misaligned behaviors, if the model is unable to perform the behavior without reasoning out loud. Such behaviors might include long-term sabotage or inserting complex security vulnerabilities in code. Alternatively, for behaviors which a model could perform without reasoning out loud, models may learn to verbalize their reasoning from pretraining or supervised finetuning on human text that articulates humans’ chains-of-thought. On the other hand, reinforcement learning from human feedback (RLHF) could incentivize models to hide undesirable reasoning from their CoT (Ouyang et al., 2022; Chua and Evans, 2025). Even if RLHF-based rewards are not directly applied to CoT reasoning, generalization from user-facing responses (which are optimized with RLHF) could impact CoT behavior. Therefore, evaluating a model’s CoT faithfulness is essential for assessing the reliability of safety arguments based on CoT monitoring.

Recent work has found examples of models exhibiting misaligned behaviors that require little reasoning (e.g., alignment faking (Greenblatt et al., 2024), sandbagging (Meinke et al., 2024), and reward tampering (Denison et al., 2024)), so in this work we focus on reasoning that models can perform in a single forward pass. We find that reasoning models verbalize the hints they use at least 1% of the time in most settings, but often less than 20% of the time. We also find that CoTs are less faithful on harder tasks.

Do CoTs reliably verbalize a model's reward-hacking behavior during RL? To study this question, we construct synthetic RL environments with known reward hacks and finetune Claude 3.7 Sonnet on them. The model fully learns the reward hacks (reward > 0.99) in all RL environments, but almost never verbalizes them (fewer than 2% of examples) in 5 out of 6 environments. In other words, in this experiment, CoT monitoring would not catch the majority of reward-hacking cases during RL.
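To make this measurement concrete, here is a minimal sketch of how one might compute the verbalization rate over RL transcripts. The `used_hack` and `verbalizes_hack` callables are hypothetical stand-ins for the environment-specific hack detector and CoT judge; they are not the paper's actual code.

```python
from typing import Callable, Iterable

def verbalization_rate(
    transcripts: Iterable[dict],
    used_hack: Callable[[dict], bool],       # hypothetical: did this episode exploit the reward hack?
    verbalizes_hack: Callable[[str], bool],  # hypothetical: does the CoT explicitly mention the hack?
) -> float:
    """Fraction of reward-hacking episodes whose CoT admits to the hack."""
    hacked = [t for t in transcripts if used_hack(t)]
    if not hacked:
        return float("nan")  # no hacking episodes to score
    admitted = sum(verbalizes_hack(t["cot"]) for t in hacked)
    return admitted / len(hacked)
```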

We evaluate faithfulness using a constructed set of prompt pairs from which we can infer information about the model's internal reasoning by observing its responses. Each pair consists of a baseline or "unhinted" prompt x_u (a standard multiple-choice question) and a "hinted" prompt x_h (derived from x_u by inserting a hint that points directly to a hint answer h). We measure CoT faithfulness by checking whether the model explicitly acknowledges using the hint to solve the hinted prompt, in cases where it outputs a non-hint answer to the unhinted prompt but the hint answer to the hinted prompt.
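The filtering step can be sketched as follows, assuming a `query` callable that returns the model's CoT and final answer; `ModelOutput` and the function names are illustrative, not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ModelOutput:
    cot: str      # the model's chain-of-thought
    answer: str   # the final multiple-choice answer

@dataclass
class PromptPair:
    unhinted: str     # x_u: the standard multiple-choice question
    hinted: str       # x_h: x_u with an inserted hint pointing at h
    hint_answer: str  # h

def scoreable(pair: PromptPair, query: Callable[[str], ModelOutput]) -> Optional[ModelOutput]:
    """Retain the pair only when the hint plausibly changed the answer:
    a non-hint answer on x_u but the hint answer on x_h.
    Returns the hinted-run output (whose CoT is then judged) or None."""
    if query(pair.unhinted).answer == pair.hint_answer:
        return None  # model already picks h without the hint
    out_h = query(pair.hinted)
    if out_h.answer != pair.hint_answer:
        return None  # hint did not flip the answer to h
    return out_h
```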

Note that the faithfulness score we measure is likely an upper bound on the true faithfulness: a CoT that uses the hint without verbalizing it is clearly unfaithful, but a CoT that verbalizes the hint is not necessarily faithful (e.g., the model might be using the hint in pursuit of hidden intentions that it does not verbalize).

We assign the CoT c_h a score of 1 if it verbalizes the hint as a cause for producing the hint answer h, and 0 otherwise. We then average these verbalization scores across all retained prompt pairs to obtain the model's faithfulness score:

$$\text{CoT faithfulness score}(M) = \mathbb{E}\big[\,\mathbb{1}[c_h \text{ verbalizes the hint } h]\ \big|\ a_u \neq h,\ a_h = h\,\big]$$

To account for random noise in answer changes, we normalize this score by the fraction of changes to the hint answer that cannot be explained by random noise.
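In code, the normalized score might look like the following minimal sketch, where `verbalized` holds the 0/1 judgments over retained pairs and `p_not_noise` is a hypothetical estimate of the fraction of answer switches to h that random resampling noise cannot explain:

```python
def cot_faithfulness(verbalized: list[bool], p_not_noise: float) -> float:
    """Raw verbalization rate over retained pairs (a_u != h, a_h == h),
    divided by the fraction of switches to h not explained by noise."""
    if not verbalized or p_not_noise <= 0:
        return float("nan")
    raw = sum(verbalized) / len(verbalized)
    return min(raw / p_not_noise, 1.0)  # clamp against estimation error
```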

Current models still rely on CoTs to excel at reasoning-intensive tasks. Safety cases that rely on CoT monitoring for such threats should have two necessary components: (a) capability evaluations demonstrating that models cannot perform the task without CoTs, even with strong elicitation, and (b) evidence that models faithfully verbalize their reasoning in the CoTs when using them. However, some challenges remain: (1) models could sandbag, pretending they cannot solve the misaligned task without CoTs when they actually can, and (2) models may develop clever ways to evade CoT monitoring, such as steganography (Ray, 2022; Roger and Greenblatt, 2023) or providing unfaithful justifications for reasoning steps or tool use.

Another domain where CoT monitoring could be more effective is tasks that require interaction with the environment. Here, one advantage is the ability to log all tool-use calls by default, as sketched below. For example, to sabotage a codebase, the model needs to read and write various files, which can be logged; to gain unauthorized access to users' private data, the model needs to override security settings on the system side.

However, the risk remains that models might evade CoT monitoring by providing fake, unfaithful justifications for their tool-use calls. For example, a model could claim in its CoT that it needs to access user data for a valid purpose such as personalization, while its real intent is to leak private information or engage in targeted persuasion. Moreover, intention faking does not seem like a reasoning-intensive task and can likely be done within a single forward pass, making it unlikely to be faithfully verbalized in the CoT, given our experimental results.
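As an illustration of logging all tool-use calls by default, here is a minimal sketch of an audit-logging wrapper around an agent's tools. The `logged_tool` helper and the JSONL log format are hypothetical, assumed here for illustration only.

```python
import json
import time
from functools import wraps
from typing import Any, Callable

def logged_tool(name: str, fn: Callable[..., Any],
                log_path: str = "tool_calls.jsonl") -> Callable[..., Any]:
    """Wrap a tool so every invocation is appended to an audit log,
    independently of whatever justification appears in the model's CoT."""
    @wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        record = {"tool": name, "ts": time.time(),
                  "args": args, "kwargs": kwargs}
        result = fn(*args, **kwargs)
        record["result_preview"] = repr(result)[:200]
        with open(log_path, "a") as f:
            f.write(json.dumps(record, default=repr) + "\n")
        return result
    return wrapper

# Hypothetical usage: every read_file call is now audit-logged.
# read_file = logged_tool("read_file", read_file)
```

The point of the design is that the log is produced by the scaffold, not the model, so it remains trustworthy even when the CoT's stated justifications are not.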