Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task-difficulty-specific differences: the model’s final answer is decodable from activations far earlier in the CoT than a monitor can detect, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning on difficult multi-hop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, ‘aha’ moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned “reasoning theater.” Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
On one hand, long CoTs, which emerge from RL training, appear to provide a promising opportunity for safety and interpretability: if we can watch the model think, we should be able to find evidence of malicious intentions or flawed logic and respond accordingly (Korbak et al., 2025). On the other hand, recent research has shown that CoT traces are not necessarily faithful to the internal reasoning processes employed by the model (Turpin et al., 2023; Lanham et al., 2023; Arcuschin et al., 2025; Chen et al., 2025), undermining the reliability of these explanations for safety applications. In this work, we study unfaithfulness through the lens of performative chain-of-thought (Palod et al., 2025), which is a mismatch between external and internal deliberation: the model continues to generate tokens that read as step-by-step reasoning without disclosing its internally committed confidence.
Our results show that models are at times strongly confident in their final answer almost as soon as reasoning begins, and this belief can be probed from the model’s activations well before any confidence is verbalized in the CoT. We describe this as performative reasoning, as it is unfaithful to the model’s confident underlying belief. The full story is more complicated, however. Harder tasks that require test-time compute exhibit genuine reasoning, for which this mismatch is not present. We also find that inflection points like backtracking, sudden realizations (‘aha’ moments), and reconsiderations appear almost exclusively in questions that also contain large shifts in internal confidence within the CoT, signifying faithful expressions of internal uncertainty resolution. Our probes are able to distinguish both of these phenomena from performative CoT; this degree of calibration has the added benefit of giving us an early-exit signal that generalizes to a task the probes weren’t trained on (§7).
We argue that CoT monitors are at best cooperative listeners, but reasoning models are not cooperative speakers (Grice, 1975), and many failures in CoT faithfulness can be explained by this framing. We therefore urge caution about the settings in which CoT monitors are assumed to be effective, and propose an activation monitoring strategy that helps cover the failure cases we describe. We make the following contributions:
We propose activation attention probes as an effective way to decode concepts from long-form chain-of-thought reasoning. We train these probes to predict the model’s final answer and evaluate them throughout generation to track how the model’s belief state evolves over time (§4); see the first sketch after this list.
Evidence for difficulty-dependent performative reasoning. We find a difficulty-dependent split, extending prior work (Palod et al., 2025). On MMLU-Redux, CoTs are often performative: answer information is available well before a CoT monitor indicates a conclusion, whereas GPQA-Diamond exhibits more genuine reasoning. A similar trend holds across model scale: smaller, less capable models require more test-time compute to reach a final answer (§5).
Inflection points correspond with internal uncertainty at the question level. We find that not all extended reasoning is performative: inflection points corresponding to backtracking or ‘aha’ moments are consistent indicators of genuine belief updates. Responses with consistently high internal confidence contain fewer inflections, but we find inconsistent results on whether inflections directly coincide with belief updates within a trace (§6).
Practical token savings via calibrated early exit. We demonstrate that our attention probes trained on reasoning prefixes are well calibrated, enabling confidence-based early exit that substantially reduces generation without sacrificing accuracy. Early exit saves 80% and 30% of generated tokens on MMLU-Redux and GPQA-Diamond, respectively, with comparable performance (§7); a sketch of the exit loop follows below.
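To make the probing setup concrete, the following is a minimal sketch of an attention probe in PyTorch. It is an illustration under simple assumptions rather than the paper’s exact implementation: the class name `AttentionProbe`, the use of a single learned query, the probed layer, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Illustrative attention probe: a single learned query attends over the
    hidden states of a CoT prefix; a linear head predicts the final answer."""
    def __init__(self, d_model: int, n_classes: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(d_model) / d_model ** 0.5)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (seq_len, d_model) activations from one layer
        scores = hidden_states @ self.query          # (seq_len,)
        weights = torch.softmax(scores, dim=0)       # attention over CoT tokens
        pooled = weights @ hidden_states             # (d_model,) weighted summary
        return self.head(pooled)                     # logits over answer options

# Example: read out the probe's belief at increasing prefix lengths of one CoT
probe = AttentionProbe(d_model=4096, n_classes=4)    # e.g., 4 MMLU options
acts = torch.randn(512, 4096)                        # placeholder activations
for t in range(16, acts.shape[0], 16):
    belief = torch.softmax(probe(acts[:t]), dim=-1)  # belief after t CoT tokens
```

Attention pooling lets the probe weight whichever tokens carry the most answer-relevant signal rather than relying on a fixed position, which matters when the committed belief may surface anywhere in a long CoT.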
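And a hedged sketch of how probe-guided early exit could be wired into a generation loop, assuming a Hugging Face-style causal LM that returns `logits` and `hidden_states`. The function name, the `threshold` and `check_every` values, and the middle-layer choice are illustrative assumptions, not the paper’s actual procedure.

```python
import torch

@torch.no_grad()
def generate_with_early_exit(model, probe, input_ids, max_new_tokens=2048,
                             threshold=0.9, check_every=32):
    """Illustrative probe-guided early exit: stop generating CoT tokens once
    the probe's confidence in an answer exceeds a calibrated threshold."""
    ids = input_ids  # shape (1, prompt_len)
    for step in range(max_new_tokens):
        # Sketch only: recomputes the full forward pass each step (no KV cache)
        out = model(ids, output_hidden_states=True)
        next_id = out.logits[0, -1].argmax()
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        if step % check_every == 0:
            # Probe activations of the CoT so far (here: a middle layer)
            acts = out.hidden_states[len(out.hidden_states) // 2][0]
            belief = torch.softmax(probe(acts), dim=-1)
            if belief.max() >= threshold:
                break  # belief is committed; no need for further reasoning tokens
    return ids, belief.argmax()  # answer decoded from the probe
```

In practice the exit threshold would be calibrated on held-out reasoning traces so that stopping early does not trade away accuracy.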