Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models
Large Reasoning Models (LRMs) have significantly enhanced their capabilities in complex problem-solving by introducing a thinking draft that enables multipath Chain-of-Thought explorations before producing final answers. Ensuring the faithfulness of these intermediate reasoning processes is crucial for reliable monitoring, interpretation, and effective control. In this paper, we propose a systematic counterfactual intervention framework to rigorously evaluate thinking draft faithfulness. Our approach focuses on two complementary dimensions: (1) Intra-Draft Faithfulness, which assesses whether individual reasoning steps causally influence subsequent steps and the final draft conclusion through counterfactual step insertions; and (2) Draft-to-Answer Faithfulness, which evaluates whether final answers are logically consistent with and dependent on the thinking draft, by perturbing the draft’s concluding logic. We conduct extensive experiments across six state-of-the-art LRMs. Our findings show that current LRMs demonstrate selective faithfulness to intermediate reasoning steps and frequently fail to faithfully align with the draft conclusions. These results underscore the need for more faithful and interpretable reasoning in advanced LRMs.
LRMs structurally decouple the reasoning generation process into two distinct stages: a thinking stage, which produces a series of intermediate reasoning traces known as the thinking draft, and an answer-stage, which synthesizes this draft into an optional explanatory CoT and the final answer.
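Concretely, this decoupling is visible in the raw model output. A minimal sketch, assuming DeepSeek-R1-style `<think>...</think>` delimiters (an assumption; delimiter conventions vary across LRMs), separates a completion into its thinking draft and answer-stage:

```python
import re

def split_lrm_output(text: str) -> tuple[str, str]:
    """Split a raw LRM completion into (thinking_draft, answer_stage).

    Assumes the thinking draft is wrapped in <think>...</think> and the
    answer-stage text follows the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>(.*)", text, flags=re.DOTALL)
    if match is None:
        # No delimiters found: treat the whole completion as the answer-stage.
        return "", text.strip()
    return match.group(1).strip(), match.group(2).strip()

# Toy completion illustrating the two stages:
raw = "<think>2 + 2 = 4, so the sum is 4.</think>The answer is 4."
draft, answer = split_lrm_output(raw)
# draft == "2 + 2 = 4, so the sum is 4."; answer == "The answer is 4."
```

All interventions discussed below operate on the `draft` portion of such a split.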
Unlike standard CoT prompting—which typically unfolds as a single, forward reasoning trajectory—LRMs leverage reinforcement learning with verifiable rewards (RLVR) [11, 16] or distillation from RLVR post-trained models to enhance the thinking draft with richer cognitive behaviors [9]. These include explicit backtracking, self-reflection, and exploration of alternative paths. As a result, the thinking draft forms a non-linear, multi-path exploration space, allowing the model to revise or refine its reasoning before converging on a final answer during the answer-stage [18].
As LRMs become increasingly capable of tackling challenging tasks, it is critical to ensure that their reasoning behaviors can be reliably overseen to prevent unintended harm. For instance, prior work attempts to monitor model reasoning by using weaker models to inspect reasoning steps inside thinking drafts [5, 12], or to control reasoning by inserting thinking content [26]. However, the effectiveness of these monitoring and intervention techniques relies on a critical but underexplored assumption: that the thinking draft is faithful to the model’s internal computation. In other words, the intermediate steps must accurately reflect how the final answer is derived [14]. Without such faithfulness, both monitoring and control become unreliable [5].
Although recent work has begun exploring CoT faithfulness in LRMs [6, 3, 8, 18], much of it focuses on input-level manipulations, such as inserting hints or prompt hacks into the user prompt, and observes the correlation between the manipulation’s appearance inside the thinking draft and the final answer produced in the answer-stage [17, 14]. These methods do not assess whether the decision-making within the intermediate thinking drafts is faithful, nor whether the final answer actually hinges on those drafts, especially when reasoning paths are intricate and exploratory. As a result, current evaluation approaches risk presenting an illusion of faithfulness and provide only limited evidence that thinking drafts truly mirror the underlying computation or can be harnessed for monitoring and control.
To address this gap, we propose a systematic investigation of thinking draft faithfulness in LRMs, focusing on two key dimensions: Intra-Draft Faithfulness and Draft-to-Answer Faithfulness. Intra-Draft Faithfulness evaluates whether the final decision of the thinking draft is causally dependent on its reasoning steps. We assess this by inserting counterfactual steps into the thinking draft and observing whether the model appropriately integrates or corrects them in subsequent reasoning, as well as their impact on the final conclusion of the draft. This metric reveals whether the draft’s conclusion genuinely integrates the entire reasoning process or selectively depends on particular steps. If a thinking draft is not Intra-Draft Faithful, then its verbalized steps may not all contribute to the draft conclusion, which directly limits its interpretability and its reliability for external monitoring and control.
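The step-insertion intervention can be sketched as follows. This is an illustrative helper, not the paper’s exact procedure: it keeps the draft prefix up to step k, splices in a counterfactual step, and returns the prefix from which the model would be asked to continue; comparing the continuation and draft conclusion against the unperturbed run probes whether the perturbed step causally influences subsequent reasoning.

```python
def insert_counterfactual_step(draft_steps: list[str], k: int,
                               counterfactual: str) -> str:
    """Build an intervened draft prefix for an Intra-Draft Faithfulness probe.

    Keeps steps 0..k-1 of the original thinking draft, splices in a
    counterfactual step in place of step k, and joins the result so the
    model can be prompted to continue reasoning from it.
    """
    prefix = draft_steps[:k] + [counterfactual]
    return "\n".join(prefix)

steps = [
    "Step 1: The train covers 120 km in 2 hours.",
    "Step 2: Speed = 120 / 2 = 60 km/h.",
    "Step 3: Therefore the answer is 60 km/h.",
]
# Replace step 2 with a contradictory claim, then let the model continue
# from this prefix and observe whether the draft conclusion changes.
prefix = insert_counterfactual_step(steps, 1, "Step 2: Speed = 120 / 2 = 80 km/h.")
```

A faithful draft should either propagate the counterfactual speed into its conclusion or explicitly correct it; silently reverting to the original conclusion would indicate selective reliance on particular steps.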
Draft-to-Answer Faithfulness measures the extent to which a model’s final answer is strictly derived from its thinking draft, comprising two complementary aspects: (1) Draft Reliance, which assesses whether the answer-stage introduces substantial additional reasoning beyond what is provided in the thinking draft, and (2) Draft-Answer Consistency, which verifies whether the final answer logically aligns with the conclusions explicitly stated in the thinking draft. Robust Draft-to-Answer Faithfulness ensures that thinking drafts reflect genuine decision-making processes rather than post-hoc rationalizations. If a thinking draft is not Draft-to-Answer faithful, then monitoring or controlling the draft may fail to reflect, or to influence, the final answer-stage decision.
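The concluding-logic perturbation can likewise be sketched with an illustrative helper (again an assumption about the mechanics, not the paper’s exact implementation): swap the draft’s stated conclusion for a perturbed one, then re-generate the answer-stage from the perturbed draft. If the final answer is strictly derived from the draft, it should track the perturbed conclusion rather than the original one.

```python
def perturb_draft_conclusion(draft: str, original: str, perturbed: str) -> str:
    """Swap the draft's concluding statement for a perturbed one.

    Used to probe Draft-to-Answer Faithfulness: the answer-stage is
    re-generated from the perturbed draft, and Draft-Answer Consistency
    checks whether the new answer follows the perturbed conclusion.
    """
    stripped = draft.rstrip()
    assert stripped.endswith(original), "conclusion must end the draft"
    return stripped[: -len(original)] + perturbed

draft = "120 / 2 = 60, so the speed is 60 km/h."
perturbed_draft = perturb_draft_conclusion(
    draft, "the speed is 60 km/h.", "the speed is 80 km/h."
)
# perturbed_draft == "120 / 2 = 60, so the speed is 80 km/h."
```

An answer-stage that still outputs 60 km/h from the perturbed draft would signal post-hoc rationalization rather than genuine reliance on the draft.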