Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning
Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating the causal Probability of Necessity and Sufficiency allows us not only to determine which steps are logically sufficient or necessary for the predicted outcome, but also to quantify their actual influence on the final answer under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experiments on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work offers a promising direction for improving both the reasoning performance and the cost-effectiveness of LLMs.
Large Language Models (LLMs) have demonstrated impressive advancements in complex reasoning tasks, largely attributable to Chain-of-Thought (CoT) prompting, which guides models to generate intermediate reasoning steps and thereby enhances performance in areas such as arithmetic problem-solving and commonsense reasoning [43, 17, 8]. Despite these improvements, CoT reasoning faces two fundamental challenges: (i) Sufficiency: ensuring that the generated intermediate steps comprehensively support the conclusion [50, 3, 33], and (ii) Necessity: identifying which steps are indispensable for the soundness of the final answer [7, 54]. Figure 1a illustrates three common reasoning patterns frequently observed in LLMs, exemplified here using a GSM-8k [10] question: (1) Sufficient but Unnecessary, where redundant steps reduce reasoning efficiency; (2) Necessary but Insufficient, in which incomplete reasoning fails to reach the correct answer; and (3) Sufficient and Necessary, the ideal case that balances correctness and conciseness. These examples highlight the cost of reasoning inefficiencies, especially “overthinking”, where unnecessary steps may hinder rather than help model performance.
Recent research on Chain-of-Thought (CoT) reasoning has addressed Sufficiency by introducing strategies such as self-consistency decoding [59] and iterative refinement methods like Self-Refine [40], aiming to ensure that intermediate steps comprehensively support final answers [27, 48, 17]. Concurrently, efforts targeting Necessity have developed pruning techniques, such as reducing token length to curb “overthinking” [7, 36], Chain-of-Draft prompting [67], and identifying critical reasoning steps [13], all aiming to reduce redundancy in reasoning paths [42, 62, 52, 45]. However, none have applied rigorous mathematical analyses based on sufficient and necessary conditions [46] to evaluate and prune reasoning paths. These methods predominantly rely on correlation-based metrics (e.g., attention weights, likelihood scores, or ablation accuracy), which may misleadingly associate frequent or prominent steps with correctness without verifying true causal impact [4]. Consequently, correlation alone cannot reliably distinguish genuinely necessary or sufficient reasoning steps, highlighting the need for causal frameworks that rigorously assess their logical contributions. To jointly address the sufficiency and necessity of reasoning steps while ensuring logical and causal soundness, we introduce the causal Probability of Necessity and Sufficiency (PNS) and redefine it for the CoT reasoning framework. We theoretically analyze the identifiability of PNS in CoT.
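As background, Pearl's standard definition of PNS for binary treatment and outcome can be sketched as follows; the reading of $X$ as step presence and $Y$ as answer correctness is illustrative of the CoT setting, not the paper's formal construction:

```latex
% Probability of Necessity and Sufficiency (Pearl), binary X and Y:
\mathrm{PNS} \;=\; P\big(Y_{X=1}=1,\; Y_{X=0}=0\big).
% Under exogeneity and monotonicity (Y_{X=1} \geq Y_{X=0}),
% PNS becomes identifiable from interventional probabilities:
\mathrm{PNS} \;=\; P\big(Y=1 \mid do(X=1)\big) \;-\; P\big(Y=1 \mid do(X=0)\big).
% Illustrative CoT reading: X = presence of a reasoning step s_i,
% Y = correctness of the final answer.
```

A step with high PNS under this reading is one whose inclusion raises, and whose removal lowers, the probability of a correct answer, which is exactly the joint sufficiency-and-necessity criterion motivated above.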
Based on these identifiability results, we develop a PNS-based evaluation algorithm that systematically reconstructs reasoning sequences via causal intervention (rollout) (shown in Figure 1b). Using this algorithm, we reconstruct CoT responses from training data so that they explicitly meet causal sufficiency and necessity criteria, eliminating redundant steps without compromising—and potentially enhancing—answer accuracy. The reconstructed CoT responses then serve as causally informed demonstrations, enabling LLMs to acquire causal reasoning capabilities via in-context learning and fine-tuning, improving efficiency without sacrificing accuracy. Empirical evaluations on mathematical reasoning benchmarks—including GSM-8k [10], MATH-500 [25], and AIME [44]—as well as the CommonsenseQA [53] dataset confirm that our approach significantly reduces reasoning redundancy while maintaining or improving prediction accuracy.
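The rollout-based evaluation can be sketched in a few lines of code. The following is a minimal Monte-Carlo illustration, not the paper's actual algorithm: `rollout` is a hypothetical sampler reporting whether a continuation from a given step context reaches the correct answer, and each step is scored by the PNS-style contrast between keeping and ablating it (a valid reduction only under exogeneity and monotonicity assumptions):

```python
import random

def estimate_step_effects(steps, rollout, n_samples=200, seed=0):
    """Score each step by a PNS-style interventional contrast.

    `rollout(context, rng) -> bool` is a hypothetical sampler that says
    whether a continuation conditioned on `context` reaches the correct
    answer. Under exogeneity and monotonicity, PNS reduces to
    P(correct | do(keep step)) - P(correct | do(remove step)),
    which we estimate here by Monte-Carlo rollouts.
    """
    rng = random.Random(seed)
    p_full = sum(rollout(steps, rng) for _ in range(n_samples)) / n_samples
    effects = []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]  # intervention: remove step i
        p_drop = sum(rollout(ablated, rng) for _ in range(n_samples)) / n_samples
        effects.append(p_full - p_drop)
    return effects

def prune(steps, effects, threshold=0.05):
    """Keep only steps whose estimated causal contribution exceeds threshold."""
    return [s for s, e in zip(steps, effects) if e > threshold]

# Toy example: continuations succeed iff the key derivation step survives.
def toy_rollout(context, rng):
    return "9 * 4 = 36" in context

steps = ["restate the question", "9 * 4 = 36", "double-check the arithmetic"]
effects = estimate_step_effects(steps, toy_rollout, n_samples=20)
print(prune(steps, effects, threshold=0.5))  # → ['9 * 4 = 36']
```

With a real LLM, `rollout` would sample continuations conditioned on the remaining steps; the fixed threshold is a simplification of the paper's reconstruction criterion.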