Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs

Paper · arXiv 2502.02362 · Published February 4, 2025
Reasoning Critiques · Knowledge Graphs

Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large language models (LLMs) by eliciting detailed step-by-step solutions. However, due to the verbosity of LLMs, the resulting reasoning chains can be long, making it harder to verify individual steps and to trace issues arising from dependencies between steps that may lie far apart in the chain. Importantly, in mathematical reasoning each step can be derived from a small set of premises, which are a subset of the preceding steps in the chain. In this paper, we present a framework that identifies the premises of each step in order to improve the evaluation of reasoning. We restructure conventional linear reasoning chains into Premise-Augmented Reasoning Chains (PARC) by introducing premise links, yielding a directed acyclic graph in which the nodes are the steps and the edges are the premise links. Through experiments on a PARC-based dataset that we built, namely PERL (Premises and ERrors identification in LLMs), we demonstrate that LLMs can reliably identify premises within complex reasoning chains; in particular, even open-source LLMs achieve 90% recall on premise identification. We also show that PARC enables more reliable error identification: the accuracy of error identification improves by 6% to 16% absolute when step-by-step verification is carried out in PARC under each step's premises. Our findings highlight the utility of premise-centric representations for complex problem-solving tasks and open new avenues for improving the reliability of LLM-based reasoning evaluation.
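To make the PARC structure concrete, here is a minimal Python sketch of a chain-as-DAG representation (our own illustration, not code from the paper; the class and field names are hypothetical). Each step records the indices of its premise steps; because premises are restricted to preceding steps, the premise links are acyclic by construction:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    index: int    # position in the original linear chain
    text: str     # natural-language content of the step
    premises: list[int] = field(default_factory=list)  # indices of premise steps

@dataclass
class PARC:
    """A Premise-Augmented Reasoning Chain: steps plus premise links (a DAG)."""
    steps: list[Step]

    def validate(self) -> None:
        # Every premise must be a strictly earlier step, so the premise
        # links form a directed acyclic graph by construction.
        for step in self.steps:
            for p in step.premises:
                if not 0 <= p < step.index:
                    raise ValueError(f"step {step.index}: bad premise {p}")

# Toy chain: step 2 is derived from premises 0 and 1.
chain = PARC(steps=[
    Step(0, "Let x = 3."),
    Step(1, "Let y = 4."),
    Step(2, "Then x + y = 7.", premises=[0, 1]),
])
chain.validate()  # passes: all premise links point to preceding steps
```

In this representation, verifying a step only requires inspecting the step itself and its incoming premise links, rather than the entire preceding chain.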

Existing research on evaluating reasoning chains in LLMs can be broadly categorized into reference-based and reference-free methods. Reference-based methods, which rely on the availability of a ground-truth reasoning chain (Welleck et al., 2021; Han et al., 2024; Tian et al., 2021), are reliable but constrained by the significant cost of human annotation, which limits their applicability. In contrast, reference-free methods (Prasad et al., 2023; Golovneva et al., 2023; Zhu et al., 2024) bypass the need for annotations but suffer from two major drawbacks: (1) most such works assign only a chain-level score rather than localizing specific errors, and (2) they often require fine-tuning task-specific models, restricting their generalizability. One workaround is to use formal proof assistants such as Lean (Moura & Ullrich, 2021), which natively support the verification of generated proofs (Yang et al., 2024; 2023; Murphy et al., 2024), but this requires challenging auto-formalization of natural-language text (Wu et al., 2022) and often assumes the solution is already known before proof construction.

We restructure conventional linear reasoning chains (LRCs) into PARC (Premise-Augmented Reasoning Chains) by identifying the premises of each step; identifying these premises improves the traceability of a reasoning chain. We create a corresponding dataset, PERL (Premises and ERrors identification in LLMs), to test LLMs' capability to identify premises as well as the effectiveness of premise augmentation in annotating both premises and errors. Additionally, we refine the error taxonomy proposed by Golovneva et al. (2023) for mathematical reasoning by introducing a new category, the "accumulation error," which arises when a reasoning step is correct in isolation but is derived from flawed premises.
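This taxonomy suggests a simple propagation rule over the premise DAG: a step that is locally valid but depends on at least one flawed premise is an accumulation error. The sketch below illustrates that rule (our own illustrative code, assuming the paper's setup; the local-validity judgment, which in the paper comes from an LLM verifier, is abstracted here as a given input, and the label names other than "accumulation error" are our own):

```python
from enum import Enum

class Label(Enum):
    CORRECT = "correct"
    NATIVE_ERROR = "native error"              # the step itself is flawed
    ACCUMULATION_ERROR = "accumulation error"  # valid step, flawed premises

def classify_steps(premises: dict[int, list[int]],
                   locally_valid: dict[int, bool]) -> dict[int, Label]:
    """Label every step, processing steps in their original chain order.

    premises[i] lists the premise indices of step i (all smaller than i);
    locally_valid[i] says whether step i follows from its premises when
    judged in isolation (e.g., by an LLM verifier).
    """
    labels: dict[int, Label] = {}
    for i in sorted(premises):
        if not locally_valid[i]:
            labels[i] = Label.NATIVE_ERROR
        elif any(labels[p] is not Label.CORRECT for p in premises[i]):
            # Correct in isolation but derived from flawed premises.
            labels[i] = Label.ACCUMULATION_ERROR
        else:
            labels[i] = Label.CORRECT
    return labels

# Step 1 makes a local mistake; step 2 derives correctly from step 1,
# so it is flagged as an accumulation error rather than a native one.
labels = classify_steps(
    premises={0: [], 1: [0], 2: [1]},
    locally_valid={0: True, 1: False, 2: True},
)
assert labels[2] is Label.ACCUMULATION_ERROR
```

Distinguishing these two error types matters for evaluation: blaming a locally valid derivation for an upstream mistake would penalize the wrong step, whereas the premise links localize the true origin of the error.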