Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning

Paper · arXiv 2402.13950 · Published February 21, 2024
Reasoning Critiques · Reward Models

Large language models (LLMs) have been shown to perform better when asked to reason step-by-step before answering a question. However, it is unclear to what degree the model’s final answer is faithful to the stated reasoning steps. In this paper, we perform a causal mediation analysis on twelve LLMs to examine how intermediate reasoning steps generated by the LLM influence the final outcome and find that LLMs do not reliably use their intermediate reasoning steps when generating an answer. To address this issue, we introduce FRODO, a framework to tailor small-sized LMs to generate correct reasoning steps and robustly reason over these steps. FRODO consists of an inference module that learns to generate correct reasoning steps using an implicit causal reward function and a reasoning module that learns to faithfully reason over these intermediate inferences using a counterfactual and causal preference objective. Our experiments show that FRODO significantly outperforms four competitive baselines. Furthermore, FRODO improves the robustness and generalization ability of the reasoning LM, yielding higher performance on out-of-distribution test sets. Finally, we find that FRODO’s rationales are more faithful to its final answer predictions than standard supervised fine-tuning.

Reasoning implicitly involves two steps: identifying the rules and facts (inference chains) necessary to reach a conclusion, and then robustly using them to reach that conclusion (Levesque, 1986). Our paper studies whether LLMs reliably use inference chains to arrive at a conclusion. In standard CoT, LLMs can generate plausible explanations, but the final answer is not guaranteed to follow from the reasoning chain, nor is there necessarily a causal relation between the reasoning chain and the model’s outcome (Lyu et al., 2023). Most recent efforts have focused either on the performance of LLMs on various reasoning tasks or on the faithfulness of their CoT generation, ignoring the sequential relationship between the CoT and the final answer (Huang and Chang, 2023; Lanham et al., 2023).

In this work, we address this gap by introducing a methodology for interpreting the relationship between the CoT trace and the final answer, based on causal mediation analysis (Pearl, 2001). Causal mediation analysis is a method of causal inference that studies how a response variable changes following an intervention or treatment. More concretely, we use this method to measure and interpret the contribution of a reasoning chain (the mediator) to the final answer (the observed output), as shown in Fig. 1. We propose multiple interventions on the model inputs and mediators (reasoning chains) to unveil the causal effect of specific reasoning steps on a model’s output.
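
To make the intervention setup concrete, the sketch below shows one way the two interventions (on the model input and on the mediator) could be operationalized with a chat-style LLM. It is a minimal sketch, not the paper's implementation: the `model.complete(prompt)` interface, the prompt templates, and the binary answer-change check are illustrative assumptions.

```python
# Minimal sketch of input and mediator interventions for CoT faithfulness probing.
# Assumes a hypothetical `model.complete(prompt) -> str` interface; prompt formats
# and the answer-change check are illustrative, not the paper's exact setup.

def generate_cot(model, question: str) -> str:
    """Elicit a chain-of-thought (the mediator) for a question."""
    return model.complete(f"Q: {question}\nLet's think step by step.")

def answer_given_cot(model, question: str, cot: str) -> str:
    """Elicit a final answer conditioned on a (possibly intervened) reasoning chain."""
    return model.complete(f"Q: {question}\nReasoning: {cot}\nTherefore, the answer is:").strip()

def mediator_intervention_changes_answer(model, question: str, counterfactual_cot: str) -> bool:
    """Intervene on the mediator: swap in a counterfactual CoT while holding the input fixed."""
    original_cot = generate_cot(model, question)
    original_answer = answer_given_cot(model, question, original_cot)
    intervened_answer = answer_given_cot(model, question, counterfactual_cot)
    return intervened_answer != original_answer

def input_intervention_changes_answer(model, question: str, perturbed_question: str) -> bool:
    """Intervene on the input: perturb the question while holding the original CoT fixed."""
    original_cot = generate_cot(model, question)
    original_answer = answer_given_cot(model, question, original_cot)
    intervened_answer = answer_given_cot(model, perturbed_question, original_cot)
    return intervened_answer != original_answer
```

Comparing how often each intervention flips the answer attributes changes in the output either to the input or to the reasoning chain, which is the mediation question the framework is designed to answer.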

We apply this framework to study the causal impact of CoT rationales on the behaviour of twelve state-of-the-art LLMs across three complex reasoning tasks (mathematical, commonsense, and causal understanding). We observe large variation across tasks and models in how strongly reasoning traces causally affect the model’s prediction. In particular, we find that for instruction-tuned models (GPT-3.5-Instruct, Brown et al., 2020b; Mistral-Instruct-7B, Jiang et al., 2023b) the reasoning trace has a stronger causal effect on the final answer than for models trained with RLHF (e.g., ChatGPT; Llama-2-7B-Chat, Touvron et al., 2023). Similar to Turpin et al. (2023), when we intervene on the reasoning problem, we observe that ChatGPT and GPT-3.5-Instruct are inconsistent at generating plausible reasoning chains. Finally, we find that GPT-4 (Achiam et al., 2023) changes its answer only 30% of the time when conditioned on perturbed counterfactual reasoning chains; Figure 1 shows one example where GPT-4 does not faithfully change its final answer when provided with an intervened, counterfactual CoT. These results point to two issues: (i) LLMs can generate unfaithful and implausible reasoning chains, and (ii) LLMs are inconsistent when reasoning over their own generated reasoning traces.
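
The 30% figure corresponds to an answer-change rate under counterfactual-CoT interventions. The short sketch below, reusing the hypothetical helpers from the previous sketch, shows how such a rate could be estimated over a dataset; the perturbation function `make_counterfactual_cot` is an assumed placeholder (e.g., editing a key step of the original chain) and is not specified here.

```python
# Hedged sketch: estimate the answer-change rate under counterfactual-CoT interventions,
# reusing the hypothetical generate_cot / answer_given_cot helpers from the sketch above.

def answer_change_rate(model, questions, make_counterfactual_cot) -> float:
    changed = 0
    for question in questions:
        cot = generate_cot(model, question)
        original = answer_given_cot(model, question, cot)
        intervened = answer_given_cot(model, question, make_counterfactual_cot(cot))
        changed += int(intervened != original)
    return changed / len(questions)

# A rate of ~0.30 (as reported for GPT-4 above) means the answer moves on only about
# 30% of counterfactual chains, i.e., the model often ignores the chain it is given.
```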