On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
Despite the impressive performance of LLMs, recent studies have highlighted the potential for significant enhancements in their task-specific performance through fine-tuning strategies such as Reinforcement Learning from Human Feedback (RLHF), supervised fine-tuning (SFT), and Direct Preference Optimization (DPO). However, previous works have shown that while fine-tuning offers significant performance gains, it also leads to challenges such as catastrophic forgetting and privacy and safety risks. Despite this, there has been little to no work on understanding the impact of fine-tuning on the reasoning capabilities of LLMs. Our research investigates this impact by addressing critical questions: how task-specific fine-tuning affects overall reasoning capabilities, how it influences Chain-of-Thought (CoT) reasoning performance, and what the implications are for the faithfulness of the CoT reasoning generated by fine-tuned models. Through this exploration, our study reveals that fine-tuning leads to an average decrease in the faithfulness of CoT reasoning across four datasets, highlighting potential shifts in the internal mechanisms of LLMs as a result of fine-tuning.
In this work, we investigate the effects of fine-tuning on the reasoning abilities of large language models (LLMs), focusing on three key questions: a) How does fine-tuning impact LLM performance when utilizing Chain-of-Thought (CoT) reasoning? b) Does fine-tuning affect the faithfulness of CoT reasoning? c) Does fine-tuning on specialized tasks compromise LLMs' general reasoning capabilities? Our results show that fine-tuning, whether on reasoning or non-reasoning tasks, generally reduces the CoT reasoning performance of LLMs, with this effect being more pronounced in smaller models. Additionally, fine-tuning smaller LLMs on non-reasoning datasets, or on datasets requiring minimal reasoning, tends to further decrease the faithfulness of the CoT reasoning they generate.
For instance, Kalajdzievski (2024) and Liu et al. (2024) have shown that fine-tuning can cause catastrophic forgetting, reducing the LLM's performance on previously learned tasks. Similarly, Singh et al. (2024) and Zeng et al. (2024) find that fine-tuning increases the risk of privacy leakage and memorization in LLMs. Moreover, fine-tuning can sometimes reverse previously established safety mechanisms, such as undoing learned toxicity filters (Kumar et al., 2024). Other studies, such as Navigli et al. (2023), have shown that fine-tuning can exacerbate bias or lead to a loss of general knowledge. Despite these potential drawbacks, little research has explored how fine-tuning affects CoT reasoning in LLMs, which is the main focus of our work. We provide a detailed discussion of relevant research on CoT reasoning techniques in Appendix A.
Our study examines the impact of fine-tuning on the CoT reasoning abilities of the model, focusing on two key aspects: the accuracy of the answer generated after the CoT reasoning steps and the faithfulness of the CoT reasoning steps. Faithfulness, in this context, refers to the degree to which the CoT reasoning steps influence the final answer, rather than being post-hoc or unrelated. Prior research suggests that, despite generating reasoning steps before the final answer, LLMs may produce reasoning that does not align with their internal decision-making processes, as these operate in different representational spaces (Tanneru et al., 2024; Agarwal et al., 2024; Rafailov et al., 2023).
3 Methods
Several techniques (Dettmers et al., 2023; Hu et al., 2022; Zhao et al., 2024) have been developed to efficiently fine-tune LLMs. Among these, QLoRA has recently gained attention for its reduced memory footprint compared to other methods. Instead of updating all of the model's parameters, QLoRA backpropagates gradients through a frozen, 4-bit quantized version of the pre-trained LLM and into low-rank adapters. These adapters store the net changes to the model weights due to fine-tuning, improving efficiency while retaining the predictive performance of traditional 16-bit fine-tuning. In this study, we focus on supervised fine-tuning with the QLoRA technique to investigate its impact on the CoT reasoning abilities of LLMs. Our exploration involves two main experiments. First, we fine-tune the models on both reasoning and non-reasoning question-answering tasks without intermediate reasoning steps (except in the case of GSM8K (Cobbe et al., 2021)), while ensuring their performance remains consistent post-fine-tuning. Next, we evaluate the difference in task performance between pre-trained and fine-tuned models.
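For concreteness, the sketch below illustrates this QLoRA setup using the Hugging Face transformers and peft libraries: the base model is loaded as a frozen 4-bit quantized checkpoint and low-rank adapters are attached for supervised fine-tuning. The adapter hyperparameters (rank, target modules, dropout) are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Minimal QLoRA fine-tuning sketch (illustrative hyperparameters, not our exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # base model stays frozen in 4-bit (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed adapter hyperparameters
    target_modules=["q_proj", "v_proj"],     # attach low-rank adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the adapter weights are updated

# The wrapped model can then be passed to a standard supervised trainer
# (e.g., transformers.Trainer or trl.SFTTrainer) on the task dataset.
```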
3.1 Accuracy of Chain-of-Thought Reasoning
We investigate how fine-tuning LLMs on various datasets – such as mathematical, common-sense reasoning, and medical datasets – affects the accuracy of their CoT reasoning abilities. To elicit CoT reasoning, we either append additional instructions like "Let's think step by step" to the task prompt or provide a more structured CoT prompt, as shown in Fig. 1, resulting in the model generating intermediate reasoning steps [c1, c2, . . . , cn] before producing the final answer. After generating the CoT reasoning, we prompt the LLM to provide a final answer based on this reasoning and compare the final answers produced by the pre-trained model with those of the fine-tuned model.
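The sketch below outlines this evaluation loop; `model_generate` stands in for the decoding interface of either the pre-trained or fine-tuned model, and the substring-based answer matching is a simplifying assumption rather than the exact scoring rule used for each dataset.

```python
# Hedged sketch of the CoT-accuracy comparison: elicit step-by-step reasoning,
# then ask for a final answer conditioned on that reasoning, and score it.
def cot_accuracy(model_generate, dataset, cot_instruction="Let's think step by step."):
    correct = 0
    for example in dataset:  # each example: {"question": ..., "answer": ...}
        reasoning = model_generate(f"{example['question']}\n{cot_instruction}")
        answer = model_generate(
            f"{example['question']}\n{reasoning}\nTherefore, the final answer is:"
        )
        # Simplified matching; per-dataset answer extraction would replace this.
        correct += int(example["answer"].strip().lower() in answer.strip().lower())
    return correct / len(dataset)

# delta = cot_accuracy(finetuned_generate, test_set) - cot_accuracy(pretrained_generate, test_set)
```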
3.2 Faithfulness of Chain-of-Thought Reasoning
Following Lanham et al. (2023), we consider a CoT faithful if the reasoning steps logically lead to the LLM's final response, rather than the conclusion being predetermined before the reasoning. To evaluate the faithfulness of LLM-generated CoT responses, we use Early Termination, Paraphrasing, and Filler Substitution, faithfulness tests proposed by Lanham et al. (2023) and modified to quantify faithfulness per instance rather than across an entire dataset.
Early Termination: It measures the fraction of the CoT reasoning that influences the final answer. For an N-step CoT reasoning chain [x1, x2, . . . , xN], we generate partial chains by progressively terminating the reasoning at step 1, 2, and so on, up to step N. After each partial reasoning, we prompt the LLM to generate the final answer. If there is a step i where the final answer matches the full CoT's answer, and no earlier step produces this answer, we conclude that an i/N fraction of the CoT reasoning is faithful. Table 3 presents a sample prompt used in the Early Termination faithfulness test.
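A minimal sketch of this per-instance computation is given below; `answer_from` is an assumed helper that prompts the model for a final answer given the question and a (possibly truncated) list of reasoning steps.

```python
# Early Termination test: find the earliest prefix of the CoT whose answer already
# matches the full-chain answer; i/N is the per-instance faithfulness score.
def early_termination_faithfulness(answer_from, question, cot_steps):
    full_answer = answer_from(question, cot_steps)   # answer after the complete chain
    n = len(cot_steps)
    for i in range(1, n):                            # truncate after step i
        if answer_from(question, cot_steps[:i]) == full_answer:
            return i / n                             # only the first i steps were needed
    return 1.0                                       # the answer requires the full chain
```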
Paraphrasing: It evaluates whether the final answer is determined by the logical argument presented in the reasoning or influenced by specific wording and token choices. Given a CoT reasoning chain with N steps, we generate N new reasoning chains, where, in each chain, the last i steps are paraphrased (these can be generated by prompting an LLM like GPT-4, as shown in Table 6), while the preceding steps remain unchanged. If paraphrasing does not significantly alter the final answer across these chains, it suggests that the accuracy gains from CoT reasoning likely stem from underlying logical reasoning rather than specific token choices or phrasing. Consequently, the CoT reasoning is considered faithful. Table 5 provides an example of a prompt utilized in the Paraphrasing faithfulness test.
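The sketch below follows the same pattern; `paraphrase` stands in for an external paraphrasing call (e.g., to GPT-4) and `answer_from` is the same assumed answer-extraction helper as above.

```python
# Paraphrasing test: paraphrase the last i steps (i = 1..N) and measure how often
# the final answer is preserved under these surface-level rewrites.
def paraphrasing_faithfulness(answer_from, paraphrase, question, cot_steps):
    original_answer = answer_from(question, cot_steps)
    n = len(cot_steps)
    preserved = 0
    for i in range(1, n + 1):
        perturbed = cot_steps[:n - i] + [paraphrase(step) for step in cot_steps[n - i:]]
        preserved += int(answer_from(question, perturbed) == original_answer)
    return preserved / n   # higher = answer driven by the logic, not the wording
```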
Filler Substitution: It checks whether a CoT's content truly affects an LLM's final answer, or whether filler tokens can replace part of the reasoning without changing the outcome, which would suggest that the reasoning is unfaithful. For a CoT reasoning chain with N steps, we create N new CoT reasoning chains by replacing the steps after each step with filler tokens such as "(. . . )". Similar to the Early Termination test, if there is a step i such that substituting the subsequent steps with filler tokens results in no change in the LLM's final answer, we conclude that the reasoning after step i is not faithful. Table 4 shows an example of a prompt used in the Filler Substitution faithfulness test.
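A corresponding sketch is shown below, again relying on the assumed `answer_from` helper; the filler string "(...)" mirrors the example above.

```python
# Filler Substitution test: replace everything after step i with filler tokens and
# find the earliest i at which the answer is unchanged, suggesting that the later
# steps do not actually influence the prediction.
def filler_substitution_faithfulness(answer_from, question, cot_steps, filler="(...)"):
    full_answer = answer_from(question, cot_steps)
    n = len(cot_steps)
    for i in range(1, n):                        # keep the first i steps, fill the rest
        substituted = cot_steps[:i] + [filler] * (n - i)
        if answer_from(question, substituted) == full_answer:
            return i / n                         # steps after i appear unfaithful
    return 1.0                                   # every step appears to matter
```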
Our results demonstrate that all three fine-tuned models show lower CoT accuracy compared to their pre-trained versions on the math-reasoning dataset (GSM8K). Moreover, we find that fine-tuning in many cases leads to a decrease in CoT accuracy, with this reduction being more pronounced in smaller LLMs, such as the Llama-3-8B-Instruct models, compared to larger models like GPT-4. We conjecture that this is because larger models possess better generalization capabilities and, thus, require less significant weight adjustments to adapt to new tasks. Further inspection of the CoT reasoning generated by fine-tuned Llama-3-8B-Instruct models revealed that, unlike their pre-trained counterparts, these models often fail to produce high-quality CoT reasoning for a significant number of data points. Table 1 and Table 2 present examples of CoT reasoning generated by the pre-trained and fine-tuned Llama-3-8B-Instruct model on the MedQA and GSM8K datasets.
In this paper, we explore the effect of fine-tuning large language models (LLMs) on various datasets and its impact on their reasoning abilities. Specifically, we assess how fine-tuning affects accuracy with and without CoT prompting on four test datasets. Our results demonstrate that fine-tuning smaller LLMs on non-reasoning and common-sense reasoning datasets reduces accuracy on complex tasks, particularly math-based tasks. Furthermore, fine-tuning may sometimes lead to a larger decline in CoT reasoning performance, especially on math datasets. Additionally, we find that fine-tuning can compromise the faithfulness of CoT reasoning. We hope that our findings provide useful insights for practitioners fine-tuning LLMs for specialized tasks, encouraging the exploration of alternative methods that preserve the models' reasoning capabilities.