Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?
Recent progress in reasoning-oriented Large Language Models (LLMs) has been driven by introducing Chain-of-Thought (CoT) traces, where models generate intermediate reasoning traces before producing an answer. These traces, as in DeepSeek R1, are not only used to guide inference but also serve as supervision signals for distillation into smaller models. A common but often implicit assumption is that CoT traces should be semantically meaningful and interpretable to the end user. While recent research questions whether these traces need to be semantically meaningful, in this paper we ask: “Must CoT reasoning traces be interpretable to enhance LLM task performance?" We investigate this question in the Open Book Question-Answering domain by applying supervised fine-tuning (SFT) to LLaMA and Qwen models on four types of reasoning traces: (1) DeepSeek R1 traces, (2) LLM-generated summaries of R1 traces, (3) LLM-generated post-hoc explanations of R1 traces, and (4) algorithmically generated, verifiably correct traces. To quantify the trade-off between interpretability and performance, we further conduct a human-subject study with 100 participants rating the interpretability of each trace type. Our results reveal a striking mismatch: while fine-tuning on R1 traces yields the strongest performance, participants judged these traces to be the least interpretable. These findings suggest that the role of intermediate tokens as a training signal should be decoupled from end-user interpretability.
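As a concrete illustration of the fine-tuning setup described above, the sketch below shows one plausible way to pack an open-book QA item and a reasoning trace into a single SFT prompt/target pair. The field names, the <think>/<answer> delimiters, and the build_sft_example helper are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of trace-augmented SFT data formatting.
# Assumptions: the delimiter tokens, field names, and helper below are
# hypothetical; the paper's exact formatting may differ.

def build_sft_example(question: str, context: str, trace: str, answer: str) -> dict:
    """Pack an open-book QA item plus one reasoning trace into a prompt/target pair."""
    prompt = (
        "Context:\n" + context + "\n\n"
        "Question: " + question + "\n"
    )
    # The trace slot can hold any of the four trace types studied:
    # an R1 trace, its summary, a post-hoc explanation, or an
    # algorithmically generated trace.
    target = "<think>\n" + trace + "\n</think>\n<answer>" + answer + "</answer>"
    return {"prompt": prompt, "target": target}


if __name__ == "__main__":
    example = build_sft_example(
        question="Which element has atomic number 6?",
        context="Carbon is the chemical element with symbol C and atomic number 6.",
        trace="The context states carbon has atomic number 6, so the answer is carbon.",
        answer="Carbon",
    )
    print(example["prompt"] + example["target"])
```

Under this framing, only the contents of the trace slot change across the four fine-tuning conditions; the prompt and the final answer remain fixed.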
A common but often implicit assumption behind these CoT traces is that they should be semantically meaningful and interpretable to humans. Although training with these traces is done primarily to improve LLM performance on a given task, the traces need not be semantically correct or interpretable to optimize this objective. Moreover, since these reasoning traces are exposed to the end user, they may exacerbate issues such as user distrust, misinformation, errors, and perpetuated biases [6]. This distinction is also highlighted by the recent GPT-OSS models, which generate a CoT trace, a summary, and a final answer, and show only the summary to humans rather than the raw CoT trace [17]. Recent work has challenged the first of these assumptions, namely that the traces must be semantically meaningful, by showing that both transformers and pre-trained LLMs can perform better when trained (or fine-tuned) on semantically incorrect traces paired with correct final solutions [3, 20]. In this work, we specifically ask: “Must CoT reasoning traces be interpretable to enhance LLM task performance?"
Large Language Models have significantly benefited from training with CoT traces coupled with final solutions across a variety of problems. Several studies have examined and argued for making these CoT traces more interpretable, i.e., improving their faithfulness for the end user, since the traces are believed to serve as the LLM’s explanations for its final solution [1, 22, 13, 24, 18, 14, 12, 26]. On the other hand, other work has argued that these traces do not constitute explanations for the end user [2]. Both sides of this argument stem from the assumption that the traces are indeed meant to be useful for the end user, and not just for the LLM to improve its final-solution performance on a given task. We specifically challenge this assumption in an effort to show the disconnect between the use of CoT traces for the LLM (as a training signal in SFT) and their use for the end user (as an interpretable rationale for the model’s final solution). Interpretability has been studied along multiple attributes, including predictability, comprehensibility, and faithfulness [9, 5]. These dimensions have been used to evaluate post hoc XAI methods and to develop a rigorous framework for interpretable machine learning.
SFT with R1 traces yields higher final-solution accuracy than SFT with any other trace type, with the largest performance boost seen for the Llama-3.2-1B-Instruct model. Furthermore, across all four models, SFT with the algorithmically generated, semantically correct traces and SFT with the adversarially generated incorrect traces perform the worst, trailing even SFT with summaries and explanations of R1 traces.