Do chain-of-thought traces actually help humans understand reasoning?
When models show their work through chain-of-thought traces, do humans actually find them interpretable? A human-subject study tested whether the traces that improve model performance also improve human understanding.
A common assumption behind CoT traces: they serve as explanations. The model shows its work, users can follow the reasoning, trust is established. This assumption turns out to be wrong in a specific and quantifiable way.
Empirical findings from a 100-participant human-subject study:
- R1 traces: highest final solution accuracy, lowest human interpretability ratings
- Algorithmically-generated semantically correct traces: lowest performance despite being verifiably correct
- LLM-generated summaries of R1 traces: better interpretability, intermediate performance
The traces that are most useful for the model to generate correct answers are least useful for humans trying to understand those answers. The two objectives pull in opposite directions.
The mechanism: CoT traces used for SFT are optimized to be a training signal — to push the model toward correct token sequences through backpropagation. The properties that make a trace useful for training (complex recursive structure, non-linear exploration, self-doubt and revision cycles) are exactly the properties that make it cognitively opaque to humans.
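To make the training-signal framing concrete, here is a minimal sketch of how an SFT example with a reasoning trace is typically assembled (plain Python with hypothetical token ids; the -100 ignore-index convention is standard, but the helper itself is illustrative, not taken from any specific codebase). The point it shows: trace tokens receive the same cross-entropy supervision as answer tokens, so gradient pressure shapes them toward whatever sequence reaches the correct answer, with no term for human readability.

```python
# Minimal sketch: how a reasoning trace enters SFT as a training signal.
# Token ids are made-up placeholders; only the masking pattern matters.

IGNORE_INDEX = -100  # convention: positions with this label receive no loss


def build_sft_labels(prompt_ids, trace_ids, answer_ids):
    """Assemble input ids and labels for one SFT example.

    The prompt is masked out of the loss, but every token of the reasoning
    trace is supervised exactly like the answer tokens: the trace is
    optimized as a token sequence that leads to the right answer, not as
    an explanation a human will read.
    """
    input_ids = prompt_ids + trace_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + trace_ids + answer_ids
    return input_ids, labels


# Toy usage with made-up token ids.
prompt = [101, 102, 103]
trace = [201, 202, 203, 204]   # "thinking" tokens
answer = [301, 302]

ids, labels = build_sft_labels(prompt, trace, answer)
assert labels[: len(prompt)] == [IGNORE_INDEX] * len(prompt)
assert labels[len(prompt):] == trace + answer  # trace gets the full gradient signal
```

Nothing in this objective rewards linear, human-followable structure; recursive exploration, backtracking, and revision cycles are perfectly acceptable to the loss as long as they precede the right answer.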
This has a design implication that some systems are already acting on: GPT-OSS models generate a CoT trace (for model performance), a summary (for human communication), and a final answer. The trace is not shown to users. This separation acknowledges the decoupling.
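A minimal sketch of that separation, assuming a hypothetical response structure (the field names and rendering function below are illustrative, not the actual GPT-OSS interface): the raw trace exists as an internal artifact for the model and for auditing, while only the summary and final answer are surfaced to the user.

```python
# Hypothetical sketch of the trace / summary / answer separation described
# above; field names and rendering are illustrative, not a real API.
from dataclasses import dataclass


@dataclass
class ReasoningOutput:
    trace: str    # raw chain-of-thought, kept internal for the model / auditing
    summary: str  # human-oriented recap of the reasoning
    answer: str   # final answer shown to the user


def render_for_user(out: ReasoningOutput) -> str:
    """Show only the artifacts meant for humans; the raw trace stays internal."""
    return f"{out.answer}\n\nHow I got there: {out.summary}"


out = ReasoningOutput(
    trace="Wait, let me reconsider... if x = 3 then... no, retry from the substitution...",
    summary="Solved the equation by isolating x, then checked the result by substitution.",
    answer="x = 3",
)
print(render_for_user(out))  # the trace is deliberately absent from the output
```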
The implication for AI transparency: showing users CoT traces is not showing them how the model reasons. It is showing them the model's training scaffold. What users need is a summary; what models need is the trace. Conflating the two in the name of "explainability" produces outputs that feel transparent without providing genuine interpretability.
This is a distinct claim from Do reasoning traces actually cause correct answers? — that note warns against inferring intentional reasoning from traces. This note adds: even if you don't anthropomorphize, the traces are the wrong artifact for human interpretability. The trace fails on both counts, for different reasons.
Source: Reasoning Critiques
Related concepts in this collection
- Do reasoning traces actually cause correct answers? Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors. Connection: traces are not verified reasoning and are not human-interpretable; two separate failures.
- Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations. Connection: causal faithfulness and user interpretability are both absent; neither is guaranteed by the presence of a trace.
- Why do models trust their own generated answers? Can language models reliably detect their own errors through self-evaluation? This explores whether the same process that generates answers can objectively assess their correctness. Connection: models can't evaluate their own reasoning; neither can users from raw traces.
- Does chain-of-thought reasoning reveal genuine inference or pattern matching? Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly. Connection: explains why the decoupling exists; if CoT is constrained imitation of reasoning patterns from training data, traces are optimized to continue familiar token sequences (model performance), not to explain the reasoning process to humans (interpretability).
- Does fine-tuning weaken how reasoning steps influence answers? When models are fine-tuned on domain-specific tasks, do their chain-of-thought reasoning steps actually causally drive the final answer, or do they become decorative? This matters because accurate outputs can mask unfaithful reasoning. Connection: fine-tuning exacerbates both the faithfulness and interpretability problems; if traces are already decoupled from user interpretability (this note), and fine-tuning further decouples reasoning steps from final answers (faithfulness degradation), then post-fine-tuning traces serve neither the model nor the user.
- Does supervised fine-tuning improve reasoning or just answers? Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable. Connection: the SFT accuracy trap creates the conditions for the performance-interpretability decoupling; accuracy optimization selects for traces that drive correct outputs rather than traces that explain reasoning, directly producing the divergence documented here.
- Can model explanations help humans predict what models actually do? Do explanations that sound plausible to humans actually help them forecast model behavior on new cases? Understanding this gap matters because RLHF optimizes for plausible explanations, not predictive ones. Connection: provides metric-level evidence for this decoupling; explanation precision (can users predict model behavior from explanations?) is uncorrelated with plausibility (do explanations look good?), confirming that RLHF-style optimization improves appearance without improving functional utility.
Original note title: cot traces optimize model performance, not user interpretability — the two objectives are decoupled