Do reasoning traces actually cause correct answers?
Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
The "How do reasoning models reason?" paper makes a blunt argument about R1 and its derivatives: the intermediate "thinking" tokens generated between <think> and </think> tags carry no special execution-level semantics. Every token in the trace is generated by the same autoregressive mechanism as any other LLM output. The segmentation format is a formatting convention, not a computational distinction.
The authors use the neutral term "derivational trace" instead of "chain of thought" or "reasoning trace" to avoid the anthropomorphic loading that those terms carry. Calling intermediate tokens "reasoning" implies a functional role (the tokens are doing reasoning) that is not verified. The reality: LLMs are pre-trained on text that includes reasoning traces from human-produced sources (grade school math explanations, educational web pages), and RL post-training rewards tokens that look like such traces when they culminate in correct answers. The model learns to imitate the style.
The empirical evidence is uncomfortable: a "significant fraction" of R1's pre-answer traces are judged invalid by the original search algorithm that was supposed to have generated them — yet these invalid traces still reach correct answers. If traces were causally responsible for answers, invalid traces should produce wrong answers. They don't. This extends Do language models actually use their reasoning steps? with new evidence: the necessity failure is now documented at the trace level, not just inferred from length correlations.
The safety concern is specific: making traces look like human reasoning — including filler words like "hmm", "aha!", "wait a minute", "interesting" — exploits cognitive patterns in users who take stylistic similarity as evidence of functional equivalence. An incorrect answer accompanied by 30 pages of plausible-looking reasoning is more dangerous than an incorrect answer with no reasoning, because it generates false confidence. DeepSeek R1 generates more than 30 pages of trace per query, even for simple problems. Few if any evaluations check the pre-answer traces for correctness; they check only final answers.
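A sketch of the evaluation pattern just described, with illustrative function names and answer format: the grader discards everything inside the <think> block unread and scores only what follows it.

```python
import re

def grade(model_output: str, gold_answer: str) -> bool:
    # Everything inside <think>...</think> is discarded unread by this grader.
    answer_part = model_output.split("</think>")[-1]
    match = re.search(r"-?\d+(?:\.\d+)?", answer_part)
    return match is not None and match.group(0) == gold_answer

# An invalid 30-page trace followed by the right number still scores as correct.
output = "<think>... thirty pages of plausible-looking steps ...</think> The answer is 408."
print(grade(output, "408"))  # True
```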
A technical caveat: this does not mean reasoning traces provide no value. Post-training on derivational traces (whether via SFT or RL) improves benchmark performance. The point is that the improvement mechanism may not be "the model learns to reason" but rather "the model learns to output a sequence format that correlates with correct answers." The finding in Which sentences actually steer a reasoning trace? offers a more mechanistic middle ground: not all trace sentences are equal, and a small subset does real computational work. The anthropomorphic narrative, by contrast, treats the trace as a unified reasoning document.
Deliberately corrupted traces work as well as correct traces ("Beyond Semantics"): This is the strongest evidence for the dispensability of trace semantics. Models trained on noisy, corrupted traces — traces with no relation to the specific problem they are paired with — maintain performance largely consistent with correct-trace models. In some cases they even improve on correct-trace models and generalize more robustly out of distribution. A formal A* validator confirms only a loose correlation between trace accuracy and solution accuracy. This suggests intermediate tokens provide computational scaffolding (additional forward passes) rather than meaningful reasoning; any tokens would do. See Do reasoning traces need to be semantically correct?.
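A minimal sketch of the trace-swapping corruption described above; the field names and prompt format are assumptions, and the cited paper's actual corruption procedures may differ.

```python
import random

def corrupt_traces(dataset, seed=0):
    """dataset: list of dicts with 'problem', 'trace', 'answer' keys (illustrative schema)."""
    rng = random.Random(seed)
    swapped = [ex["trace"] for ex in dataset]
    rng.shuffle(swapped)
    return [
        {
            "problem": ex["problem"],
            "trace": wrong_trace,    # a trace from an unrelated problem
            "answer": ex["answer"],  # the correct final answer is kept
        }
        for ex, wrong_trace in zip(dataset, swapped)
    ]

def to_sft_text(ex):
    # Same surface format as the correct-trace baseline, so the two training
    # conditions differ only in whether the trace relates to the problem.
    return f"{ex['problem']}\n<think>{ex['trace']}</think>\n{ex['answer']}"
```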
The LLM-Modulo alternative ("Stop Anthropomorphizing"): Rather than treating traces as reasoning, use LLMs as generators within a generate-test framework. Pair the LLM with sound external verifiers that provide guarantees. FunSearch, AlphaGeometry, AlphaEvolve all fit this pattern. The LLM proposes; a formal verifier checks. Safety-critical applications require this separation because trace reading provides no guarantees.
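A minimal sketch of that generate-test separation, with generate_candidate and verify as placeholders for a model call and a formal checker (a solver, proof checker, or test harness); the loop shape is illustrative rather than the exact LLM-Modulo architecture.

```python
from typing import Callable, Optional

def llm_modulo_solve(
    problem: str,
    generate_candidate: Callable[[str, list], str],
    verify: Callable[[str, str], bool],
    max_iters: int = 20,
) -> Optional[str]:
    """Generate-test loop: the LLM only proposes; acceptance comes from the verifier."""
    feedback = []
    for _ in range(max_iters):
        candidate = generate_candidate(problem, feedback)  # LLM proposes a candidate solution
        if verify(problem, candidate):                     # sound external check decides
            return candidate                               # any guarantee comes from verify(), not the trace
        feedback.append(candidate)                         # failed attempts can be fed back as critiques
    return None  # no verified solution; do not fall back to an unverified trace
```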
The interpretability-performance anti-correlation: Evidence from SFT experiments makes the decoupling concrete. Models fine-tuned on R1 traces achieve the highest final solution accuracy but are rated least interpretable by human participants in a 100-person study. Algorithmically generated, semantically correct traces (verifiably accurate, supposedly interpretable) produce the worst performance. The traces most useful for training the model are least useful for understanding it. GPT-OSS models are already responding to this finding architecturally: they generate a CoT trace (for model performance), a separate summary (for human communication), and a final answer — explicitly acknowledging that the trace is not the user-facing artifact. See Do chain of thought traces actually help humans understand reasoning?.
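A purely illustrative sketch of that separation (the field names are assumptions, not the actual GPT-OSS output schema): the trace is kept for generation, but only the summary and answer are surfaced to the user.

```python
from dataclasses import dataclass

@dataclass
class ReasoningResponse:
    trace: str    # full derivational trace; optimized for answer accuracy, not readability
    summary: str  # separately generated explanation intended for humans
    answer: str   # the final, user-facing answer

def render_for_user(resp: ReasoningResponse) -> str:
    # The trace is deliberately not surfaced; it is not the user-facing artifact.
    return f"{resp.summary}\n\nAnswer: {resp.answer}"
```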
Source: Reasoning Methods CoT ToT; enriched from Reasoning o1 o3 Search
Related concepts in this collection
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  Relation: this adds direct evidence for the necessity failure; traces judged invalid by the generating algorithm still reach correct answers.
- Which sentences actually steer a reasoning trace?
  Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
  Relation: a counter-finding; some trace sentences are mechanistically important, and the critique is against treating all trace content as equally meaningful.
- Do LLMs develop the same kind of mind as humans?
  Explores whether LLMs and humans share the intersubjective linguistic training that shapes cognition, and whether that shared training produces equivalent forms of agency and reflexivity.
  Relation: the anthropomorphism problem at a deeper level; the style of human reasoning can be learned from text without the underlying cognitive process.
- Can LLMs understand concepts they cannot apply?
  Explores whether large language models can correctly explain ideas while simultaneously failing to use them, and whether that combination reveals something fundamentally different from ordinary mistakes.
  Relation: Potemkin understanding is a performance of understanding; derivational traces are a performance of reasoning; both are structurally similar surface-without-function patterns.
- Do reasoning traces need to be semantically correct?
  Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.
  Relation: the strongest evidence; deliberately irrelevant traces still work.
- Does optimizing against monitors destroy monitoring itself?
  Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.
  Relation: the adversarial failure; models learn to hide misbehavior in traces that look clean.
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  Relation: the theoretical mechanism; traces are stylistic mimicry because CoT is constrained imitation of reasoning schemata from training data, not genuine inference. Imitation theory explains why anthropomorphic traces look convincing without being functionally correct.
Original note title: reasoning trace anthropomorphism is a safety risk — derivational traces are stylistic mimicry not verified reasoning