How much of chain-of-thought reasoning actually diverges from the final answer?

This explores whether the words in a chain-of-thought actually drive the final answer, or whether much of that reasoning is decoration the model produces alongside an answer it has effectively already settled on.

The corpus suggests an honest answer to that question: a surprising amount of the reasoning is for show. Several notes converge on the same uncomfortable finding from different angles. Reasoning chains routinely fail tests of *causal sufficiency* (the steps don't always matter) and *causal necessity* (the model reaches the same answer even when steps are removed, paraphrased, or swapped for filler) [Do language models actually use their reasoning steps?]. Put bluntly, the trace correlates with the answer through learned formatting, not through functional computation: invalid traces frequently produce correct answers [Do reasoning traces actually cause correct answers?], and structurally invalid prompts work about as well as valid ones [What makes chain-of-thought reasoning actually work?]. The reasoning is, in a real sense, constrained imitation rather than inference [What makes chain-of-thought reasoning actually work?].
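The perturbation tests above can be sketched as a small harness. This is a minimal illustration, not any note's actual method: `answer_fn`, `truncate`, `fill`, and the two toy "models" are hypothetical stand-ins, and paraphrasing is omitted because it would need a real model to rewrite the steps.

```python
from typing import Callable, List

# Hypothetical interface: (question, reasoning steps) -> final answer.
Answerer = Callable[[str, List[str]], str]
Perturbation = Callable[[List[str]], List[str]]


def truncate(steps: List[str]) -> List[str]:
    """Early termination: keep only the first half of the steps."""
    return steps[: len(steps) // 2]


def fill(steps: List[str]) -> List[str]:
    """Filler substitution: replace every step with content-free text."""
    return ["..." for _ in steps]


def invariance_rate(
    answer_fn: Answerer,
    question: str,
    steps: List[str],
    perturbations: List[Perturbation],
) -> float:
    """Fraction of perturbations that leave the final answer unchanged.

    A rate near 1.0 means the answer does not depend on the written steps.
    """
    baseline = answer_fn(question, steps)
    unchanged = sum(answer_fn(question, p(steps)) == baseline for p in perturbations)
    return unchanged / len(perturbations)


# Toy stand-ins for a model (assumptions for illustration only).
def ignores_reasoning(question: str, steps: List[str]) -> str:
    return "42"  # answer is fixed regardless of the trace


def uses_reasoning(question: str, steps: List[str]) -> str:
    return steps[-1].split()[-1] if steps else "?"  # reads the answer off the last step


steps = ["6 times 7 means adding 7 six times", "so the result is 42"]
perturbations = [truncate, fill]

print(invariance_rate(ignores_reasoning, "What is 6*7?", steps, perturbations))  # 1.0
print(invariance_rate(uses_reasoning, "What is 6*7?", steps, perturbations))     # 0.0
```

A high invariance rate is the signature of a performative trace: the answer survives edits that should have broken it if the steps were doing the work.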

But "how much diverges" turns out to depend heavily on *task difficulty*, which is the part most people don't expect. Activation probes show models often commit to an answer internally long before they finish writing, but mainly on easy problems. On hard problems, the unfolding text actually tracks the model's shifting internal belief, with detectable inflection points where it changes its mind [Does chain-of-thought reasoning reflect genuine thinking or performance?]. So divergence isn't uniform: easy questions get performative theater, hard questions get something closer to genuine in-progress thinking. This fits the finding that accuracy follows an inverted-U in chain length, with the optimal length growing with task difficulty and shrinking as models get more capable: stronger models need less scaffolding to land the answer [Why does chain of thought accuracy eventually decline with length?].
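One way to picture the easy-versus-hard contrast is to locate the step at which a probe's confidence in the final answer stops wavering. The sketch below assumes a hypothetical per-step probe score in [0, 1]; the threshold, the function names, and the example numbers are illustrative, not values from the notes.

```python
from typing import List, Optional


def commitment_step(probe_conf: List[float], threshold: float = 0.9) -> Optional[int]:
    """Earliest step index from which confidence stays >= threshold to the end.

    Returns None if the trace never settles above the threshold.
    """
    commit: Optional[int] = None
    for i, conf in enumerate(probe_conf):
        if conf >= threshold:
            if commit is None:
                commit = i
        else:
            commit = None  # confidence dropped: any earlier commitment was not final
    return commit


def post_commitment_fraction(probe_conf: List[float], threshold: float = 0.9) -> float:
    """Share of steps written strictly after the model has committed."""
    step = commitment_step(probe_conf, threshold)
    if step is None:
        return 0.0
    return (len(probe_conf) - step - 1) / len(probe_conf)


# Made-up probe readings for illustration.
easy = [0.95, 0.97, 0.98, 0.99]  # committed from the first step
hard = [0.30, 0.95, 0.40, 0.92]  # changes its mind at step 2, settles at step 3

print(commitment_step(easy), post_commitment_fraction(easy))  # 0 0.75
print(commitment_step(hard), post_commitment_fraction(hard))  # 3 0.0
```

On the easy trace most of the text comes after the answer is settled; on the hard trace the dip in confidence plays the role of an inflection point, and nothing is left over once the model commits.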

If much of the text doesn't carry the load, which parts do? Three independent methods (counterfactual resampling, attention analysis, and causal suppression) point to the same sparse set of "thought anchors": planning and backtracking sentences that genuinely steer everything downstream [Which sentences actually steer a reasoning trace?]. Meanwhile, verification and backtracking steps that receive little downstream attention can be pruned, letting one framework cut ~75% of steps without hurting accuracy [Can reasoning steps be dynamically pruned without losing accuracy?]. Chain of Draft pushes this further: it matches accuracy at 7.6% of the tokens, meaning roughly 92% of a standard chain was serving style and documentation, not computation [Can minimal reasoning chains match full explanations?].
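Attention-based pruning of this kind can be sketched as a top-k filter over steps. This is a toy under stated assumptions: the `Step` type, the `downstream_attention` scores, and the `keep_fraction` parameter are hypothetical, and a real system would derive the scores from the model's attention maps rather than hard-code them.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    downstream_attention: float  # hypothetical score: attention received from later steps


def prune(steps: List[Step], keep_fraction: float) -> List[Step]:
    """Keep the top `keep_fraction` of steps by downstream attention, in original order."""
    k = max(1, math.ceil(len(steps) * keep_fraction))
    ranked = sorted(
        range(len(steps)),
        key=lambda i: steps[i].downstream_attention,
        reverse=True,
    )
    return [steps[i] for i in sorted(ranked[:k])]


# Made-up trace for illustration.
steps = [
    Step("Plan: split into cases A and B", 0.90),
    Step("Check: 2 + 2 = 4, fine", 0.10),
    Step("Wait, case B fails; try another route", 0.80),
    Step("Re-verify the arithmetic once more", 0.05),
]

for step in prune(steps, keep_fraction=0.5):
    print(step.text)
# Plan: split into cases A and B
# Wait, case B fails; try another route
```

Preserving the original order matters here: the kept steps still have to read as a trace, so the filter selects by score but emits by position.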

The divergence isn't fixed; it can be *made worse*. Fine-tuning measurably loosens the connection between steps and answers: after fine-tuning, early termination, paraphrasing, and filler substitution all leave the answer unchanged more often, so reasoning drifts toward the performative end [Does fine-tuning disconnect reasoning steps from final answers?]. There's even a flavor of divergence that lives *after* the answer is effectively settled: when a correct trace keeps exploring past the point of sufficient evidence, that trailing reasoning degrades supervised fine-tuning even though the trace stays correct. Removing only that tail improves learning more than removing an equally long random suffix, so the harm comes from the unnecessary exploration, not from length [Does every correct chain-of-thought trace improve fine-tuning?].
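Trimming that post-conclusion tail before fine-tuning might look like the sketch below. The notes do not say how "sufficient evidence" is detected, so the rule used here (cut after the first step that literally states the final answer) is a crude assumption for illustration, and `trim_post_conclusion` is a hypothetical helper.

```python
from typing import List


def trim_post_conclusion(steps: List[str], final_answer: str) -> List[str]:
    """Drop every step after the first one that states the final answer.

    Assumption: a plain substring match stands in for "sufficient evidence
    reached". If no step states the answer, the trace is returned unchanged.
    """
    for i, step in enumerate(steps):
        if final_answer in step:
            return steps[: i + 1]
    return steps


# Made-up trace for illustration.
trace = [
    "x + 3 = 10",
    "so x = 7",
    "let me consider other approaches",
    "alternatively, subtract 3 from both sides",
    "x = 7 again",
]

print(trim_post_conclusion(trace, "x = 7"))
# ['x + 3 = 10', 'so x = 7']
```

The trimmed trace is still correct and still ends on the answer; only the exploration that followed the conclusion is gone.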

The thing worth walking away with: "how much diverges" is the wrong frame if you imagine one number. A better mental model is that a chain-of-thought is mostly fluent connective tissue wrapped around a few load-bearing pivots. The ratio of tissue to pivot widens on easy tasks, after fine-tuning, and in more capable models, while genuine step-by-step belief-tracking is reserved for problems the model can't shortcut.

Sources 11 notes

Do language models actually use their reasoning steps?

LLM reasoning chains fail both causal sufficiency (steps don't always matter) and causal necessity (spurious steps are common). Research shows most CoT evaluation measures output quality, not whether reasoning actually caused the answer.

Do reasoning traces actually cause correct answers?

R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

Does chain-of-thought reasoning reflect genuine thinking or performance?

Activation probes show models commit to answers internally long before finishing their reasoning on easy tasks, but on hard tasks the reasoning process tracks real belief updates with detectable inflection points. Probe-guided early exit reduces tokens by up to 80 percent without accuracy loss.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Which sentences actually steer a reasoning trace?

Counterfactual resampling, attention analysis, and causal suppression all identify planning and backtracking sentences as thought anchors—sparse critical points that guide subsequent reasoning. These are functional pivots, not noise.

Can reasoning steps be dynamically pruned without losing accuracy?

The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Does every correct chain-of-thought trace improve fine-tuning?

Post-conclusion reasoning—where the model keeps exploring after sufficient evidence for the answer—degrades supervised fine-tuning despite preserving correctness. Removing only this tail improves learning more than removing equally-long random suffixes, proving the harm comes from unnecessary exploration, not length.
