Does CoT reasoning actually cause the outputs that follow it?

This explores whether the chain-of-thought (CoT) text a model writes before its answer actually *produces* that answer — or whether it's a stylistic byproduct that correlates with the answer without driving it.

The corpus leans hard toward the second reading: the trace is more correlation than cause. The sharpest version comes from work showing that a model's intermediate tokens carry no special execution semantics; they are generated the same way as any other output, and invalid traces frequently produce correct answers (see "Do reasoning traces actually cause correct answers?"). If broken reasoning still lands the right result, the validity of the reasoning can't be what's doing the work. That logic is echoed at the prompt level: on BIG-Bench Hard, logically invalid CoT exemplars perform almost as well as valid ones, so it's the *form* of stepping through a problem, not the validity of the steps, that moves the needle (see "Does logical validity actually drive chain-of-thought gains?").
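The "broken reasoning, right answer" argument is an intervention test: hold the question fixed, corrupt the trace, and check whether the answer moves. A minimal sketch of that test follows. `trace_sensitivity`, `ignores_trace`, and `follows_trace` are hypothetical names for illustration; the two toy functions stand in for a real model call and are not taken from any of the cited notes.

```python
from typing import Callable, Sequence


def trace_sensitivity(
    answer_fn: Callable[[str, str], str],
    question: str,
    trace: str,
    corrupted_traces: Sequence[str],
) -> float:
    """Fraction of corrupted traces that change the final answer.

    0.0 means the answer never depends on the trace content;
    1.0 means every corruption flips the answer.
    """
    if not corrupted_traces:
        return 0.0
    baseline = answer_fn(question, trace)
    changed = sum(answer_fn(question, c) != baseline for c in corrupted_traces)
    return changed / len(corrupted_traces)


# Toy stand-ins for a model (assumptions, not real model behavior).
def ignores_trace(question: str, trace: str) -> str:
    """Answers from the question alone; the trace is decoration."""
    return "12"


def follows_trace(question: str, trace: str) -> str:
    """Copies whatever the last token of the trace says."""
    return trace.strip().split()[-1]


if __name__ == "__main__":
    q, good = "What is 3*4?", "3*4 = 12"
    broken = ["3*4 = 13", "3*4 = 7"]
    print(trace_sensitivity(ignores_trace, q, good, broken))  # 0.0
    print(trace_sensitivity(follows_trace, q, good, broken))  # 1.0
```

A score near zero on a real model would match the corpus's reading that the trace is narration; a score near one would argue the opposite.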

The deeper surprise is *where* the real computation happens. Logit-lens analysis finds that models trained with hidden CoT tokens can compute the correct answer in their earliest layers (layers 1-3), then actively suppress that representation in the final layers to emit format-compliant filler. The visible 'reasoning' tokens can therefore be downstream of a conclusion the model already reached internally (see "Do transformers hide reasoning before producing filler tokens?"). So the causal story is sometimes backwards: the answer shapes the trace as much as the trace shapes the answer. This also reframes 'reflection.' Across eight reasoning models, the reflective steps rarely change the initial answer and the traces don't faithfully represent the underlying process, so reflection reads as confirmatory theater rather than a control knob on the output (see "Can we actually trust reasoning model outputs?").
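For readers unfamiliar with the logit lens, the idea is to project each layer's hidden state through the unembedding matrix and read off which token that layer would predict. The sketch below uses NumPy on a hand-built toy, not a real transformer: the arrays `TOY_HIDDEN` and `TOY_UNEMBED` are invented so that an "answer" token leads in early layers and a "filler" token overtakes it at the end. Real logit-lens code usually also applies the model's final layer norm, which is omitted here.

```python
import numpy as np


def logit_lens_top_tokens(hidden_states: np.ndarray, unembed: np.ndarray) -> np.ndarray:
    """Top predicted token id at each layer.

    hidden_states: (n_layers, d_model), one residual vector per layer.
    unembed:       (d_model, vocab_size) output projection.
    """
    logits = hidden_states @ unembed
    return logits.argmax(axis=-1)


def token_rank_per_layer(hidden_states: np.ndarray, unembed: np.ndarray, token_id: int) -> np.ndarray:
    """Rank of `token_id` at each layer (0 = top prediction)."""
    logits = hidden_states @ unembed
    order = np.argsort(-logits, axis=-1)  # token ids sorted by descending logit
    return np.argmax(order == token_id, axis=-1)


# Toy data (assumption): vocab is [0 = answer, 1 = filler, 2 = other].
TOY_UNEMBED = np.eye(3)
TOY_HIDDEN = np.array([
    [2.0, 0.0, 0.0],  # layer 0: answer token dominates
    [2.0, 1.0, 0.0],  # layer 1: answer still on top
    [1.0, 3.0, 0.0],  # layer 2: filler overtakes, answer drops to rank 1
])

if __name__ == "__main__":
    print(logit_lens_top_tokens(TOY_HIDDEN, TOY_UNEMBED))    # [0 0 1]
    print(token_rank_per_layer(TOY_HIDDEN, TOY_UNEMBED, 0))  # [0 0 1]
```

In the toy, the answer token is still the second-ranked prediction at the last layer, which mirrors the note's claim that the reasoning stays recoverable from lower-ranked predictions.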

But 'no causal role' would overstate it. A shift-cipher decomposition shows CoT performance splits into three independent factors: raw output probability, memorization, and a genuine step-by-step reasoning component that really does compute, just noisily, accumulating error with each step (see "What three separate factors drive chain-of-thought performance?"). There *is* a causal channel; it's simply weak and easily swamped by probability and memorization. The picture that emerges across these notes is that CoT mostly mimics the *shape* of reasoning learned from training rather than performing fresh inference, which is why it degrades predictably when you push it off-distribution (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "Does chain-of-thought reasoning actually generalize beyond training data?").
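To make "accumulating error with each step" concrete, here is a small sketch of a shift-cipher decoder plus a back-of-envelope error model. Two things are assumptions rather than claims from the cited study: that a "step" corresponds to one unit of work such as one letter, and that step errors are independent, which gives the simple `p ** n` form. The function names are illustrative.

```python
def shift_decode(text: str, shift: int) -> str:
    """Decode a shift (Caesar) cipher by moving each ASCII letter back by `shift`."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            out.append(ch)  # leave spaces, digits, punctuation untouched
    return "".join(out)


def chain_accuracy(per_step_accuracy: float, n_steps: int) -> float:
    """Probability that all n steps are right, assuming independent step errors."""
    return per_step_accuracy ** n_steps


if __name__ == "__main__":
    print(shift_decode("uryyb", 13))  # hello
    for n in (1, 5, 10, 20):
        print(n, round(chain_accuracy(0.95, n), 3))
```

Under these assumptions, a reasoner that is 95% reliable per step is right on a 20-step chain well under half the time, which is one way a real but noisy causal channel can end up swamped by probability and memorization.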

What 'shape' means concretely is striking: controlled ablations show models tolerate 50% corrupted *numbers* with only a 3.2% accuracy loss, but shuffling the *order* of steps costs 13.3%, roughly four times as much. What distills is the logical architecture, the sequencing, not the content (see "What do models actually learn from chain-of-thought training?"). That structural-coherence finding is the connective tissue under the whole question: the trace influences the output through its scaffolding (does this look like a well-ordered derivation?), not through the truth of any individual line. It's also why a 1.5B model with LoRA-only post-training can match larger full-parameter RL-trained models by learning output *format* alone: reasoning and knowledge turn out to be separable, and CoT lives more on the format side (see "Can small models reason well by just learning output format?").
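The two ablations contrasted above can be sketched as perturbation functions over a list of reasoning steps. This is an illustration of the idea only: the cited note does not describe how the original study corrupted numbers or shuffled steps, so the regex, the "add 1 to 9" corruption rule, and the function names are all assumptions.

```python
import random
import re

_NUMBER = re.compile(r"\d+")


def corrupt_numbers(steps: list[str], fraction: float, rng: random.Random) -> list[str]:
    """Change about `fraction` of the numeric tokens; keep wording and step order."""
    total = sum(len(_NUMBER.findall(s)) for s in steps)
    targets = set(rng.sample(range(total), round(fraction * total)))
    seen = 0  # running index over numeric tokens across all steps

    def swap(match: re.Match) -> str:
        nonlocal seen
        idx, seen = seen, seen + 1
        if idx not in targets:
            return match.group()
        # Adding 1..9 guarantees the corrupted value differs from the original.
        return str(int(match.group()) + rng.randint(1, 9))

    return [_NUMBER.sub(swap, s) for s in steps]


def shuffle_steps(steps: list[str], rng: random.Random) -> list[str]:
    """Keep every step's content intact but randomize the order."""
    shuffled = list(steps)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    trace = ["3 + 4 = 7", "7 * 2 = 14", "14 - 4 = 10"]
    rng = random.Random(0)
    print(corrupt_numbers(trace, 0.5, rng))  # same structure, wrong values
    print(shuffle_steps(trace, rng))         # same values, scrambled structure
```

The first perturbation damages content while preserving scaffolding; the second does the reverse, which is the contrast the 3.2% versus 13.3% figures are measuring.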

One consequence is easy to miss: the unfaithfulness of traces isn't just a curiosity, it's a safety problem. If the written reasoning isn't what caused the answer, then reading the trace to audit *why* a model did something means reading a plausible story it tells after the fact, and monitors built on that assumption are easy to game (see "Do reasoning traces actually cause correct answers?" and "Can we actually trust reasoning model outputs?"). If you want to go deeper on the one place the corpus argues the reasoning activation is *genuine* even when benchmarks are contaminated, the RLVR work is the counterweight to read next (see "Can genuine reasoning activation coexist with contaminated benchmarks?").

Sources (10 notes)

Do reasoning traces actually cause correct answers?

R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, indicating that valid traces are not causally necessary; traces correlate with answers via learned formatting, not functional reasoning.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

What do models actually learn from chain-of-thought training?

Controlled ablations show models tolerate 50% corrupted numbers (3.2% accuracy loss) but fail under step shuffling (13.3% loss). What distills across reasoning demonstrations is logical architecture—how steps sequence and connect—not factual accuracy.

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
