How does chain-of-thought reasoning become decorative after domain-specific fine-tuning?

This explores why a model's step-by-step reasoning can keep its look and shape while no longer driving the answer after it's been fine-tuned on a narrow domain — the words become a show, not the cause.

Chain-of-thought (CoT) reasoning is decorative when it is present in form but absent in function, and domain-specific fine-tuning can push a model toward that state. The most direct evidence comes from faithfulness testing: when you fine-tune a model and then cut its reasoning off early, paraphrase it, or swap in filler text, the final answer stays the same more often than it did before fine-tuning. That invariance is the tell. The reasoning chain is still printed, but it no longer causally shapes the output (see: Does fine-tuning disconnect reasoning steps from final answers?). Strikingly, accuracy doesn't have to drop for this to happen: the model can get more answers right while its explanation means less.
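The perturbation tests described above can be sketched as an invariance measurement. This is a minimal illustration, not the procedure from the cited note: `answer_fn` is a hypothetical callable standing in for a real model call, the two toy models exist only to show the two extremes, and paraphrasing is left out because it needs an external paraphraser that would be plugged in as another `perturb` function.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption, not a real library API):
# takes a question plus a list of reasoning steps, returns the final answer.
AnswerFn = Callable[[str, List[str]], str]
Perturb = Callable[[List[str]], List[str]]


def truncate(steps: List[str], keep_fraction: float = 0.5) -> List[str]:
    """Early termination: keep only the first part of the chain."""
    keep = max(1, int(len(steps) * keep_fraction))
    return steps[:keep]


def filler(steps: List[str], token: str = "...") -> List[str]:
    """Filler substitution: same length per step, no content."""
    return [" ".join([token] * len(step.split())) for step in steps]


def invariance_rate(
    answer_fn: AnswerFn,
    examples: List[Tuple[str, List[str]]],
    perturb: Perturb,
) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the chain."""
    unchanged = 0
    for question, steps in examples:
        original = answer_fn(question, steps)
        perturbed = answer_fn(question, perturb(steps))
        unchanged += int(original == perturbed)
    return unchanged / len(examples)


def chain_reader(question: str, steps: List[str]) -> str:
    """Toy model whose answer is read off the last reasoning step."""
    return steps[-1].split()[-1]


def chain_ignorer(question: str, steps: List[str]) -> str:
    """Toy model whose answer depends only on the question."""
    return str(len(question))


examples = [
    ("2+3?", ["2 plus 3", "equals 5"]),
    ("4+4?", ["4 plus 4", "equals 8"]),
]

reader_rate = invariance_rate(chain_reader, examples, truncate)
ignorer_rate = invariance_rate(chain_ignorer, examples, truncate)
```

A higher invariance rate means the chain matters less to the answer. Comparing the rate before and after fine-tuning, on the same examples and perturbations, is what would expose a shift toward decorative reasoning.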

Why would narrowing a model do this? Because CoT was probably never doing the work we imagined. Several notes converge on the same uncomfortable picture: chain-of-thought is constrained imitation of reasoning's *form*, not genuine inference (see: Does chain-of-thought reasoning reveal genuine inference or pattern matching?; What makes chain-of-thought reasoning actually work?). Training format shapes the reasoning strategy roughly 7.5× more than the domain does, and even logically invalid reasoning prompts work about as well as valid ones (see: What makes chain-of-thought reasoning actually work?). Fine-tuning on a domain hardens this: it teaches the model a reliable shortcut to the answer for that distribution, so the visible 'thinking' collapses into a learned schema the model can fill in without actually working through the steps. The reasoning becomes a familiar costume the answer wears.

The corpus suggests this isn't a clean break but a slide along a gradient that already exists before fine-tuning. Reasoning is performative on easy tasks, where models commit to an answer internally well before the chain finishes, but genuine on hard ones, where the text tracks real belief updates (see: Does chain-of-thought reasoning reflect genuine thinking or performance?). Domain fine-tuning effectively *makes more tasks easy* for that model, pushing more of its reasoning into the performative zone. A related lens: most reasoning steps are low-value even in a working model. Attention maps show that verification and backtracking steps receive minimal downstream attention, and you can prune ~75% of steps without losing accuracy (see: Can reasoning steps be dynamically pruned without losing accuracy?). Other work strips reasoning to 7.6% of its tokens at equal accuracy; the removed 92.4% served style and documentation, not computation (see: Can minimal reasoning chains match full explanations?). If most of the chain was decoration to begin with, fine-tuning just removes the load-bearing remainder.
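The attention-based pruning mentioned above can be illustrated with a small selection routine. This is a sketch under stated assumptions, not the cited framework's actual algorithm: the per-step attention scores are assumed to be computed elsewhere, and `prune_steps` and `keep_fraction` are hypothetical names chosen for illustration. The default of 0.25 simply mirrors the "prune ~75%" figure.

```python
from typing import List, Sequence


def prune_steps(
    steps: Sequence[str],
    attention: Sequence[float],
    keep_fraction: float = 0.25,
) -> List[str]:
    """Keep the highest-attention steps, preserving their original order.

    `attention[i]` is assumed to be how much later tokens attend to step i.
    """
    if len(steps) != len(attention):
        raise ValueError("steps and attention must have the same length")
    if not steps:
        return []
    # Always keep at least one step so the chain is never empty.
    k = max(1, round(len(steps) * keep_fraction))
    # Indices of the k highest-scoring steps.
    top = sorted(range(len(steps)), key=lambda i: attention[i], reverse=True)[:k]
    # Re-sort by position so the surviving chain still reads in order.
    return [steps[i] for i in sorted(top)]


chain = ["restate problem", "compute 12 * 7", "double-check", "answer is 84"]
scores = [0.10, 0.90, 0.05, 0.60]
pruned = prune_steps(chain, scores, keep_fraction=0.5)
```

In this toy input the low-scoring restatement and verification steps are dropped, leaving the computation and the answer. Whether accuracy survives the pruning has to be checked separately by re-running the model on the shortened chain.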

The deeper risk surfaces when you ask what the chain was *for*. The faithfulness literature warns that genuine reasoning needs both causal sufficiency (the steps matter) and causal necessity (no spurious steps), and current models routinely fail both (see: Do language models actually use their reasoning steps?). Distribution-shift experiments show CoT degrades predictably the moment you leave the training neighborhood, producing fluent but logically inconsistent text: imitation without underlying logic (see: Does chain-of-thought reasoning actually generalize beyond training data?; Why does chain-of-thought reasoning fail in predictable ways?). Domain fine-tuning narrows that safe neighborhood while making the model sound *more* confident inside it. So the decorative chain isn't just useless; it's a misleading audit trail that reads like the model's actual process when it isn't.
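One rough way to probe whether individual steps matter, and which ones are spurious, is leave-one-out ablation: remove each step in turn and check whether the answer changes. The sketch below is only one possible operationalization, not the method used in the cited literature. `answer_fn` is again a hypothetical stand-in for a model call, and `toy_sum_model` is a made-up model used purely to make the example runnable.

```python
import re
from typing import Callable, List

# Hypothetical interface (an assumption): question + steps -> final answer.
AnswerFn = Callable[[str, List[str]], str]


def step_effects(answer_fn: AnswerFn, question: str, steps: List[str]) -> List[bool]:
    """For each step, report whether deleting it changes the final answer."""
    baseline = answer_fn(question, steps)
    effects = []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]
        effects.append(answer_fn(question, ablated) != baseline)
    return effects


def toy_sum_model(question: str, steps: List[str]) -> str:
    """Toy model: the answer is the sum of all integers mentioned in the steps."""
    total = sum(int(n) for step in steps for n in re.findall(r"\d+", step))
    return str(total)


steps = ["add 2", "note that the sky is blue", "add 3"]
effects = step_effects(toy_sum_model, "What is 2 + 3?", steps)
spurious = [step for step, matters in zip(steps, effects) if not matters]
```

A chain in which every entry of `effects` is `False` is decorative under this test, and any `False` entry marks a step that looks spurious. Single-step deletion can miss redundancy, where two steps each suffice alone, so a real audit would need more than this.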

There's a hopeful counter-thread worth knowing about. Optimal reasoning length follows an inverted U, and more capable models naturally prefer shorter chains: simplicity emerges from reward signals as the model improves (see: Why does chain of thought accuracy eventually decline with length?). From that angle, a shrinking, less load-bearing chain after fine-tuning can be a sign of competence, not decay, because the model needs less scaffolding. And some research tries to plant reasoning earlier, treating CoT as an exploratory action rewarded during pretraining rather than bolted on afterward, which lifts genuine reasoning ~19% (see: Can chain-of-thought reasoning be learned during pretraining itself?). The unsettling takeaway: 'decorative' and 'efficient' can look identical from the outside. The only way to tell them apart is to test whether the reasoning still *causes* the answer, which is exactly the test fine-tuned models tend to fail.

Sources 12 notes

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

Does chain-of-thought reasoning reflect genuine thinking or performance?

Activation probes show models commit to answers internally long before finishing their reasoning on easy tasks, but on hard tasks the reasoning process tracks real belief updates with detectable inflection points. Probe-guided early exit reduces tokens by up to 80 percent without accuracy loss.

Can reasoning steps be dynamically pruned without losing accuracy?

The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Do language models actually use their reasoning steps?

LLM reasoning chains fail both causal sufficiency (steps don't always matter) and causal necessity (spurious steps are common). Research shows most CoT evaluation measures output quality, not whether reasoning actually caused the answer.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
