INQUIRING LINE

Why does fine-tuning sometimes damage chain-of-thought reasoning even when accuracy improves?

This explores a specific paradox: fine-tuning can lift benchmark accuracy while quietly hollowing out the reasoning that's supposed to produce those answers — so the model gets the right answer for the wrong reasons.


The corpus has a sharp, almost clinical answer: fine-tuning can teach a model to produce better *answers* without teaching it to *reason* its way there, and standard metrics, which check only the final answer, are blind to the difference. The clearest evidence is the SFT accuracy trap (see "Does supervised fine-tuning improve reasoning or just answers?"), where supervised fine-tuning lifts benchmark accuracy but cuts Information Gain, the information each reasoning step actually contributes, by 38.9%. The model learns to write correct-looking conclusions through post-hoc rationalization: the steps decorate the answer rather than derive it.
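One way to make "information contributed by each reasoning step" concrete is sketched below. It assumes, as an illustration rather than the source's exact metric, that a step's gain is the drop in negative log-likelihood (NLL) of the correct answer once that step is added to the context. The `AnswerNLL` scorer and both toy scorers are hypothetical stand-ins for a real model.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer: returns the negative log-likelihood (NLL) that a model
# assigns to the correct answer, given the reasoning steps seen so far.
AnswerNLL = Callable[[Sequence[str]], float]


def step_information_gain(steps: Sequence[str], answer_nll: AnswerNLL) -> List[float]:
    """For each step, return how much adding it lowers the NLL of the correct answer."""
    gains: List[float] = []
    previous = answer_nll(steps[:0])  # baseline: no reasoning steps at all
    for i in range(1, len(steps) + 1):
        current = answer_nll(steps[:i])
        gains.append(previous - current)  # positive means the step helped
        previous = current
    return gains


# Toy scorers standing in for real models (illustrative only).
def load_bearing_nll(prefix: Sequence[str]) -> float:
    # Each additional step makes the correct answer more likely.
    return 3.0 - 1.0 * len(prefix)


def decorative_nll(prefix: Sequence[str]) -> float:
    # The answer is already fixed before any reasoning; the steps change nothing.
    return 0.5


steps = ["restate the problem", "set up the equation", "solve for x"]
gains_load_bearing = step_information_gain(steps, load_bearing_nll)
gains_decorative = step_information_gain(steps, decorative_nll)
```

In this toy setup the decorative scorer is more confident in the correct answer from the start, yet every step has zero gain. That is the shape of the trap: a final-answer metric would prefer it, while a per-step metric would flag it.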

A companion finding shows this isn't a measurement artifact but a real causal disconnection. After fine-tuning, reasoning chains stop *driving* the output (see "Does fine-tuning disconnect reasoning steps from final answers?"): you can truncate the chain early, paraphrase it, or swap in filler text, and the model produces the same answer more often than before. The reasoning has become performative: present on the page, but no longer load-bearing. So 'accuracy up, reasoning down' isn't a contradiction; it's what happens when training rewards the destination and ignores the route.
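The perturbation tests described above can be sketched as an answer-invariance rate: the fraction of examples whose answer stays the same after the chain is perturbed. This is a minimal sketch, not the cited study's harness. The `AnswerFn` interface and both toy models are hypothetical, and paraphrasing is omitted because it would need an external paraphraser (it would plug in as another `Perturbation`).

```python
from typing import Callable, List, Sequence, Tuple

Chain = List[str]
AnswerFn = Callable[[str, Chain], str]   # hypothetical: (question, chain) -> answer
Perturbation = Callable[[Chain], Chain]


def truncate_early(chain: Chain) -> Chain:
    """Early termination: keep the first half of the steps, rounding down."""
    return chain[: len(chain) // 2]


def substitute_filler(chain: Chain) -> Chain:
    """Filler substitution: replace every step with content-free text."""
    return ["..." for _ in chain]


def invariance_rate(
    examples: Sequence[Tuple[str, Chain]],
    answer_fn: AnswerFn,
    perturb: Perturbation,
) -> float:
    """Fraction of examples whose answer is unchanged by the perturbation."""
    if not examples:
        raise ValueError("need at least one example")
    unchanged = sum(
        answer_fn(question, perturb(chain)) == answer_fn(question, chain)
        for question, chain in examples
    )
    return unchanged / len(examples)


# Toy models (illustrative only).
def faithful_model(question: str, chain: Chain) -> str:
    # The answer is read off the last reasoning step.
    return chain[-1] if chain else "unknown"


def performative_model(question: str, chain: Chain) -> str:
    # The answer depends on the question alone; the chain is ignored.
    return str(len(question))


examples = [
    ("2+3?", ["add 2 and 3", "5"]),
    ("4*2?", ["multiply 4 by 2", "8"]),
]
faithful_rate = invariance_rate(examples, faithful_model, truncate_early)
performative_rate = invariance_rate(examples, performative_model, substitute_filler)
```

A higher invariance rate after fine-tuning than before would indicate that the chain has become less load-bearing, which is the pattern the note reports.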

Why does this happen so readily? Because chain-of-thought may have been fragile to begin with. Several notes argue CoT is closer to constrained imitation than genuine inference (see "Why does chain-of-thought reasoning fail in predictable ways?" and "What makes chain-of-thought reasoning actually work?"): models reproduce the *form* of reasoning by pattern-matching, which is exactly the kind of surface structure that fine-tuning is good at sharpening. When you optimize a pattern-matcher against a final-answer reward, it learns the shortest path to looking right. This also explains the distribution-bounded failures: CoT that works on training-like problems collapses under shifts in task, length, or format (see "Does chain-of-thought reasoning actually generalize beyond training data?"), precisely because the fitted form doesn't carry valid logic underneath it.

There's a deeper mechanical hint in how memorization creeps into reasoning. Token-level analysis finds that 'local' memorization, meaning predicting the next step from the immediately preceding tokens rather than from the problem, accounts for up to two-thirds of reasoning errors, and it worsens under distributional shift (see "Where do memorization errors arise in chain-of-thought reasoning?"). Fine-tuning that drills on answer patterns can amplify exactly this local-pattern reflex: the model leans harder on 'what usually comes next' and less on 'what this problem actually requires.'

The corpus also points to where the field is looking for repair. Some of the most effective interventions deliberately *avoid* weight updates: pruning low-attention verification steps at test time (see "Can reasoning steps be dynamically pruned without losing accuracy?"), or applying decoding-level penalties that stop models from wandering and prematurely abandoning good paths (see "Why do reasoning models abandon promising solution paths?"). That these work without fine-tuning is itself the lesson: the capability is often already present, and the risk of fine-tuning is that in chasing the benchmark it overwrites the reasoning machinery it was meant to strengthen. The quiet takeaway, worth sitting with, is that a higher score can be a symptom of damage, not proof of learning, and you only notice if you measure the steps, not just the answer.
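To illustrate what a decoding-level penalty can look like, here is a minimal sketch. It assumes the penalty subtracts a constant from the logits of "switch" tokens (for example, a token such as "Alternatively") during a fixed window after the current thought begins. The function name, the default values, and the token ids are all hypothetical placeholders, not settings taken from the cited work.

```python
from typing import List, Set


def apply_switch_penalty(
    logits: List[float],
    switch_token_ids: Set[int],
    tokens_since_thought_start: int,
    penalty: float = 3.0,   # illustrative strength, not a published value
    window: int = 600,      # illustrative duration in tokens, not a published value
) -> List[float]:
    """Lower the logits of thought-switching tokens early in a thought.

    Outside the window the logits are returned unchanged, so the model can
    still switch paths once it has spent enough tokens on the current one.
    """
    if tokens_since_thought_start >= window:
        return list(logits)
    return [
        logit - penalty if token_id in switch_token_ids else logit
        for token_id, logit in enumerate(logits)
    ]


def argmax(values: List[float]) -> int:
    """Index of the largest value (greedy decoding over a toy vocabulary)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy vocabulary of three tokens; id 1 plays the role of a switch token.
raw_logits = [1.0, 2.0, 0.5]
early = apply_switch_penalty(raw_logits, {1}, tokens_since_thought_start=10)
late = apply_switch_penalty(raw_logits, {1}, tokens_since_thought_start=600)
```

Early in the thought, greedy decoding moves from the switch token to the continuing token; after the window the original preference is restored. No weights change, which is the point the paragraph above makes about test-time repair.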


Sources (8 notes)

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Can reasoning steps be dynamically pruned without losing accuracy?

The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains: Why does fine-tuning sometimes raise benchmark accuracy while degrading the actual reasoning steps that produce answers?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025 and include:
- Supervised fine-tuning lifts benchmark accuracy while cutting the information gain of reasoning steps by ~39%; steps become decorative rather than load-bearing (2024, arXiv:2411.15382).
- Chain-of-thought may function as constrained imitation (pattern-matching the *form* of reasoning) rather than genuine inference, making it vulnerable to fine-tuning's tendency to sharpen surface structure (2025, arXiv:2506.02878).
- Local memorization (predicting the next step from the immediately preceding tokens) accounts for up to two-thirds of reasoning errors; fine-tuning can amplify local-pattern prediction ('what usually comes next') over problem-driven inference (2025, arXiv:2508.02037).
- CoT effectiveness is distribution-bounded; reasoning that works on training-like problems collapses under shifts in task, length, or format — a sign the fitted form lacks valid logic underneath (2025, arXiv:2508.01191).
- Test-time interventions (pruning low-attention steps, decoding-level penalties) repair reasoning *without* weight updates, suggesting the capability often pre-exists and fine-tuning overwrites rather than builds it (2025, arXiv:2508.02511).

Anchor papers (verify; mind their dates):
- arXiv:2411.15382 (Nov 2024): On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
- arXiv:2506.02878 (Jun 2025): CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate
- arXiv:2508.02037 (Aug 2025): Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
- arXiv:2508.02511 (Aug 2025): Test-time Prompt Intervention

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, ask: have newer models (o1, o3, or equivalents), training recipes (process-supervised RL, reward modeling over reasoning chains), evaluation harnesses (faithfulness metrics beyond accuracy), or orchestration methods (multi-agent reasoning, external verifiers) since *overturned* or *relaxed* these limitations? Separate the durable question (fine-tuning may still pose a risk to reasoning fidelity) from the perishable claim (SFT accuracy trap is unavoidable). Cite what loosened the constraint; flag where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — papers arguing fine-tuning can preserve or enhance reasoning, or showing the accuracy–reasoning tension is resolved by better supervision or architectural choices.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Can process-level supervision (rewarding *correct intermediate steps*) prevent the accuracy trap?"; "Do emergent reasoning models (scaling test-time compute) decouple fine-tuning risk from benchmark accuracy?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
