How does optimizing for accuracy during training degrade downstream reasoning quality?
This explores why training a model to get more answers right can quietly hollow out the reasoning behind those answers — and what the corpus says is actually happening underneath.
The corpus tells a surprisingly consistent story: when you optimize a single thing, final-answer correctness, everything you *didn't* measure is free to decay. The sharpest version comes from work on supervised fine-tuning, which raises benchmark accuracy while cutting a measure called Information Gain by 38.9 percent (see "Does supervised fine-tuning improve reasoning or just answers?" and "Does supervised fine-tuning actually improve reasoning quality?"). The model still lands on the right answer, but it gets there by pattern-matching shortcuts and post-hoc rationalization rather than by reasoning its way forward. Standard metrics never catch this, because they only check whether the final answer is correct.
This is possible because the reasoning steps and the answer become *decoupled*. Faithfulness tests show that after fine-tuning, you can truncate the reasoning early, paraphrase it, or swap in filler, and the answer stays the same far more often (see "Does fine-tuning disconnect reasoning steps from final answers?"). The chain of thought turns performative: it looks like work being shown, but it no longer drives the conclusion. A stranger, complementary result pushes this further: models trained on deliberately corrupted, irrelevant reasoning traces perform about as well as those trained on correct ones (see "Do reasoning traces need to be semantically correct?"). If irrelevant traces train as well as correct ones, the traces were likely never carrying meaning to begin with; they were computational scaffolding. Accuracy optimization is perfectly happy to keep the scaffolding and throw away the building.
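The perturbation tests described above can be sketched as a small harness that measures how often the final answer survives a change to the reasoning. This is a minimal illustration, not the evaluation used in the cited work: `answer_fn`, the two toy models, and the example data are all hypothetical stand-ins, and paraphrasing is left as a pluggable callable because it needs a separate paraphraser.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: maps (question, reasoning) to a final answer string.
AnswerFn = Callable[[str, str], str]
Perturbation = Callable[[str], str]


def truncate(reasoning: str, keep_fraction: float = 0.5) -> str:
    """Early termination: keep only the first portion of the reasoning lines."""
    steps = reasoning.split("\n")
    keep = max(1, int(len(steps) * keep_fraction))
    return "\n".join(steps[:keep])


def filler(reasoning: str) -> str:
    """Filler substitution: replace every word with a contentless placeholder."""
    return " ".join("..." for _ in reasoning.split())


def invariance_rates(
    answer_fn: AnswerFn,
    examples: List[Tuple[str, str]],
    perturbations: Dict[str, Perturbation],
) -> Dict[str, float]:
    """Fraction of examples whose answer is unchanged under each perturbation.

    A high rate suggests the reasoning is not driving the answer.
    """
    rates = {}
    for name, perturb in perturbations.items():
        unchanged = 0
        for question, reasoning in examples:
            original = answer_fn(question, reasoning)
            perturbed = answer_fn(question, perturb(reasoning))
            unchanged += int(original == perturbed)
        rates[name] = unchanged / len(examples)
    return rates


# Toy stand-ins (assumptions for illustration only).
def shortcut_model(question: str, reasoning: str) -> str:
    """Ignores the reasoning entirely and answers from the question alone."""
    return {"2+3?": "5", "4+4?": "8"}[question]


def reasoning_model(question: str, reasoning: str) -> str:
    """Reads its answer off the last word of the reasoning."""
    return reasoning.split()[-1]


examples = [("2+3?", "2 plus 3\nsum is\n5"), ("4+4?", "4 plus 4\nsum is\n8")]
tests = {"truncate": truncate, "filler": filler}

shortcut_rates = invariance_rates(shortcut_model, examples, tests)
reasoning_rates = invariance_rates(reasoning_model, examples, tests)
```

In this toy setup the shortcut model's answers never change under perturbation, while the reasoning-dependent model's answers always do. Comparing these rates before and after fine-tuning is the kind of contrast the faithfulness tests rely on.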
Why is this active degradation rather than just a missed opportunity? Because single-objective training leaves unmeasured behaviors structurally unprotected. One line of work frames it directly: post-training faithfully steers models toward correct answers while suppressing things like epistemic verbalization, meaning the hedging, uncertainty-marking, and self-checking that are critical to generalizing beyond the training distribution (see "Can post-training objectives preserve reasoning style alongside correctness?"). Nothing in the loss function defends those features, so they erode. There's even a mechanical account of *where* the damage lands: direct weight fine-tuning corrupts knowledge stored in lower layers, whereas decoding-time proxy-tuning leaves base weights untouched, closes 88 to 91 percent of the alignment gap, and actually *beats* fine-tuning on knowledge tasks (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). The corruption isn't inevitable; it's a side effect of editing the wrong part of the model.
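To make "decoding-time" concrete, here is a minimal sketch of one decoding step. It assumes the commonly described proxy-tuning formulation, in which the logit difference between a small tuned model and its untuned counterpart is added to the large base model's logits. The source notes do not spell out the formula, so the arithmetic and all names below are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(
    base_logits: List[float],
    tuned_small_logits: List[float],
    untuned_small_logits: List[float],
) -> List[float]:
    """Next-token distribution for one decoding step.

    Assumed formulation: shift the base model's logits by the difference
    between a small tuned model and the same small model before tuning.
    No weights of the base model are modified.
    """
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)


# Toy three-token vocabulary (values are illustrative, not measured).
base = [2.0, 1.0, 0.0]
untuned = [0.0, 0.0, 0.0]
tuned = [0.0, 0.0, 3.0]  # tuning pushed probability toward token 2

no_shift = proxy_tuned_distribution(base, untuned, untuned)
with_shift = proxy_tuned_distribution(base, tuned, untuned)
```

When the tuned and untuned small models agree, the base distribution passes through unchanged. When they disagree, only the output distribution moves, which is consistent with the claim that knowledge stored in the base weights is left intact.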
Here's the twist that reframes the whole problem. Base models already contain the reasoning ability; five independent methods all *elicit* latent reasoning rather than installing it, which means post-training selects from what's there rather than creating something new (see "Do base models already contain hidden reasoning ability?"). So accuracy optimization isn't teaching reasoning. It's selecting for whatever produces correct answers most cheaply, and shortcuts are cheaper than genuine inference. You can even watch the selection happen at the token level: only about 20 percent of tokens are high-entropy "forking points" where reasoning decisions actually get made, and reinforcement learning with verifiable rewards (RLVR) concentrates its updates there (see "Do high-entropy tokens drive reasoning model improvements?"). Optimize narrowly and you sharpen the forks that pay off on the benchmark while letting the rest flatten.
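The notion of a high-entropy forking token can be illustrated with a short sketch that computes per-position entropy and flags the top fraction. The 20 percent figure comes from the notes. Selecting the top fraction within a single sequence, and the toy distributions, are assumptions made for illustration.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_token_mask(
    per_token_probs: List[List[float]], top_fraction: float = 0.2
) -> List[bool]:
    """Mark the highest-entropy positions in a sequence.

    Returns one boolean per position; True means the position is among the
    top `top_fraction` by entropy (at least one position is always marked).
    """
    entropies = [entropy(p) for p in per_token_probs]
    k = max(1, round(len(entropies) * top_fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(entropies))]


# Five toy positions over a four-token vocabulary (illustrative values).
sequence = [
    [0.97, 0.01, 0.01, 0.01],  # confident continuation
    [0.25, 0.25, 0.25, 0.25],  # genuine fork: every option equally likely
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]

mask = forking_token_mask(sequence)
```

A mask like this could be used to restrict a training loss to the flagged positions, which is the spirit of the finding that training only on forking tokens matches or exceeds full-gradient performance. The masking step itself is not implemented here.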
The encouraging counterpoint is that the fix is also about *what you optimize for*, not just how hard. When training adds an orthogonal objective, generating backward questions and reasoning in reverse, forward reasoning improves by 13.53 percent on average across 12 datasets, because the model is forced to genuinely understand the problem-solution relationship rather than memorize a path (see "Can backward reasoning during training improve forward reasoning?"). Length tells the same story from another angle: accuracy follows an inverted U with reasoning length, peaking and then collapsing as models overthink. In one result, accuracy fell from 87.3 to 70.3 percent as thinking tokens grew from roughly 1,100 to roughly 16K (see "Does more thinking time always improve reasoning accuracy?" and "Why does chain of thought accuracy eventually decline with length?"). The throughline across all of it: reasoning quality is a thing you have to name and measure on purpose, because the moment you optimize only for the right answer, the model will find a way to give you one without it.
Sources (11 notes)
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Research shows that post-training objectives faithfully guide models toward correct answers yet simultaneously suppress unmeasured behaviors like epistemic verbalization. Single-objective optimization creates blind spots where stylistic features critical to generalization are unprotected.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Training models simultaneously on forward reasoning, backward question generation, and backward reasoning improves forward-only performance by 13.53% average across 12 datasets. The mechanism: generating backward questions forces models to understand the inverse relationship between problem and solution, deepening understanding that transfers to forward reasoning without test-time overhead.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.