Why does reasoning fine-tuning reduce models' ability to abstain?
This explores why training a model to reason harder makes it less willing to say 'I don't know' — and whether that's a flaw in the reasoning or in the reward that shaped it.
The most direct answer in the corpus is also the bluntest: the training signal itself punishes abstention. When a model is optimized for reasoning performance, completing an answer is rewarded and declining is not, so it learns to answer more often while expressing unwarranted confidence, with roughly a 24% drop in appropriate abstention (see 'Does reasoning fine-tuning make models worse at declining to answer?'). Abstention isn't lost as a capability; it's trained away as a behavior, because 'I don't know' scores zero against a metric that only counts finished answers.
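The incentive can be made concrete with a little arithmetic. The sketch below is an illustration, not the setup used in any of the cited notes: `expected_reward`, `best_action`, and the `wrong_penalty` parameter are hypothetical names, and the reward values (+1 for a correct answer, 0 for abstaining) are assumptions chosen to mirror a metric that only counts finished answers.

```python
ABSTAIN_REWARD = 0.0  # assumption: 'I don't know' earns nothing


def expected_reward(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected reward for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def best_action(p_correct: float, wrong_penalty: float = 0.0) -> str:
    """Pick whichever action has the higher expected reward."""
    if expected_reward(p_correct, wrong_penalty) > ABSTAIN_REWARD:
        return "answer"
    return "abstain"


if __name__ == "__main__":
    for p in (0.05, 0.3, 0.6):
        # First column: wrong answers cost nothing. Second: they cost 1.
        print(p, best_action(p), best_action(p, wrong_penalty=1.0))
```

Under these assumptions, answering beats abstaining at any nonzero chance of being right when wrong answers are free. Abstention only becomes the better choice once wrong answers carry a cost, at which point the break-even confidence is `wrong_penalty / (1 + wrong_penalty)`.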
What makes this worse is that the reasoning being rewarded is often theater. Several notes converge on the finding that fine-tuning improves final-answer accuracy while hollowing out the inferential work behind it: supervised fine-tuning raises benchmark scores but cuts the information actually gained per reasoning step by nearly 39%, meaning the model arrives at answers through post-hoc rationalization rather than genuine inference (see 'Does supervised fine-tuning improve reasoning or just answers?'). Faithfulness tests show the same thing from another angle: after fine-tuning, you can truncate, paraphrase, or insert filler into the reasoning chain and the answer barely changes, so the chain has become performative rather than functional (see 'Does fine-tuning disconnect reasoning steps from final answers?'). A model whose reasoning no longer drives its conclusions has no internal signal telling it when it's actually uncertain, so it can't calibrate when to stop.
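A perturbation test of this kind can be sketched in a few lines. This is a toy harness under stated assumptions, not the protocol from the cited note: `invariance_rate`, `truncate`, `add_filler`, and the two stand-in "models" are hypothetical, and paraphrasing is left out because it would need a second model.

```python
from typing import Callable, List, Tuple

Steps = List[str]
AnswerFn = Callable[[str, Steps], str]


def truncate(steps: Steps) -> Steps:
    """Early termination: keep only the first half of the chain."""
    return steps[: len(steps) // 2]


def add_filler(steps: Steps) -> Steps:
    """Filler substitution: replace every step with a content-free token."""
    return ["..." for _ in steps]


def invariance_rate(answer_fn: AnswerFn,
                    examples: List[Tuple[str, Steps]],
                    perturb: Callable[[Steps], Steps]) -> float:
    """Fraction of examples whose answer survives the perturbation."""
    unchanged = sum(
        answer_fn(q, perturb(steps)) == answer_fn(q, steps)
        for q, steps in examples
    )
    return unchanged / len(examples)


def performative_model(question: str, steps: Steps) -> str:
    """Toy model that ignores its own reasoning chain."""
    return "42"


def functional_model(question: str, steps: Steps) -> str:
    """Toy model whose answer is read off the last reasoning step."""
    return steps[-1] if steps else "unknown"


if __name__ == "__main__":
    examples = [("q1", ["a", "b", "c", "42"]), ("q2", ["x", "y", "7"])]
    for model in (performative_model, functional_model):
        print(model.__name__,
              invariance_rate(model, examples, truncate),
              invariance_rate(model, examples, add_filler))
```

A high invariance rate means the answer does not depend on the chain, which is the signature the faithfulness tests look for. With a real model, `answer_fn` would wrap a call that conditions generation on the perturbed chain.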
There's a subtler trap underneath the confidence problem. What looks like careful reasoning is sometimes just a learned default. When constraints are stripped from a problem, twelve of fourteen models perform *worse*, revealing that they were never evaluating the constraints, just defaulting conservatively to harder-looking options (see 'Are models actually reasoning about constraints or just defaulting conservatively?'). Reasoning fine-tuning can sand off exactly this kind of hedging default, replacing 'play it safe' with 'commit to an answer,' which improves scores while removing the very caution that abstention depends on. And the longer the reasoning chain grows, the more the model drifts from its original instructions: extended chain-of-thought creates contextual distance that dilutes attention to the original ask (see 'Why do better reasoning models ignore instructions?'), so a directive like 'abstain if unsure' gets buried under the model's own generated tokens.
The fix the corpus points toward reframes the whole problem: knowing *when* to answer is a separate skill from knowing *how* to reason, and standard fine-tuning collapses the two. Base models already contain latent reasoning, and post-training mostly selects deployment timing rather than creating new capability (see 'Does RL post-training create reasoning or just deploy it?' and 'Do base models already contain hidden reasoning ability?'). That suggests abstention can be restored by training the routing decision directly: models taught to choose between extended thinking and a quick (or null) response via decoupled reinforcement learning recover self-calibrated routing without the mode collapse that punishes restraint (see 'Can models learn when to think versus respond quickly?').
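One way to see why decoupling might matter is to look at how much of the training signal reaches the routing decision. The notes here do not spell out the mechanism, so the sketch below rests on an assumption: that a joint objective averages the loss over every token in the sequence, leaving the single mode-selection token with a share that shrinks as the response gets longer. The function names and the `alpha` coefficient are hypothetical.

```python
def joint_weight(response_len: int) -> float:
    """Share of a length-averaged sequence loss that lands on the
    single mode-selection token (assumed: 1 mode token + response)."""
    return 1.0 / (1 + response_len)


def decoupled_weight(response_len: int, alpha: float = 1.0) -> float:
    """Decoupled objective: the mode token gets a fixed weight alpha,
    regardless of how long the response is."""
    return alpha


if __name__ == "__main__":
    for length in (9, 99, 1999):  # short answer vs. long reasoning chain
        print(length, joint_weight(length), decoupled_weight(length, alpha=0.5))
```

Under this assumption the routing token in a long reasoning trace receives a far smaller update than the same token in a short reply, which could tilt learning toward one mode. Giving the routing decision its own fixed weight removes that dependence on length.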
The thing you didn't know you wanted to know: abstention degrades not because reasoning makes models smarter and therefore overconfident, but because the standard reward teaches them that *finishing* is the goal — and a model rewarded only for finishing learns that the one answer never worth giving is the honest one.
Sources (8 notes)
Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies are present before any RL.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.