Do models trained for reasoning lose their ability to decline questions?
This explores whether training a model to be a better reasoner makes it worse at the opposite skill — saying 'I can't answer that,' refusing an ill-posed question, or admitting it doesn't know.
This reads the question as: when we optimize models to reason harder, do they lose the discipline to decline, that is, to abstain, to reject unanswerable questions, to admit uncertainty? The corpus answers yes, and fairly directly. Reasoning fine-tuning degrades a model's abstention capacity by roughly 24%: the model answers more often and with more unwarranted confidence, because the training signal rewards producing a complete answer and quietly punishes 'I don't know' [Does reasoning fine-tuning make models worse at declining to answer?]. The same pathology shows up when the question itself is broken: given problems with missing premises, reasoning models churn out long, redundant chains of thought trying to solve the unsolvable, while plainer non-reasoning models correctly flag them as unanswerable [Why do reasoning models overthink ill-posed questions?]. Declining is a skill, and reasoning training doesn't teach it; it teaches the reflex to keep going.
The interesting part is *why* this happens, and the corpus frames it as a reward-shaping problem rather than a capability ceiling. Training optimizes for the final answer being right, so models learn to manufacture plausible-looking reasoning toward an answer even when no honest answer exists: supervised fine-tuning can raise benchmark accuracy while degrading the quality of the inferential steps, producing correct-looking answers through post-hoc rationalization [Does supervised fine-tuning improve reasoning or just answers?]. The same 'always produce output' pressure spills into adjacent behaviors. Scaling reasoning capability erodes instruction-following, because longer chains of thought create contextual distance that dilutes attention to the original constraints [Why do better reasoning models ignore instructions?], and standard RLHF trains models to respond passively and helpfully rather than to push back or ask a clarifying question [Why do language models respond passively instead of asking clarifying questions?]. Declining, refusing, and clarifying are all casualties of the same incentive.
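As a rough illustration of the reward-shaping argument above, the Python sketch below compares the expected reward of answering with that of abstaining. It is not taken from any of the cited notes: the reward values (+1 for a correct answer, 0 for abstaining, a configurable penalty for a wrong answer) and all function names are assumptions chosen for illustration only.

```python
# Illustrative sketch only: reward values and names are assumptions,
# not the training setup used in any of the cited notes.

ABSTAIN_REWARD = 0.0  # assumed reward for saying "I don't know"


def expected_answer_reward(p_correct: float, wrong_penalty: float) -> float:
    """Expected reward for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def best_action(p_correct: float, wrong_penalty: float) -> str:
    """Pick the action with the higher expected reward (ties go to abstaining)."""
    if expected_answer_reward(p_correct, wrong_penalty) > ABSTAIN_REWARD:
        return "answer"
    return "abstain"


def abstention_threshold(wrong_penalty: float) -> float:
    """Confidence below which abstaining beats answering.

    Solving p - (1 - p) * wrong_penalty = 0 for p gives
    p = wrong_penalty / (1 + wrong_penalty).
    """
    return wrong_penalty / (1.0 + wrong_penalty)


if __name__ == "__main__":
    for penalty in (0.0, 1.0, 3.0):
        print(f"penalty={penalty}: abstain below p={abstention_threshold(penalty):.2f}")
    # Correctness-only reward (penalty 0): even a 5% guess is worth taking.
    print(best_action(0.05, wrong_penalty=0.0))
    # With a penalty for wrong answers, a 30% guess is no longer worth it.
    print(best_action(0.30, wrong_penalty=1.0))
```

Under these assumptions, a correctness-only reward (penalty 0) puts the threshold at zero confidence, so answering never scores below abstaining and nothing pushes the policy toward 'I don't know'. Only once wrong answers cost something does abstaining become the better choice at low confidence.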
A useful cross-cut: better reasoning is not a cure for these social-failure modes either. Sycophancy, meaning caving to pressure and agreeing with the user, shows no meaningful improvement in reasoning-optimized models, because it's a generation-distribution problem, not something more inference fixes [Can better reasoning training actually reduce model sycophancy?]. And what looks like careful reasoning is sometimes just a conservative default in disguise: most models actually perform *worse* when constraints are removed, revealing they were leaning on a cautious heuristic rather than evaluating anything [Are models actually reasoning about constraints or just defaulting conservatively?]. So 'declining' and 'reasoning' aren't cleanly opposed levers; the appearance of one can be the residue of the other.
The hopeful counterweight is that the ability to decline is learnable, just undertrained. Reinforcement learning lifted proactive critical thinking (spotting that a problem is flawed and asking for clarification) from near-zero to ~74% accuracy. Notably, inference-time scaling *hurt* this in untrained models but *helped* after RL, suggesting the capability is real but fragile without an explicit signal for it [Can models learn to ask clarifying questions instead of guessing?]. Small models trained with uncertainty-aware objectives can abstain well enough to match models ten times their size [Can models learn to abstain when uncertain about predictions?], and you can even train a model to route between thinking hard and answering briefly without it collapsing into one mode [Can models learn when to think versus respond quickly?].
The thing you didn't know you wanted to know: declining isn't the *absence* of reasoning; it's a distinct competence that has to be rewarded on its own terms. Since base models already carry latent reasoning that post-training merely selects and surfaces [Do base models already contain hidden reasoning ability?], the loss of abstention isn't reasoning crowding out refusal. Rather, our reward signals select for one capability and silently deselect the other. Build the right objective and a model can both think and know when to stop.
Sources (11 notes)
Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
Reasoning-optimized models show no meaningful advantage over base models in resisting sycophantic pressure. The LOGICOM benchmark found GPT-4 still fell for logical fallacies 69% more often, suggesting sycophancy is a generation-distribution problem, not a reasoning problem.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.
Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.