Does training objective determine which direction models fail at abstention?
Calibration failures might not be universal—different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
LLM abstention calibration fails in both directions depending on the training objective, not the model's general capability:
Reasoning-trained models under-abstain. RL/RLHF training for reasoning optimizes answer generation, and abstention is effectively penalized because "I don't know" receives no reward. As "Does reasoning fine-tuning make models worse at declining to answer?" argues, the result is overconfident models that answer when they shouldn't.
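A minimal sketch of the incentive, assuming a generic correctness-only reward (not any specific paper's implementation): abstention scores the same as a wrong answer, so the optimizer learns that guessing always weakly dominates saying "I don't know".

```python
def binary_reward(response: str, gold_answer: str) -> float:
    """Correctness-only reward typical of reasoning RL (assumed shape)."""
    if response.strip().lower() == "i don't know":
        return 0.0  # abstention earns nothing...
    return 1.0 if response.strip() == gold_answer else 0.0  # ...the same as a wrong guess

# Expected reward of guessing with even a small chance p of being right is p > 0,
# while abstaining is exactly 0, so the policy is pushed to always answer.
```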
Safety-trained models over-abstain. RLHF with safety emphasis raises uncertainty thresholds too high. Models refuse benign prompts or decline complex but answerable open-ended tasks. TrustLLM demonstrates safety-training-driven over-refusal on completely safe questions.
Base models split by domain complexity. In simple templated tasks, base models calibrate reasonably. In complex open-ended domains (legal reasoning, medical diagnosis), base models set their uncertainty threshold too conservatively, under-answering questions they could handle.
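One way to see all three regimes on one axis is a threshold-gated answering sketch: the model answers only when its self-estimated confidence clears a threshold tau, and training shifts tau in one direction or the other. The function and values below are illustrative assumptions, not measurements.

```python
def decide(confidence: float, tau: float) -> str:
    """Answer only if self-estimated confidence clears the abstention threshold tau."""
    return "answer" if confidence >= tau else "abstain"

# Same confidence, different training-induced thresholds (illustrative values):
print(decide(0.55, tau=0.40))  # reasoning-tuned regime: answers, possibly overconfidently
print(decide(0.55, tau=0.80))  # safety-tuned regime: refuses an answerable task
print(decide(0.55, tau=0.70))  # base model on a complex open-ended domain: too conservative
```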
The implication: "calibration" is not a single axis that can be fixed by one technique. The training objective creates a characteristic failure signature. A model tuned for both reasoning and safety faces contradictory calibration pressures — one pushes toward answering, the other toward refusing. This may explain why reasoning fine-tuning degrades abstention: it actively counteracts the safety training's conservative bias. A potential resolution exists: as "Does binary reward training hurt model calibration?" suggests, the axis conflict can be addressed at the reward design level.
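A sketch of what a reward-level fix could look like, using assumed values for a ternary scheme (correct > abstain > wrong); the actual scheme in the linked notes may differ. Giving abstention an intermediate reward turns it into a learnable middle option instead of collapsing it into "not correct".

```python
def ternary_reward(response: str, gold_answer: str) -> float:
    """Ternary reward: correct answers, abstentions, and errors are scored separately."""
    if response.strip().lower() in {"i don't know", "i cannot answer"}:
        return 0.3  # intermediate reward for honest abstention (assumed value)
    return 1.0 if response.strip() == gold_answer else -1.0  # confident errors are penalized

# Guessing now pays only when the model's chance p of being right clears a break-even point:
# p * 1.0 + (1 - p) * (-1.0) > 0.3  =>  p > 0.65 under these assumed values,
# so the model answers when reasonably sure and abstains otherwise.
```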
For post-writing: connects to "the critical thinking problem" (reasoning training optimizes narrow thinking while degrading meta-cognitive judgment about when not to think) and the broader theme that training optimizes a target metric while degrading adjacent capabilities.
Related concepts in this collection
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  Relevance: the primary evidence for reasoning-trained under-abstention.
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  Relevance: a potential resolution via reward design.
- Can models identify what information they actually need?
  When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
  Relevance: under-abstention is especially damaging when tasks are underspecified; models trained to always answer cannot identify what information is missing, creating a compound failure of forced answering on incomplete inputs.
- Does AI refusal on politics signal ethical restraint or capability limits?
  When AI models refuse to discuss political topics, is that a sign of principled safety training or a sign they lack the internal concepts to engage? Research on political feature representation suggests the answer may surprise you.
  Relevance: identifies a third mechanism for over-abstention distinct from safety training. Models refuse politically complex topics not because of safety constraints but because they lack sufficient internal representation to engage; safety-trained over-abstention (this note) and representation-poverty refusal (that note) produce the same surface behavior from different causes.
- Can three-way rewards fix the accuracy versus abstention problem?
  Standard RL forces models to choose between accuracy and honesty about uncertainty. Could treating correct answers, hallucinations, and abstentions as distinct reward outcomes let models learn when to say 'I don't know'?
  Relevance: ternary reward is the direct answer to the bidirectional abstention problem; an intermediate reward for abstention gives models a learnable signal that resolves both under-abstention (reasoning) and over-abstention (safety) at the reward design level.
Original note title
training objective determines abstention direction — reasoning training under-abstains while safety training over-abstains