How should safety training and reasoning training balance abstention differently?
This explores a tension the corpus treats as two separate problems wearing the same word: when a model declines to answer for safety reasons (refusing a request) versus when it declines because it doesn't actually know (abstaining instead of hallucinating) — and why training each well requires opposite instincts.
'Abstention' means two different things depending on whether you're training for safety or for reasoning, and the corpus suggests the two should be tuned in nearly opposite directions. In reasoning training, abstention is a skill you want to *teach more of*: the model should learn to say 'I don't know' rather than fabricate. In safety training, refusal is something you want to make *more precise*, because over-refusal and biased refusal are themselves failures.
The sharpest tool for reasoning-side abstention is the ternary reward in Can three-way rewards fix the accuracy versus abstention problem? Binary right/wrong rewards quietly punish honesty: a guess that might be right scores at least as well as an honest 'I don't know,' so models learn to bluff. Giving abstention its own intermediate reward (correct +1, hallucination −1, abstain in between) makes honest non-answering learnable. In TruthRL, across four benchmarks, this cut hallucinations by 28.9% and improved truthfulness by 21.1% relative to binary-reward RL. The lesson: reasoning abstention has to be *rewarded into existence*, because the default gradient discourages it.
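The incentive argument can be made concrete with a minimal sketch. Everything here is illustrative rather than taken from TruthRL: the function names are hypothetical, the binary scheme is assumed to score wrong answers and abstentions as 0, and the abstention reward is assumed to be 0.0 (the source only says it is "in between" +1 and −1).

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    HALLUCINATION = "hallucination"
    ABSTAIN = "abstain"


def binary_reward(outcome: Outcome) -> float:
    # Assumption: binary RL gives 1 for a correct answer and 0 for anything
    # else, so an abstention is scored the same as a wrong answer.
    return 1.0 if outcome is Outcome.CORRECT else 0.0


def ternary_reward(outcome: Outcome, abstain_reward: float = 0.0) -> float:
    # Correct +1, hallucination -1, abstention in between.
    # Assumption: 0.0 is a placeholder for the intermediate value.
    if outcome is Outcome.CORRECT:
        return 1.0
    if outcome is Outcome.HALLUCINATION:
        return -1.0
    return abstain_reward


def expected_guess_reward(p_correct: float, reward_fn) -> float:
    # Expected reward of answering when the answer is right with
    # probability p_correct and a hallucination otherwise.
    return (p_correct * reward_fn(Outcome.CORRECT)
            + (1.0 - p_correct) * reward_fn(Outcome.HALLUCINATION))


def prefers_guessing(p_correct: float, reward_fn) -> bool:
    # True when guessing has strictly higher expected reward than abstaining.
    return expected_guess_reward(p_correct, reward_fn) > reward_fn(Outcome.ABSTAIN)


if __name__ == "__main__":
    for p in (0.1, 0.4, 0.8):
        print(p, prefers_guessing(p, binary_reward), prefers_guessing(p, ternary_reward))
```

Under these assumptions the binary scheme favors guessing at any nonzero chance of being right, while the ternary scheme favors guessing only when the model is more likely right than wrong. Moving the abstention reward shifts that break-even point.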
Safety-side refusal has the opposite pathology: there's often too much of it, and it's applied unevenly. Do AI guardrails refuse differently based on who is asking? shows GPT-3.5's refusal rates shifting based on who appears to be asking: age, gender, perceived ethnicity, even political lean and sports fandom. So a refusal isn't a clean signal of 'this is unsafe'; it's contaminated by sycophancy and demographic noise. And Does safety alignment harm models' ability to roleplay villains? shows the same heavy hand degrading legitimate capability: safety alignment monotonically erodes a model's ability to portray morally complex characters (scores fall from 3.21 for moral paragons to 2.62 for villains), substituting crude aggression for nuanced deception and manipulation. Here abstention needs *narrowing*, not amplifying.
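One way to make "applied unevenly" measurable is to compare refusal rates across personas for the same set of requests. The sketch below is an assumption-laden illustration, not the method of the cited study: the function names, the `(persona, refused)` record format, and the sample records are all hypothetical.

```python
from collections import defaultdict


def refusal_rates(records):
    """Return the refusal rate per persona.

    records: iterable of (persona, refused) pairs, where `refused` is a bool.
    This record format is a hypothetical one chosen for the example.
    """
    counts = defaultdict(lambda: [0, 0])  # persona -> [refusals, total]
    for persona, refused in records:
        counts[persona][0] += int(refused)
        counts[persona][1] += 1
    return {persona: r / n for persona, (r, n) in counts.items()}


def max_refusal_gap(rates):
    # Largest difference in refusal rate between any two personas.
    # A gap near 0 means refusal does not depend on who is asking.
    return max(rates.values()) - min(rates.values())


# Made-up records: the same four requests sent under two persona framings.
records = [
    ("persona_a", True), ("persona_a", False),
    ("persona_a", False), ("persona_a", False),
    ("persona_b", True), ("persona_b", True),
    ("persona_b", False), ("persona_b", False),
]

rates = refusal_rates(records)
gap = max_refusal_gap(rates)
```

A de-biasing objective would aim to shrink this gap while leaving refusals on genuinely unsafe requests intact, which is a different target from simply raising or lowering the overall refusal rate.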
The reason these can't share one knob is partly architectural. Why does reasoning training help math but hurt medical tasks? locates factual knowledge in lower layers and reasoning adjustment in higher ones — which is why reasoning training that sharpens math can quietly damage knowledge-heavy domains like medicine. An abstention policy tuned for a reasoning benchmark may misfire exactly where knowledge, not reasoning, should govern whether the model speaks. And Does preference optimization harm conversational understanding? shows the broader 'alignment tax': preference optimization rewards confident single-turn answers and suppresses the clarifying questions and hedges that honest abstention depends on — so safety-style preference training can actively erode the reasoning-style honesty you were trying to build.
The under-appreciated takeaway: 'should the model decline?' is the wrong unified question. Reasoning training wants to *increase* honest abstention against a gradient that punishes it; safety training wants to *sharpen and de-bias* refusal that's currently too broad and too socially reactive. Collapse them into one objective and you get a model that refuses the wrong people while still bluffing about the facts.
Sources (5 notes)
TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.
GPT-3.5 refuses requests at different rates for younger, female, and Asian-American personas, and sycophantically declines to engage with political positions users would disagree with. Sports fandom and other non-political signals also shift refusal sensitivity.
The Moral RolePlay benchmark shows LLM performance dropping from 3.21 for moral paragons to 2.62 for villains, with the largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.
A two-phase inference model shows that knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts by 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.