Reinforcement Learning for LLMs

Does the training objective determine the direction in which models fail at abstention?

Calibration failures might not be universal: different training approaches could push models toward opposite extremes, refusing too readily or answering overconfidently. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.

Note · 2026-02-23 · sourced from Alignment

LLM abstention calibration fails in both directions depending on the training objective, not the model's general capability:

Reasoning-trained models under-abstain. RL/RLHF training for reasoning optimizes answer generation, so abstention is implicitly penalized: "I don't know" receives no reward. Reasoning fine-tuning therefore makes models worse at declining to answer, and the result is overconfident models that respond when they shouldn't.
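
A toy expected-reward model makes the incentive explicit. This is a minimal sketch of my own, assuming a correctness-only binary reward rather than any particular lab's RLHF setup:

```python
def binary_reward(answered: bool, correct: bool) -> float:
    """Correctness-only reward: 1.0 for a correct answer, 0.0 otherwise.
    "I don't know" scores exactly the same as a wrong answer."""
    return 1.0 if (answered and correct) else 0.0

def expected_reward(p_correct: float, abstain: bool) -> float:
    """Expected reward for a policy with confidence p_correct."""
    if abstain:
        return 0.0            # abstention never pays under this scheme
    return p_correct * 1.0    # guessing pays off whenever p_correct > 0

# Even at 1% confidence, answering strictly dominates abstaining.
assert expected_reward(0.01, abstain=False) > expected_reward(0.01, abstain=True)
```

Under this objective there is no confidence level at which abstaining is optimal, so a policy trained against it should never learn to say "I don't know."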

Safety-trained models over-abstain. RLHF with a safety emphasis pushes uncertainty thresholds too high, so models refuse benign prompts or decline complex but answerable open-ended tasks. The TrustLLM benchmark documents this safety-training-driven over-refusal on completely safe questions.
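
The same toy model captures the opposite distortion. Assuming, purely for illustration, that safety training acts like a fixed bonus for refusing plus a penalty for wrong answers, the confidence level below which the model abstains rises with both terms:

```python
def abstain_threshold(wrong_penalty: float, refusal_bonus: float) -> float:
    """Confidence below which abstaining beats answering in the toy model.

    Answering yields p*1 + (1-p)*(-wrong_penalty); refusing yields
    refusal_bonus. Setting the two equal gives the threshold.
    """
    return (refusal_bonus + wrong_penalty) / (1.0 + wrong_penalty)

print(abstain_threshold(0.0, 0.0))  # 0.0 -- the under-abstention regime above
print(abstain_threshold(0.0, 0.4))  # 0.4 -- a refusal bonus alone makes the
                                    # model decline questions it would answer
                                    # correctly up to 40% of the time
```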

Base models split by domain complexity. On simple templated tasks they calibrate reasonably; in complex open-ended domains (legal reasoning, medical diagnosis) they set their uncertainty threshold too conservatively, under-answering questions they could handle.

The implication: "calibration" is not a single axis that one technique can fix. The training objective creates a characteristic failure signature, and a model tuned for both reasoning and safety faces contradictory calibration pressures, one pushing toward answering and the other toward refusing. This may explain why reasoning fine-tuning degrades abstention: it actively counteracts the safety training's conservative bias. A potential resolution exists at the reward-design level: binary reward training itself may be what hurts calibration, which suggests the axis conflict can be addressed by how outcomes are scored rather than by which capability is trained last.
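
One way to make that concrete (a sketch under the assumption that the fix is a graded reward; the note points at reward design but does not specify a scheme): score a wrong answer strictly below an abstention, and the optimal policy abstains exactly when confidence falls below a tunable threshold.

```python
def graded_reward(action: str, correct: bool, wrong_penalty: float = 1.0) -> float:
    """Graded scheme: correct = +1, abstain = 0, wrong = -wrong_penalty."""
    if action == "abstain":
        return 0.0
    return 1.0 if correct else -wrong_penalty

def optimal_action(p_correct: float, wrong_penalty: float = 1.0) -> str:
    """Answer iff E[answer] = p - (1-p)*wrong_penalty >= 0,
    i.e. iff p_correct >= wrong_penalty / (1 + wrong_penalty)."""
    threshold = wrong_penalty / (1.0 + wrong_penalty)
    return "answer" if p_correct >= threshold else "abstain"

print(optimal_action(0.3))  # "abstain" (threshold is 0.5 at the default penalty)
print(optimal_action(0.7))  # "answer"
```

Tuning `wrong_penalty` then sets the abstention threshold directly, instead of leaving it as an accidental byproduct of whichever fine-tuning stage ran last.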

For post-writing: connects to "the critical thinking problem" (reasoning training optimizes narrow thinking while degrading meta-cognitive judgment about when not to think) and the broader theme that training optimizes a target metric while degrading adjacent capabilities.
