Why does binary reward forcing degrade model calibration?

This explores why training models on pass/fail rewards (right vs. wrong, nothing in between) makes their confidence estimates unreliable — and what the corpus suggests as fixes.

Binary correctness rewards give a model +1 for a right answer and 0 for a wrong one, with no middle ground, and that pushes models toward overconfident guessing. The mechanism is almost embarrassingly simple once you see it: a binary reward never punishes a confident wrong answer any more than a hesitant one. Abstaining always scores zero and a wrong guess also scores zero, but a guess occasionally scores a point, so the expected reward favors always guessing. The model learns to be confident everywhere, including where it shouldn't be, and calibration (the match between how sure a model sounds and how often it's right) collapses. Does binary reward training hurt model calibration? shows this isn't a quirk of one setup but a provable consequence, and that adding a Brier score (a 'proper scoring rule' that explicitly penalizes confident errors) as a second reward term fixes it without trading away accuracy.
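A minimal sketch of that argument in expected-value terms may help. It assumes the model's answer is correct with true probability p, that it reports a confidence q, and that the combined reward takes the form correctness minus the squared error (q - y)^2. That form is one common way to add a Brier term and is an assumption here, not a detail confirmed by the note; the function names are hypothetical.

```python
def expected_binary(p, q):
    """Expected reward under a pure binary reward (+1 correct, 0 wrong).

    The stated confidence q is accepted but never used: that is the problem.
    """
    return p


def expected_binary_plus_brier(p, q):
    """Expected reward of correctness minus the Brier penalty (q - y)^2.

    ASSUMPTION: the two terms are simply added with equal weight.
    """
    # With probability p the answer is right (y = 1), otherwise wrong (y = 0).
    brier = p * (1.0 - q) ** 2 + (1.0 - p) * q ** 2
    return p - brier


def best_confidence(p, reward_fn, steps=100):
    """Grid-search the stated confidence q that maximizes the expected reward."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda q: reward_fn(p, q))


if __name__ == "__main__":
    p = 0.3  # hypothetical: the model is right 30% of the time on this question
    for q in (0.3, 0.6, 1.0):
        print(q, expected_binary(p, q), round(expected_binary_plus_brier(p, q), 3))
    print("best q with Brier term:", best_confidence(p, expected_binary_plus_brier))
```

Under the binary reward every stated confidence earns the same expected score, so nothing discourages claiming certainty. With the Brier term, the expected reward peaks when the stated confidence equals the true probability of being right.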

The deeper issue is that a single binary scalar throws away information the model could have used. Can scalar rewards capture all the information in agent feedback? makes this general point: feedback naturally carries two separable things — how good an action was (evaluative) and how it should change (directive) — and a scalar reward captures only the first. Binary reward is the most extreme compression of that scalar: it flattens the entire spectrum of 'how wrong, and in what way' into a single bit. Calibration is exactly the casualty, because calibration lives in the gradations the bit erased.

The corpus converges on a clear repair strategy: give the reward more than two states. Can three-way rewards fix the accuracy versus abstention problem? adds a third option, so the reward becomes correct (+1), hallucination (−1), or abstention (somewhere in between). That makes 'I don't know' a learnable move rather than one that can never beat guessing, and it cut hallucinations by 28.9% relative to binary-reward RL across four benchmarks. Can model confidence work as a reward signal for reasoning? goes further and uses the model's own answer-span confidence to rank reasoning traces into synthetic preferences, reversing RLHF's calibration damage while sharpening reasoning, notably without human labels. Both treat calibration not as something to bolt on afterward but as something the reward shape either preserves or destroys.
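The effect of the third state can be sketched as a simple decision rule. The abstention reward of 0 below is an assumption (the note only says it sits between −1 and +1), and the function names are hypothetical.

```python
def expected_guess(p, r_correct=1.0, r_wrong=-1.0):
    """Expected reward of answering when the answer is right with probability p."""
    return p * r_correct + (1.0 - p) * r_wrong


def should_abstain(p, r_abstain=0.0, r_correct=1.0, r_wrong=-1.0):
    """True when abstaining has strictly higher expected reward than guessing.

    ASSUMPTION: r_abstain = 0.0 is an illustrative intermediate value.
    """
    return r_abstain > expected_guess(p, r_correct, r_wrong)


if __name__ == "__main__":
    for p in (0.1, 0.4, 0.6, 0.9):
        ternary = should_abstain(p)               # correct +1, wrong -1, abstain 0
        binary = should_abstain(p, r_wrong=0.0)   # correct +1, wrong 0, abstain 0
        print(f"p={p}: abstain under ternary={ternary}, under binary={binary}")
```

With these illustrative values, the ternary reward makes abstaining the better move whenever the chance of being right falls below one half. Under the binary reward, abstaining never strictly beats guessing at any confidence level, which is why the policy has no reason to learn it.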

There's a useful tension worth pulling on here. Does negative reinforcement alone outperform full reinforcement learning? finds that training on negative samples alone — just suppressing wrong trajectories — often matches full RL while preserving the answer diversity that positive-only reinforcement crushes by piling probability mass onto a few favored answers. That probability-mass concentration is calibration collapse seen from another angle: the model becomes peaky and overconfident. So part of why binary reward hurts calibration is the same reason positive-only reinforcement narrows diversity — both reshape the output distribution toward overcommitment.

Worth knowing for the curious: this connects to a broader finding that reward-based RL mostly reshapes *which* answers a model commits to rather than expanding what it can do. What does reward learning actually do to model reasoning? and Does RLVR actually expand what models can reason about? show RLVR improves sampling efficiency by concentrating toward solutions already in the base model — which is precisely the dynamic that, taken too far with a crude binary signal, sacrifices calibrated uncertainty for confident commitment. The fix in every case is to make the reward carry more structure: a proper scoring rule, a third 'abstain' state, or a continuous confidence signal — anything richer than a single bit.
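The "better sampling efficiency, same capability boundary" pattern can be illustrated with a toy pass@k calculation. Every number below is hypothetical and chosen only to show the shape of the crossover. It assumes independent samples, so a problem with per-sample success probability p is solved within k tries with probability 1 - (1 - p)^k.

```python
def pass_at_k(per_problem_success, k):
    """Mean probability that at least one of k independent samples is correct."""
    solved = [1.0 - (1.0 - p) ** k for p in per_problem_success]
    return sum(solved) / len(solved)


# HYPOTHETICAL per-problem success rates over ten problems.
# "base": spread-out distribution, every problem is occasionally solved.
base = [0.2] * 10
# "peaked": mass concentrated on six problems, four never solved.
peaked = [0.9] * 6 + [0.0] * 4

if __name__ == "__main__":
    for k in (1, 4, 16, 64):
        print(k, round(pass_at_k(base, k), 3), round(pass_at_k(peaked, k), 3))
```

In this toy setup the peaked model wins at k = 1, while the spread-out model overtakes it as k grows, because the peaked model has given up on problems the base distribution still reaches occasionally.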

Sources 7 notes

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
