Does binary reward training hurt model calibration?
Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
Binary correctness reward is the dominant approach to RL training for reasoning: a correct answer earns 1, an incorrect one earns 0. The RLCR paper identifies its structural flaw: a binary reward never penalizes a high-confidence wrong answer. The reward for "correct with 99% confidence" equals the reward for "correct with 51% confidence," and likewise for the two incorrect cases. The model therefore has no incentive to match its expressed confidence to its actual accuracy — confidence statements are costless, so overconfident guessing is never punished.
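To see the missing incentive concretely (notation mine: let $p$ be the model's probability of answering correctly and $q$ its stated confidence), the expected binary reward is

$$\mathbb{E}[R_{\text{binary}}] = p \cdot 1 + (1 - p) \cdot 0 = p,$$

which does not depend on $q$ at all: reporting $q = 0.99$ on a coin-flip guess is exactly as reward-optimal as reporting $q = 0.51$.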
The consequence is calibration degradation: models become more confident over the training run but not proportionally more accurate. On out-of-domain problems, where accuracy doesn't keep pace with confidence, the degradation produces higher rates of confident incorrect answers — what the paper frames as increased hallucination frequency.
The mathematical fix: add the Brier score as a second reward term alongside binary correctness. The Brier score is a bounded, strictly proper scoring rule: its expected value is optimized exactly when the predicted probability matches the true probability of the outcome. The composite RLCR reward is therefore provably maximized only when the model (1) outputs the answer it believes most likely to be correct AND (2) reports a calibrated confidence estimate. The proof holds for any bounded proper scoring rule used as the calibration term.
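A minimal sketch of that composite reward in Python, assuming the additive form described above (the function name and the equal weighting of the two terms are illustrative, not the paper's exact formulation):

```python
def rlcr_style_reward(is_correct: bool, confidence: float) -> float:
    """Binary correctness plus a negated Brier penalty on the stated confidence.

    The Brier penalty (confidence - outcome)^2 is bounded in [0, 1], and its
    expectation over outcomes is minimized exactly when confidence equals
    the true probability of being correct.
    """
    outcome = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - outcome) ** 2
    return outcome - brier_penalty
```

In expectation this reward is p - p(1 - q)^2 - (1 - p)q^2, which is maximized at q = p: no amount of confidence inflation beats honest reporting.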
A surprising negative result: the log-likelihood score, also a proper scoring rule but an unbounded one, does NOT have this property when combined with binary correctness reward — it can incentivize incorrect answers under specific confidence profiles. The boundedness of the Brier score is what enables the joint optimization guarantee.
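A small numeric check of why boundedness matters; this is a reconstruction consistent with the claim above, not necessarily the paper's exact counterexample. Assume the composite is correctness plus the scoring rule, and evaluate each candidate answer at its optimal confidence report q = p, where p is that answer's probability of being correct:

```python
import math

def log_composite_at_optimum(p: float) -> float:
    """Expected (correctness + log-score) reward with confidence set to its
    per-answer optimum q = p: equals p + p*ln(p) + (1-p)*ln(1-p)."""
    return p + p * math.log(p) + (1 - p) * math.log(1 - p)

def brier_composite_at_optimum(p: float) -> float:
    """Expected (correctness - Brier penalty) reward at q = p, which
    simplifies to p**2 and is strictly increasing in p."""
    return p ** 2

# The log-score composite is not monotone in the probability of being correct:
# an answer believed 1% likely to be right outscores one believed 10% likely,
# because the unbounded log term rewards confidently predicting one's own
# failure. The bounded Brier composite always prefers the likelier answer.
print(log_composite_at_optimum(0.10))    # ~ -0.225
print(log_composite_at_optimum(0.01))    # ~ -0.046  (higher than at p = 0.10)
print(brier_composite_at_optimum(0.10))  # 0.010
print(brier_composite_at_optimum(0.01))  # 0.0001
```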
Empirically: across diverse datasets, RLCR substantially improves calibration on both in-domain and out-of-domain evaluations with no accuracy cost. Standard RL hurts calibration; RLCR improves it.
The RLSF (RL with Self-Confidence) framework provides a complementary approach: instead of adding an external calibration term, it uses the model's own verbalized confidence as an intrinsic reward signal. The model generates a confidence estimate alongside its answer, and the reward combines correctness with confidence calibration. This is architecturally simpler than RLCR's Brier-score approach, but it relies on the model's ability to self-assess — which "Does reflection in reasoning models actually correct errors?" suggests may be unreliable. RLCR's mathematical guarantee may therefore be more robust than RLSF's empirical approach.
Two complementary robustness approaches from the reward-hacking literature (both are sketched in code after the second item):
Bayesian Reward Model Ensembles (BRME) — train a multi-head reward model where each head outputs the mean and standard deviation of a Gaussian reward estimate. The head with the lowest standard deviation (i.e., the highest confidence) provides the nominal reward. The ensemble characterizes an uncertainty set over reward functions, enabling a composite objective that balances nominal performance against worst-case robustness. This addresses calibration from the reward model side rather than the reward function side.
Contrastive Rewards — compute baseline responses offline, then use the reward difference between online-generated and baseline responses as a penalty term in PPO. This calibrates the RL process by making rewards relative rather than absolute, penalizing reward uncertainty and calibrating according to task difficulty. The contrastive signal provides implicit comparative information that absolute rewards lack.
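Minimal sketches of both mechanisms, under stated assumptions (function names, tensor shapes, and the beta weight are illustrative, not the papers' reference implementations):

```python
import torch

def brme_nominal_reward(means: torch.Tensor, stds: torch.Tensor) -> torch.Tensor:
    """BRME-style selection: each ensemble head predicts a Gaussian (mean, std)
    over the reward; the head reporting the lowest std (highest confidence)
    supplies the nominal reward. Shapes: [batch, num_heads] -> [batch]."""
    most_confident = stds.argmin(dim=-1, keepdim=True)
    return means.gather(-1, most_confident).squeeze(-1)

def robust_objective(means: torch.Tensor, stds: torch.Tensor,
                     beta: float = 0.5) -> torch.Tensor:
    """Composite objective balancing the nominal reward against the worst head
    in the uncertainty set (beta is an illustrative trade-off weight)."""
    worst_case = means.min(dim=-1).values
    return (1 - beta) * brme_nominal_reward(means, stds) + beta * worst_case

def contrastive_reward(r_online: torch.Tensor, r_baseline: torch.Tensor) -> torch.Tensor:
    """Contrastive-rewards-style shaping: the baseline response's reward
    (computed offline for the same prompt) acts as a per-prompt penalty, so
    the policy is scored on improvement over the baseline, not on an
    absolute value."""
    return r_online - r_baseline
```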
Both approaches are complementary to RLCR: BRME addresses reward model uncertainty, contrastive rewards address reward signal relativity, and RLCR addresses the fundamental incentive structure of binary rewards. Together they suggest calibration degradation has multiple attack surfaces — no single fix addresses all of them.
Connects to "Does reasoning fine-tuning make models worse at declining to answer?": both identify the calibration cost of reasoning training. RLCR reframes this as a reward design failure rather than an inherent trade-off — the degradation is a property of binary-only reward, not of reasoning training as such.
Source: Reasoning by Reflection, Reward Models
Related concepts in this collection
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  RLCR reframes: calibration degradation is a binary-reward design choice, not an inherent trade-off; it is fixable with a proper scoring rule.
- Does step-level confidence outperform global averaging for trace filtering?
  Explores whether measuring confidence at individual reasoning steps—rather than averaging across entire traces—better identifies and filters out low-quality reasoning. Matters because it could dramatically improve both accuracy and compute efficiency in multi-trace reasoning.
  RLCR produces the calibrated model that makes verbalized-confidence scaling meaningful at test time.
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  Mechanistic link: entropy collapse and calibration degradation are two faces of the same RL dynamic; as the policy concentrates probability mass (entropy collapses), expressed confidence increases without matching accuracy gains.
- Why do reasoning models fail at predicting disagreement?
  RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?
  Binary reward's calibration failure extends to disagreement prediction: correct/incorrect cannot represent variance distributions; RLVR models lose sensitivity to legitimate annotation spread.
- Why do accurate predictions lead to poor decisions?
  Predictive models are built to fit data, not to optimize decision outcomes. This note explores when and why accurate forecasts fail to produce good choices.
  Calibration degradation is a specific instance of the prediction-decision gap: binary reward optimizes for correct answers (decision quality) while degrading probability estimates (prediction quality); RLCR's composite reward explicitly separates these two objectives within a single reward function.
- Can utility-weighted training loss actually harm model performance?
  When engineers weight loss functions to reflect real-world costs of different errors, does this improve or undermine learning? This explores whether baking asymmetric objectives into training creates unintended side effects.
  General principle: binary reward is a special case of the learning/choosing conflation; it correctly incentivizes choosing the right answer but fails to incentivize learning calibrated uncertainty; the RLCR fix (add Brier score) operationalizes the "separate learning from choosing" prescription in the RL reward design space.
- Can model confidence work as a reward signal for reasoning?
  Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.
  Complementary approach: RLCR adds an external calibration term (Brier score) to the reward; RLSF uses the model's own confidence as the reward signal; RLCR has a mathematical guarantee while RLSF is architecturally simpler but relies on self-assessment quality.
- Can uncertainty estimation replace complex adaptive retrieval?
  Is a simpler approach using model confidence signals sufficient to decide when retrieval is needed, or do complex multi-call adaptive pipelines deliver meaningful benefits?
  Calibration quality is the upstream prerequisite for uncertainty-triggered retrieval: FLARE and similar systems rely on token-probability confidence as a reliable signal; binary RL training degrades exactly this calibration, undermining the assumption that low-probability tokens reliably signal knowledge gaps.
- Can we detect when language models confabulate?
  Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?
  Semantic entropy provides an alternative uncertainty signal that operates at the meaning level rather than the token level; well-calibrated models (the output of RLCR) should have lower semantic entropy on questions they answer correctly, creating a testable link between calibration quality and confabulation detection.
Original note title: binary reward rl provably degrades calibration — adding a proper scoring rule as a second reward term jointly optimizes accuracy and calibration without trade-off