Reinforcement Learning for LLMs

Does binary reward training hurt model calibration?

Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.

Note · 2026-02-22 · sourced from Reasoning by Reflection

Binary correctness reward is the dominant approach to RL training for reasoning: a correct answer earns 1, an incorrect answer earns 0. RLCR (RL with Calibration Rewards) identifies its structural flaw: a binary reward never penalizes high-confidence wrong answers. An incorrect answer asserted with 99% confidence scores the same zero as one hedged at 51%, and a correct answer earns the same 1 regardless of the confidence attached. The model therefore has no incentive to match its expressed confidence to its actual accuracy: high-confidence guessing is rational as long as it succeeds often enough.
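
A minimal sketch of the incentive gap (illustrative Python, not code from the paper): under a binary reward, the confidence the model verbalizes never enters the objective, so the expected reward is identical whatever confidence is stated.

```python
def binary_reward(answer_correct: bool, stated_confidence: float) -> float:
    """Correctness-only reward: 1 if correct, 0 otherwise.

    stated_confidence is ignored entirely, so nothing distinguishes a
    hedged answer from a maximally confident one.
    """
    return 1.0 if answer_correct else 0.0

# Expected reward for a model that is right with probability p is the same
# whether it verbalizes 51% or 99% confidence:
p = 0.6
for q in (0.51, 0.99):
    expected = p * binary_reward(True, q) + (1 - p) * binary_reward(False, q)
    print(f"q={q:.2f}: expected reward = {expected:.2f}")  # 0.60 both times
```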

The consequence is calibration degradation: models become more confident over the training run but not proportionally more accurate. On out-of-domain problems, where accuracy doesn't keep pace with confidence, the degradation produces higher rates of confident incorrect answers — what the paper frames as increased hallucination frequency.

The mathematical fix: add the Brier score as a second reward term alongside binary correctness. The Brier score is a proper scoring rule; used as a reward (the negated squared error between the stated confidence and the 0/1 correctness outcome), its expectation is uniquely maximized when the predicted probability exactly matches the true probability of being correct. The composite RLCR reward is therefore provably maximized only when the model (1) outputs the most likely correct answer AND (2) expresses a calibrated confidence estimate. The proof holds for any bounded proper scoring rule as the calibration term.
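
A minimal sketch of the composite reward, assuming the form correctness minus the squared error between stated confidence and the correctness outcome (the exact scaling used in the paper may differ):

```python
def rlcr_reward(answer_correct: bool, stated_confidence: float) -> float:
    """Assumed composite reward: R = 1[correct] - (q - 1[correct])^2."""
    correct = 1.0 if answer_correct else 0.0
    return correct - (stated_confidence - correct) ** 2

def expected_rlcr(p: float, q: float) -> float:
    """Expected composite reward when the answer is correct with probability p."""
    return p * rlcr_reward(True, q) + (1 - p) * rlcr_reward(False, q)

# The expectation is maximized by the calibrated report q = p ...
p = 0.6
best_q = max((q / 100 for q in range(101)), key=lambda q: expected_rlcr(p, q))
print(best_q)  # 0.6

# ... and at that calibrated optimum it is still increasing in p, so the model
# is also pushed toward the answer it is most likely to get right:
print(expected_rlcr(0.6, 0.6), expected_rlcr(0.4, 0.4))  # roughly 0.36 vs 0.16
```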

A surprising negative result: the log score (log-likelihood), also a proper scoring rule, does NOT have this property when combined with the binary correctness reward; because it is unbounded, it can incentivize incorrect answers under specific confidence profiles. The boundedness of the Brier score is what enables the joint optimization guarantee. The toy example below illustrates the failure mode.
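
A toy numerical check (my own construction, not an example from the paper): suppose the model reports the calibrated confidence q = p for whichever answer it outputs, which is optimal under any proper scoring rule. The Brier composite then evaluates to p^2, which is increasing in p, so the more likely answer always wins; the log composite is p + p ln p + (1 - p) ln(1 - p), which is not monotone for small p, so a less likely answer can score higher in expectation.

```python
import math

def expected_brier_composite(p: float) -> float:
    """Expected (correctness + Brier reward) at the calibrated report q = p."""
    return p - p * (1 - p) ** 2 - (1 - p) * p ** 2   # simplifies to p**2

def expected_log_composite(p: float) -> float:
    """Expected (correctness + log score) at the calibrated report q = p."""
    return p + p * math.log(p) + (1 - p) * math.log(1 - p)

# Two candidate answers: one correct with probability 0.2, the other with 0.1.
for p in (0.2, 0.1):
    print(f"p={p}: brier={expected_brier_composite(p):+.3f}  log={expected_log_composite(p):+.3f}")

# Brier composite:  +0.040 vs +0.010 -> the more likely answer (p=0.2) wins.
# Log composite:    -0.300 vs -0.225 -> the LESS likely answer (p=0.1) scores
# higher, so the unbounded term can pull the policy toward a worse answer.
```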

Empirically: across diverse datasets, RLCR substantially improves calibration on both in-domain and out-of-domain evaluations with no accuracy cost. Standard RL hurts calibration; RLCR improves it.

The RLSF (RL with Self-Confidence) framework provides a complementary approach: instead of adding an external calibration term, it uses the model's own verbalized confidence as an intrinsic reward signal. The model generates a confidence estimate alongside its answer, and the reward combines correctness with confidence calibration. This is architecturally simpler than RLCR's Brier score approach but relies on the model's ability to self-assess, which "Does reflection in reasoning models actually correct errors?" suggests may be unreliable. RLCR's mathematical guarantee may be more robust than RLSF's empirical approach.

Two complementary robustness approaches from the reward hacking literature:

  1. Bayesian Reward Model Ensembles (BRME): train a multi-head reward model where each head outputs the mean and standard deviation of a Gaussian reward estimate. The head with the lowest standard deviation (i.e., the highest confidence) provides the nominal reward. The ensemble characterizes an uncertainty set of reward functions, enabling a composite objective that balances nominal performance with worst-case robustness. This addresses calibration from the reward model side rather than the reward function side (a sketch of the head-selection step follows this list).

  2. Contrastive Rewards: score a set of baseline responses offline, then penalize each online response's reward by the baseline reward during PPO, so the policy is optimized against the difference between its reward and the baseline's. This calibrates the RL process by making rewards relative rather than absolute, penalizing reward uncertainty and adjusting for task difficulty. The contrastive signal provides implicit comparative information that absolute rewards lack (also sketched below).
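
A minimal sketch of the BRME head-selection step, assuming a shared encoder with k Gaussian heads; the module names and the robust blending weight are illustrative, not values from the paper.

```python
import torch
import torch.nn as nn

class MultiHeadRewardModel(nn.Module):
    """Shared encoder with k heads, each predicting a Gaussian (mean, std) reward."""

    def __init__(self, encoder: nn.Module, hidden_dim: int, num_heads: int = 5):
        super().__init__()
        self.encoder = encoder
        self.mean_heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(num_heads)])
        self.log_std_heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(num_heads)])

    def forward(self, inputs: torch.Tensor):
        h = self.encoder(inputs)                                          # (batch, hidden_dim)
        means = torch.cat([head(h) for head in self.mean_heads], dim=-1)  # (batch, k)
        stds = torch.cat([head(h) for head in self.log_std_heads], dim=-1).exp()
        return means, stds

def brme_reward(means: torch.Tensor, stds: torch.Tensor, worst_case_weight: float = 0.2):
    """Nominal reward from the most confident head, blended with the worst case.

    The lowest-std head supplies the nominal reward; the minimum mean over the
    ensemble stands in for the worst case over the uncertainty set. The blend
    weight is an illustrative choice.
    """
    nominal = means.gather(-1, stds.argmin(dim=-1, keepdim=True)).squeeze(-1)
    worst_case = means.min(dim=-1).values
    return (1 - worst_case_weight) * nominal + worst_case_weight * worst_case
```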

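A sketch of the contrastive shaping, assuming one precomputed baseline score per prompt and a penalty coefficient beta (both assumptions, not details from the paper):

```python
def contrastive_reward(reward_online: float, reward_baseline: float, beta: float = 1.0) -> float:
    """Shape the PPO reward by subtracting an offline baseline's reward.

    reward_online:   reward-model score of the policy's sampled response
    reward_baseline: precomputed score of a baseline response to the same prompt
    beta:            penalty coefficient (illustrative default)

    Easy prompts carry high baseline scores, so the policy is only rewarded for
    beating the baseline; the signal is relative rather than absolute.
    """
    return reward_online - beta * reward_baseline

# Inside the PPO loop, replace the raw reward-model score with
# contrastive_reward(score, baseline_scores[prompt_id]) before computing advantages.
```
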
Both approaches are complementary to RLCR: BRME addresses reward model uncertainty, contrastive rewards address reward signal relativity, and RLCR addresses the fundamental incentive structure of binary rewards. Together they suggest calibration degradation has multiple attack surfaces — no single fix addresses all of them.

Connects to "Does reasoning fine-tuning make models worse at declining to answer?": both identify the calibration cost of reasoning training. RLCR reframes this as a reward design failure rather than an inherent trade-off: the degradation is a property of binary-only reward, not of reasoning training as such.


Source: Reasoning by Reflection, Reward Models

Original note: binary reward RL provably degrades calibration; adding a proper scoring rule as a second reward term jointly optimizes accuracy and calibration without trade-off.