Can model confidence work as a reward signal for reasoning?
Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.
Reinforcement Learning from Self-Feedback (RLSF) exploits a simple observation: in a well-calibrated model, answer confidence correlates with reasoning quality. By using confidence as the reward signal rather than human preference or external verification, RLSF achieves two things simultaneously that normally trade off:
(i) Restores calibration — confidence becomes predictive of correctness again after RLHF degraded it. RLHF optimizes for human preference and fluency, which rewards confident-sounding outputs regardless of accuracy. RLSF reverses this by tying the reward explicitly to calibrated confidence (see the measurement sketch after these two points).
(ii) Strengthens step-by-step reasoning — higher-confidence answer spans tend to come from traces with more coherent reasoning chains. Training to maximize confidence indirectly selects for better reasoning.
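To make the calibration claim concrete, here is a minimal sketch of expected calibration error (ECE), the standard bin-based metric behind statements like "confidence is predictive of correctness". The function name and the ten-bin scheme are illustrative choices, not taken from the RLSF paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then take the bin-size-weighted
    mean |accuracy - confidence| gap across bins."""
    conf = np.asarray(confidences, dtype=float)   # per-answer confidence in [0, 1]
    acc = np.asarray(correct, dtype=float)        # 1.0 if the answer was right
    bin_ids = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
    return float(ece)
```

A well-calibrated model drives this toward zero; the RLHF failure mode shows up as high-confidence bins whose empirical accuracy falls well below their average confidence.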
The mechanism: a frozen LLM generates multiple chain-of-thought (CoT) solutions for each problem. Confidence is computed over the final-answer span of each trace. Traces are ranked by this confidence to create a synthetic preference dataset (higher confidence = chosen, lower = rejected). A reward model is trained on these preferences and used for standard RL finetuning, as sketched below.
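A minimal sketch of that ranking step, under stated assumptions: confidence is taken as the length-normalized probability of the final-answer span (exponentiated mean token log-prob), which is one common choice; the paper's exact scoring and pairing scheme may differ, and all names here are hypothetical:

```python
import math
from itertools import combinations

def answer_confidence(token_logprobs, answer_span):
    """Length-normalized confidence of the final-answer span:
    exp(mean token log-prob). One common choice, not necessarily
    the paper's exact definition."""
    start, end = answer_span
    span = token_logprobs[start:end]
    return math.exp(sum(span) / len(span))

def build_preference_pairs(traces):
    """traces: sampled CoT solutions for ONE problem, each a dict with
    'text', 'token_logprobs', and 'answer_span' (start, end) indices.
    Ranks traces by answer confidence and emits synthetic
    (chosen, rejected) pairs: higher confidence = chosen."""
    ranked = sorted(
        traces,
        key=lambda t: answer_confidence(t["token_logprobs"], t["answer_span"]),
        reverse=True,
    )
    # every higher-confidence trace is preferred over every lower one
    return [(hi["text"], lo["text"]) for hi, lo in combinations(ranked, 2)]
```

The resulting (chosen, rejected) pairs feed a standard Bradley-Terry reward-model objective, after which RL finetuning proceeds exactly as in ordinary RLHF.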
The key insight is that confidence-as-reward can be inserted as an additional post-training step after standard SFT and RLHF — patching the calibration damage that RLHF introduces without undoing its alignment benefits. This requires no human labels, gold answers, or externally curated rewards.
The human learning parallel is explicit: humans use confidence as an intrinsic reward signal when external feedback is unavailable. Metacognitive monitoring — the ability to track your own certainty — is how humans regulate their own learning without a teacher.
The connection to Does binary reward training hurt model calibration? is complementary: that work adds calibration as an explicit second reward term; RLSF uses calibration itself as the primary reward. Both address the same RLHF-induced calibration degradation from different angles.
The risk is the same as Does self-consistency reliably reward correct answers during training? — confidence and self-consistency are correlated proxies, both vulnerable to the model becoming confidently wrong. But RLSF's emphasis on calibration (making confidence track accuracy) is explicitly designed to resist this — the model is rewarded for being accurately confident, not just confident.
Extensions to general domains via RLPR and INTUITOR: Two RLVR (RL with verifiable rewards) papers extend intrinsic reward signals beyond math to general domains. RLPR (Reinforcement Learning with Reference Probability Reward) uses the model's token-level probability of generating a reference answer as the reward signal — the model's own knowledge of what a correct answer looks like replaces external verifiers. INTUITOR goes further: it uses self-certainty as the sole reward signal, measuring how sharply the model's next-token distributions deviate from uniform, i.e. how decisively it commits to its outputs. Both extend verifiable-reward RL to domains without rule-based verifiers (medicine, law, open-ended reasoning) — precisely the domains where external verification infrastructure is hardest to build. The convergence with RLSF is notable: all three use the model's internal probability landscape as the reward, but RLSF targets calibration restoration, RLPR targets domain extension, and INTUITOR targets complete verifier independence. See Can model confidence alone replace external answer verification?.
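A sketch of both intrinsic rewards as paraphrased above, with hedges: RLPR's reward is approximated here as the length-normalized probability the policy assigns to the reference answer's tokens, and INTUITOR's self-certainty as the mean KL divergence from the uniform distribution to the model's next-token distributions; the papers' exact normalization, clipping, and KL-direction details may differ:

```python
import numpy as np

def rlpr_reward(ref_answer_logprobs):
    """RLPR-style intrinsic reward (paraphrase): probability the policy
    assigns to the reference answer, length-normalized via the mean
    token log-prob so longer answers are not penalized."""
    return float(np.exp(np.mean(ref_answer_logprobs)))

def intuitor_self_certainty(step_logits):
    """INTUITOR-style self-certainty (paraphrase): KL divergence from
    the uniform distribution U to the model's next-token distribution p,
    averaged over generated positions. Peaked (decisive) distributions
    score high; an exactly uniform distribution scores 0."""
    logits = np.asarray(step_logits, dtype=float)                  # (T, V)
    logp = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
    vocab = logits.shape[-1]
    kl_per_step = -np.log(vocab) - logp.mean(axis=-1)   # KL(U || p) per step
    return float(kl_per_step.mean())
```

Note the structural difference: rlpr_reward still needs a reference answer (no verifier, but gold text), while intuitor_self_certainty needs nothing outside the model's own distributions, which is what "complete verifier independence" means.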
Related concepts in this collection
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  Relation: complementary approach (explicit calibration reward term vs. calibration as primary reward).
- Does self-consistency reliably reward correct answers during training?
  Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?
  Relation: RLSF shares the proxy reward structure but explicitly targets calibration to resist the hacking failure mode.
- Do users worldwide trust confident AI outputs even when wrong?
  Explores whether the tendency to over-rely on confident language model outputs transcends language and culture. Understanding this pattern is critical for designing safer human-AI interaction across diverse linguistic contexts.
  Relation: RLSF addresses the upstream cause: if models are better calibrated, user overreliance on confidence signals becomes less dangerous.
- Can model confidence alone replace external answer verification?
  Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.
  Relation (extends): RLPR/INTUITOR use intrinsic probability for domain extension; RLSF uses confidence for calibration restoration.
- Does preference optimization harm conversational understanding?
  Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
  Relation: RLSF addresses one dimension of the alignment tax. RLHF degrades both calibration and conversational grounding; by using confidence as an intrinsic reward, RLSF patches the calibration damage, showing that some alignment costs are design choices that can be reversed without undoing alignment benefits.
- Can we detect when language models confabulate?
  Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?
  Relation: RLSF's model confidence and semantic entropy are complementary self-referential uncertainty signals. RLSF uses internal token probabilities to restore calibration during training, while semantic entropy uses sampled-output clustering to detect confabulations at inference; both bypass the need for external ground truth.
Original note title: model confidence as intrinsic reward simultaneously restores calibration and improves reasoning — unlike RLHF, which optimizes preference at the cost of calibration