Can model confidence alone replace external answer verification?
Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.
RLVR (Reinforcement Learning with Verifiable Rewards) relies on domain-specific verifiers, which confines it to math and code. Two complementary approaches extend RLVR to general domains by replacing external verification with intrinsic signals.
RLPR (Reinforcement Learning with Reference Probability Reward) uses the LLM's own token probability of generating the reference answer as the reward signal. That probability reflects how well the reasoning process leads to the correct answer, i.e., how likely the model is to take the correct action. Two key innovations: (1) a probability-based reward computed from the average decoding probabilities of the reference answer tokens, which is more robust than the naive sequence likelihood, and (2) stabilization methods that address the high variance inherent in probability-based rewards. RLPR consistently improves reasoning across Gemma, Llama, and Qwen models on both general-domain and mathematical benchmarks.
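A minimal sketch of that core reward, assuming a PyTorch setting where we already have the model's logits at the positions of the reference answer tokens; the function name, tensor shapes, and the omitted stabilization steps are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def probability_reward(logits: torch.Tensor, reference_ids: torch.Tensor) -> torch.Tensor:
    """Mean decoding probability of the reference answer tokens (hypothetical sketch).

    logits:        (ans_len, vocab_size) logits at the positions where the
                   reference answer tokens would be decoded, conditioned on the
                   question and the model's own reasoning.
    reference_ids: (ans_len,) token ids of the reference answer.
    """
    probs = F.softmax(logits, dim=-1)                                # per-position distributions
    ref_probs = probs.gather(-1, reference_ids.unsqueeze(-1)).squeeze(-1)
    # Averaging per-token probabilities (rather than multiplying them into a
    # sequence likelihood) is what the note describes as the more robust choice.
    return ref_probs.mean()


# Toy usage: a 3-token reference answer over a vocabulary of size 5.
reward = probability_reward(torch.randn(3, 5), torch.tensor([1, 4, 2]))
```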
INTUITOR goes further: it uses the model's own confidence — self-certainty measured as average KL divergence between the output distribution and a uniform distribution — as its sole reward signal. No reference answers, no external verifiers, no labeled data. The approach is simple: replace the verifiable reward in GRPO with self-certainty scores. The mechanism builds on the observation that LLMs exhibit lower confidence on difficult problems; optimizing for confidence should drive the model toward more reliable reasoning.
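A sketch of the self-certainty score under similar assumptions (PyTorch, logits over the model's own generated tokens). The direction of the KL term, the function name, and the omitted GRPO plumbing are assumptions for illustration; here confidence is taken as the per-token KL divergence from a uniform distribution, averaged over the sequence.

```python
import torch
import torch.nn.functional as F


def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Average per-token KL divergence from the uniform distribution (hypothetical sketch).

    logits: (seq_len, vocab_size) logits for the model's own generated tokens.
    A sharply peaked output distribution (a confident model) yields a larger value.
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)                 # log p_t(j) per position
    log_uniform = -torch.log(torch.tensor(float(vocab_size)))  # log(1/|V|)
    # KL(U || p_t) = (1/|V|) * sum_j [log(1/|V|) - log p_t(j)]
    kl_per_token = (log_uniform - log_probs).mean(dim=-1)
    return kl_per_token.mean()                                 # averaged over the sequence
```

In GRPO-style training, this scalar would simply take the place of the verifiable reward when ranking sampled completions; no reference answer enters the computation.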
Both approaches raise the same fundamental question for future AI: as models develop capabilities beyond human evaluation, self-generated signals may be the only viable training pathway. Together with the related note "Can model confidence work as a reward signal for reasoning?", this is convergent evidence that intrinsic confidence signals can serve dual roles, improving both performance and reliability.
Building on "Can reasoning RL work without verifying generated answers?", RLPR and INTUITOR represent the next step: progressively weaker assumptions about what external signal is needed, from reference verification, to reference probability, to pure self-certainty.
Source: RLVR
Related concepts in this collection
- Can model confidence work as a reward signal for reasoning? Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages. (Convergent: confidence as reward improves both performance and calibration.)
- Can reasoning RL work without verifying generated answers? Most reasoning RL methods require answer verification, limiting them to math and code. Can models be trained to reason better in domains like medicine and law where verification is impractical? (RLPR and INTUITOR extend this progression.)
- Does self-consistency reliably reward correct answers during training? Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working? (Risk: confidence-based rewards may select for confident errors.)
- What limits how much models can improve themselves? Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types. (Intrinsic rewards face the same ceiling.)
Original note title
llm intrinsic probability of generating a correct answer can replace external verifiers as reward signal — extending rlvr to general domains