Learning to Reason without External Rewards
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose INTUITOR, an RLIF method that uses a model’s own confidence—termed self-certainty—as its sole reward signal. INTUITOR replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning.
Despite these successes, both RLHF and RLVR face fundamental limitations that constrain their broader applicability. RLHF requires extensive human annotation, making it expensive and potentially biased [Gao et al., 2023]. RLVR, while avoiding learned reward models, demands domain-specific verifiers and gold-standard solutions. In mathematics, this requires expert annotation of solutions; in code generation, it necessitates comprehensive test suites and execution environments [Liu et al., 2023, Liu and Zhang, 2025, Team et al., 2025, Xiaomi LLM-Core Team, 2025]. These requirements limit RLVR to carefully curated domains and complicate deployment in open-ended scenarios. Moreover, outcome-oriented verifiable rewards limit transferability to other domains. These challenges motivate exploration of more general and scalable reward paradigms, leading to a critical research question: Can LLMs enhance their reasoning abilities by relying solely on intrinsic, self-generated signals, without recourse to external verifiers or domain-specific ground truth?
In this paper, we introduce and explore such a paradigm: Reinforcement Learning from Internal Feedback (RLIF), where models optimize intrinsic feedback to improve performance without external rewards or supervision. The motivation for RLIF extends to future scenarios where models develop superhuman capabilities that become difficult for humans to evaluate directly [Burns et al., 2023], requiring self-improvement through intrinsic mechanisms [Oudeyer and Kaplan, 2007].
Under the RLIF paradigm, we propose INTUITOR, a novel reinforcement learning approach that leverages a model’s own confidence as an intrinsic reward. This builds on observations that LLMs exhibit lower confidence on difficult problems [Farquhar et al., 2024, Kuhn et al., 2023, Kang et al., 2024, 2025]; optimizing the model to be more confident in its outputs should therefore encourage stronger reasoning. Specifically, we use self-certainty [Kang et al., 2025], the KL divergence between the model’s output distribution and a uniform distribution, averaged over output tokens, as our confidence measure. This metric has proven useful for distinguishing high-quality responses from flawed ones [Kang et al., 2025, Ma et al., 2025]. Building on this insight, INTUITOR guides learning through self-generated signals, eliminating the need for external supervision or handcrafted rewards. The implementation of INTUITOR is simple, efficient, and effective: we replace the verifiable reward signal in an existing RLVR framework, Group Relative Policy Optimization (GRPO) [Shao et al., 2024], with self-certainty scores, while keeping the same policy gradient algorithm.
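To make this concrete, a minimal formal sketch follows, under the assumptions that the average is taken per output token and that the divergence is measured from the uniform distribution to the model’s next-token distribution. For a query $q$, a sampled response $o = (o_1, \ldots, o_{|o|})$, and vocabulary $\mathcal{V}$,
\[
\mathrm{Self\text{-}certainty}(o \mid q)
\;=\; \frac{1}{|o|} \sum_{i=1}^{|o|} D_{\mathrm{KL}}\!\left(\,\mathcal{U} \,\middle\|\, \pi_\theta(\cdot \mid q, o_{<i})\right)
\;=\; -\frac{1}{|o|\,|\mathcal{V}|} \sum_{i=1}^{|o|} \sum_{j=1}^{|\mathcal{V}|} \log\!\big(|\mathcal{V}| \cdot \pi_\theta(j \mid q, o_{<i})\big),
\]
where $\mathcal{U}$ denotes the uniform distribution over $\mathcal{V}$ and $\pi_\theta(\cdot \mid q, o_{<i})$ is the policy’s next-token distribution. Under this reading, each of the $G$ responses $\{o^{(k)}\}_{k=1}^{G}$ sampled for a query is scored with $r_k = \mathrm{Self\text{-}certainty}(o^{(k)} \mid q)$ in place of a verifiable reward, and the scores are normalized within the group following the standard GRPO advantage computation [Shao et al., 2024], e.g. $\hat{A}_k = \big(r_k - \mathrm{mean}(\{r_1, \ldots, r_G\})\big) / \mathrm{std}(\{r_1, \ldots, r_G\})$, with the rest of the policy gradient update left unchanged.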