Can Large Reasoning Models Self-Train?
Scaling the performance of large language models (LLMs) increasingly depends on methods that reduce reliance on human supervision. Reinforcement learning from automated verification offers one alternative, but its scalability is limited by its dependence on human-designed verifiers. Self-training, where the model’s own judgment provides the supervisory signal, presents a compelling direction. We propose an online self-training reinforcement learning algorithm that leverages the model’s self-consistency to infer correctness signals and train without any ground-truth supervision. We apply the algorithm to challenging mathematical reasoning tasks and show that it quickly reaches performance levels rivaling reinforcement-learning methods trained explicitly on gold-standard answers. We also analyze inherent limitations of the algorithm, highlighting how the self-generated proxy reward, while initially correlated with correctness, can incentivize reward hacking, in which confidently incorrect outputs come to be favored.
A particularly promising opportunity for such self-improvement arises in contexts characterized by a positive generation-verification gap (Song et al., 2025), where generating correct solutions is challenging but verifying their correctness is relatively easy. This gap arises naturally in domains such as mathematics, games, and goal-based agentic tasks: producing a solution may demand complex reasoning or exploration, yet checking a candidate solution is comparatively straightforward.
Specifically, we leverage an intrinsic verification criterion, namely model self-consistency (Wang et al., 2023a), to produce supervisory rewards, capitalizing on the observation that consistency among model-generated answers correlates positively with their correctness. Conceptually similar approaches, relying either directly or indirectly on model-generated self-consistency signals, have been explored previously in mathematical domains (e.g., Huang et al. (2023); Prasad et al. (2024)), but they have typically been confined to one or a few rounds of improvement. In contrast, the key contribution of this work is to demonstrate that this intrinsic supervisory signal can continuously drive model improvement within a reinforcement learning (RL) framework (Sutton & Barto, 1998), iteratively enhancing performance without access to external labels.
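To make the reward construction concrete, below is a minimal sketch of a majority-vote self-consistency reward. The function name, the assumption that final answers are already extracted as strings, and the tie-breaking behavior are our own illustrative choices, not necessarily the exact implementation used in this work.

```python
# Minimal sketch (illustrative, not necessarily the paper's exact
# implementation): reward each sampled completion by whether its final
# answer agrees with the majority answer across the group. No
# ground-truth label is consulted anywhere.
from collections import Counter

def self_consistency_rewards(answers: list[str]) -> list[float]:
    """Return 1.0 for samples matching the group's majority answer,
    0.0 otherwise. `answers` holds the final answers extracted from a
    group of completions sampled for the same question; ties go to the
    first-encountered answer, per Counter.most_common."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]

# Example: the supervisory signal comes purely from agreement.
print(self_consistency_rewards(["42", "42", "41", "42"]))
# -> [1.0, 1.0, 0.0, 1.0]
```

Such rewards can then stand in for verifier-based rewards in any standard policy-optimization update.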
An appealing feature of our proposed approach is its reliance solely on unlabeled data. Because our method does not depend on external labels, it can also be applied naturally in a “test-time training” fashion, allowing models to iteratively boost their performance at inference time. Concurrently with our work, test-time training has also been investigated by Zuo et al. (2025). While the prospect of continual model self-improvement is appealing, it is important to recognize a significant limitation: the model’s self-generated reward serves only as a proxy for underlying correctness. Such proxy rewards can give rise to reward hacking, whereby the model learns to maximize its self-assigned reward by generating increasingly consistent but potentially incorrect answers. This unintended behavior can weaken or even reverse the initially beneficial correlation between confidence and correctness, ultimately constraining sustained improvement. Understanding the precise dynamics of this process is thus essential for further advancement.
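The collapse dynamic behind this failure mode can be seen even in a deliberately simplified toy model (our own construction, not an experiment from this work): a categorical policy over a few candidate answers, updated with REINFORCE against the majority-vote proxy reward above, concentrates its probability mass on a single answer, with no correctness signal anywhere in the loop.

```python
# Toy illustration (our construction): REINFORCE on a consistency-only
# reward drives a categorical policy to collapse onto one answer, even
# though correctness never enters the update.
import math
import random

random.seed(0)
logits = [0.0, 0.2, 0.1]         # three candidate answers; no label says which is right
LEARNING_RATE, GROUP_SIZE = 0.5, 8

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

for step in range(200):
    probs = softmax(logits)
    group = [random.choices(range(3), weights=probs)[0] for _ in range(GROUP_SIZE)]
    majority = max(set(group), key=group.count)   # group-level majority vote
    for a in group:
        reward = 1.0 if a == majority else 0.0    # consistency-only proxy reward
        for i in range(3):
            # REINFORCE: d/d logit_i of log pi(a) = 1[i == a] - probs[i]
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += LEARNING_RATE * reward * grad / GROUP_SIZE

print(softmax(logits))  # nearly all mass on one answer, chosen without any correctness signal
```

Once the policy becomes confident, the proxy reward saturates at its maximum regardless of whether the favored answer is correct, which is precisely the hacking risk discussed above.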