R-Zero: Self-Evolving Reasoning LLM from Zero Data

Paper · arXiv 2508.05004 · Published August 7, 2025
Evolution · Reinforcement Learning · Self Refinement · Self Consistency Feedback

Self-evolving Large Language Models (LLMs) offer a scalable path toward superintelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast amounts of human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles: a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving the increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks or labels.
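To make the alternating scheme concrete, here is a minimal sketch of the co-evolution loop, under the assumption that each training phase can be treated as a black-box update; the helper names (`train_challenger`, `train_solver`, `propose_tasks`) are illustrative stand-ins, not the paper's actual code.

```python
from typing import Callable

def r_zero_loop(
    train_challenger: Callable,  # GRPO update of Challenger against the frozen Solver
    train_solver: Callable,      # GRPO update of Solver on pseudo-labeled questions
    propose_tasks: Callable,     # sample questions from the frozen Challenger
    challenger,
    solver,
    num_iterations: int = 3,
):
    """One possible shape of the Challenger/Solver co-evolution loop."""
    for _ in range(num_iterations):
        # Phase 1: the Challenger is rewarded for questions near the
        # edge of the frozen Solver's current ability.
        challenger = train_challenger(challenger, solver)
        # Phase 2: the now-frozen Challenger proposes a batch of questions.
        tasks = propose_tasks(challenger)
        # Phase 3: the Solver trains on those questions, labeled by its
        # own majority votes (see the filtering sketch further below).
        solver = train_solver(solver, tasks)
    return challenger, solver
```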

To reduce dependence on human-curated data, self-generated and label-free methods have been proposed that eliminate the need for explicit supervision. In particular, label-free RL derives reward signals directly from the model's own outputs, such as sequence-level confidence scores (Li et al., 2025a; Prabhudesai et al., 2025; Huang et al., 2025) and output entropy (Agarwal et al., 2025; Cheng et al., 2025). However, despite removing the need for explicit labels, label-free methods still rely on a pre-existing corpus of tasks, which limits their scalability in truly self-evolving settings. On the other hand, self-challenging approaches train LLMs on tasks generated by the models themselves (Zhou et al., 2025; Wang et al., 2025a; Zhao et al., 2025a). While promising, many of these methods rely on external code executors to ensure that the synthesized tasks are both feasible and verifiable. In domains that lack an explicit verification oracle, however, such as open-ended reasoning, ensuring the quality and correctness of self-generated data remains a significant challenge.
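As a concrete illustration of a label-free reward of the kind cited above, an entropy-based signal can be sketched as follows; this is a generic, simplified reading of the idea, not the exact formulation of any of the cited papers.

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_reward(step_distributions: list[list[float]]) -> float:
    """Label-free reward sketch: negative mean token entropy over a
    generated sequence. Confident (low-entropy) generations score
    higher, and no ground-truth answer is required."""
    entropies = [token_entropy(p) for p in step_distributions]
    return -sum(entropies) / len(entropies)
```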

In this paper, we propose R-Zero, a framework for training reasoning LLMs that can self-evolve from zero external data. In R-Zero, a single base model is instantiated in two roles, a Challenger and a Solver, which are independently optimized but co-evolve throughout the RL process. During co-evolution, the Challenger is rewarded for generating tasks at the edge of the Solver's current abilities, while the Solver is rewarded for solving the increasingly challenging tasks posed by the Challenger. Framework details are provided in Section 3, but briefly: in the Challenger training phase, the Challenger is trained via Group Relative Policy Optimization (GRPO) (Shao et al., 2024) to generate difficult questions. The reward signal is derived from the frozen Solver's uncertainty, measured by the self-consistency of its multiple sampled answers. In the Solver training phase, the Solver is fine-tuned with GRPO on a filtered set of these challenging questions generated by the now-frozen Challenger, using pseudo-labels produced by its own majority vote. This entire process repeats, creating a self-evolving cycle that operates without any human intervention.
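The following is a compact sketch of the two reward mechanisms described here, under the assumption that the Challenger's uncertainty reward peaks when the frozen Solver's sampled answers split evenly, and that questions are kept only when majority agreement falls in an informative band; the specific reward shape and the `lo`/`hi` thresholds are illustrative assumptions, not verbatim from the paper.

```python
from collections import Counter

def self_consistency(answers: list[str]) -> tuple[str, float]:
    """Majority-vote pseudo-label and its empirical agreement rate
    over the Solver's sampled answers to one question."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

def challenger_reward(answers: list[str]) -> float:
    """Uncertainty reward for one generated question: maximal (1.0)
    when the Solver's answers split 50/50, i.e. the question sits at
    the edge of its ability; zero when the Solver is fully consistent.
    The exact shape is an assumption based on the paper's description."""
    _, agreement = self_consistency(answers)
    return 1.0 - 2.0 * abs(agreement - 0.5)

def build_solver_batch(questions, solver_answers, lo=0.3, hi=0.8):
    """Keep questions whose majority agreement lies in an informative
    band (illustrative thresholds) and attach the majority answer as
    the pseudo-label for GRPO fine-tuning of the Solver."""
    batch = []
    for question, answers in zip(questions, solver_answers):
        label, agreement = self_consistency(answers)
        if lo <= agreement <= hi:
            batch.append((question, label))
    return batch
```

Under this shaping, a question the Solver answers identically every time earns the Challenger no reward, which pushes the generated curriculum toward the frontier of the Solver's competence rather than toward trivially easy or unanswerable questions.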