RL with Verifiable Rewards (RLVR)
Related topics:
- 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models
  Therefore, several replication studies have explored strategies for efficiently creating training datasets by leveraging open-source data and powerful models. In this subsection, we introduce the data…
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
  “What to scale” refers to the specific form of TTS that is expanded or adjusted to enhance an LLM’s performance during inference. When applying TTS, researchers typically choose a sp…
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data
  Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR wo…
- Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning
  Large language models (LLMs) demonstrate remarkable reasoning capabilities in tasks such as algorithmic coding and mathematical problem-solving. Recent methods have improved reasoning through expanded…
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
  Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well…
- Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
  A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-le…
- Can Large Language Models Capture Human Annotator Disagreements?
  Human annotation variation (i.e., annotation disagreements) is common in NLP and often reflects important information such as task subjectivity and sample ambiguity. While Large Language Models (LLMs)…
- Checklists Are Better Than Reward Models For Aligning Language Models
  Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this – typically using fixed criteria such as “helpfulness” and “harmful…
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
  However, its extension to broader, less structured domains remains unexplored. In this work, we investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicin…
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
  Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly in mathematics and…
- Escaping the Verifier: Learning to Reason via Demonstrations
  Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite off…
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
  Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLV…
- GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning
  Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs), particularly in the domain of c…
- GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
  Recent advancements in Large Language Models (LLMs) have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face…
- Learning to Reason for Factuality
  Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning c…
- Learning to Reason without External Rewards
  Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We ex…
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
  Reinforcement Learning with Verifiable Rewards (RLVR)-based post-training of Large Language Models (LLMs) has been shown to improve accuracy on reasoning tasks and continues to attract significant att…
- Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
  Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications.…
- Post-Training Large Language Models via Reinforcement Learning from Self-Feedback
  Large Language Models (LLMs) often produce plausible but poorly-calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF…
- Process Reward Models That Think
  Step-by-step verifiers, also known as process reward models (PRMs), are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to bui…
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
  Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent cap…
- RLP: Reinforcement as a Pretraining Objective
  The dominant paradigm for training large reasoning models starts with pre-training using next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful in scaling reasoning…
- RLPR: Extrapolating RLVR to General Domains without Verifiers
  Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical an…
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
  The development of autonomous agents for complex, long-horizon tasks is a central goal in AI. However, dominant training paradigms face a critical limitation: reinforcement learning (RL) methods that…
- Reinforced Language Models for Sequential Decision Making
  Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a ne…
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
  We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the mathematical reasoning capabilities of large language models (LLMs…
- Reinforcement Learning with Rubric Anchors
  Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI’s o-series. In RLVR, rewards a…
- Reinforcement Pre-Training
  In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a rea…
- Reinforcing General Reasoning without Verifiers
  The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and ma…
- Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
  Extending Reinforcement Learning with Verifiable Rewards (RLVR) to real-world tasks often requires balancing objective and subjective evaluation criteria. However, many such tasks lack a single, unamb…
- Spurious Rewards: Rethinking Training Signals in RLVR
  We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlat…
- StepWiser: Stepwise Generative Judges for Wiser Reasoning
  As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Proces…
- Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
  Self-Rewarding Language Models propose an architecture in which the Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improvi…
- Tulu 3: Pushing Frontiers in Open Language Model Post-Training
  Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary o…
- rStar2-Agent: Agentic Reasoning Technical Report
  We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognit…
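
Common to many of the works above is the core RLVR recipe: sample rollouts from the policy, score each one with a cheap programmatic verifier rather than a learned reward model, and feed the resulting outcome reward into a policy-gradient update. Below is a minimal sketch of such a verifier for math-style tasks. The function names, the `\boxed{...}` answer convention as the extraction target, and the whitespace-only normalization are illustrative assumptions, not the implementation from any specific paper listed here; real verifiers usually add symbolic or numeric equivalence checks.

```python
import re


def extract_final_answer(response: str) -> str | None:
    r"""Pull the last \boxed{...} expression from a model response.

    The boxed-answer convention is a common target for math RLVR,
    but it is an assumption here, not universal across the papers above.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def verifiable_reward(response: str, gold_answer: str) -> float:
    """Binary outcome reward: 1.0 iff the extracted final answer matches
    the reference after whitespace normalization, else 0.0."""
    predicted = extract_final_answer(response)
    return float(predicted is not None and predicted == gold_answer.strip())


# Each rollout's scalar reward would feed a policy-gradient update
# (e.g., a PPO- or GRPO-style step) in an actual RLVR training loop.
print(verifiable_reward(r"... therefore the result is \boxed{42}", "42"))  # 1.0
print(verifiable_reward("I believe the answer is 41.", "42"))              # 0.0
```

Keeping the verifier purely rule-based is what makes the reward "verifiable": it is deterministic, cheap, and harder to reward-hack than a learned judge. Several of the entries above probe exactly where this recipe breaks down, whether in unverifiable domains (rubric- and checklist-based rewards, verifier-free methods) or in the training signal itself (spurious rewards, exploration dynamics).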