RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

Paper · arXiv 2508.00222 · Published July 31, 2025
Tags: RLVR · Reinforcement Learning · Reward Models · Evolution

Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, because its inherently on-policy strategy operates in the LLM's immense action space under sparse rewards. Moreover, RLVR can lead to capability boundary collapse, narrowing the LLM's problem-solving scope. To address this problem, we propose RL-PLUS, a novel approach that synergizes internal exploitation (i.e., Thinking) with external data (i.e., Learning) to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components: Multiple Importance Sampling, to address the distributional mismatch introduced by external data, and an Exploration-Based Advantage Function, to guide the model toward high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.

RLVR optimizes LLMs' performance via a reinforcement learning process guided by verifiable reward computation, e.g., determining whether an output matches a ground-truth math answer or passes unit tests for coding. This method enables LLMs to scale their computation at test time by extending Chain-of-Thought (CoT) processes and to spontaneously exhibit sophisticated cognitive behaviors such as reflection and exploration. In traditional RL, such as AlphaGo (Silver et al., 2016) and AlphaZero (Silver et al., 2017), agents can explore new strategies autonomously, improving themselves to the point of even surpassing human-level performance (Silver et al., 2016; Mnih et al., 2015). Similarly, RLVR is believed to be a promising way for LLMs to achieve continuous self-evolution toward more powerful AI (Guo et al., 2025).
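
To make "verifiable reward" concrete, the reward for a math task can be as simple as an exact-match check against the ground-truth answer. The sketch below is a minimal illustration; the boxed-answer convention, the function name, and the `reference_answer` argument are assumptions for this example, not a specific implementation from the paper.

```python
import re

def verifiable_math_reward(completion: str, reference_answer: str) -> float:
    """Minimal sketch of a verifiable reward: 1.0 if the model's final
    boxed answer exactly matches the ground truth, 0.0 otherwise."""
    # Take the last \boxed{...} expression as the model's final answer.
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0
    predicted = matches[-1].strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

# A correct chain of thought ending in \boxed{42} earns reward 1.0.
print(verifiable_math_reward(r"... therefore the answer is \boxed{42}", "42"))
```

Coding tasks follow the same pattern, with the exact-match check replaced by running the generated program against unit tests.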

This failure to move beyond the base model's capability boundary stems from a fundamental challenge when applying RLVR to LLMs: the potential solution space of LLMs is so immense, and correct solutions within it so sparse, that current RLVR techniques cannot effectively guide the model to explore new and unknown pathways, i.e., outward exploration. The challenge is particularly acute in long reasoning tasks, where the reward is contingent on successfully completing the entire inferential chain: a single erroneous step can nullify the reward for the whole trajectory, providing no positive signal for acquiring new knowledge. Consequently, the model is compelled to focus on inward exploitation, refining and optimizing the knowledge and reasoning methods it already possesses. The result is a contraction of the model's exploratory range and a shrinking of its capabilities: not only is the model prevented from acquiring information or abilities beyond its base model, but sustained improvement of its overall performance is also significantly impeded.
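
A back-of-the-envelope illustration (assuming, for simplicity, independent per-step correctness) shows why trajectory-level rewards become so sparse on long chains:

$$
\Pr[\text{reward} = 1] \;=\; \prod_{t=1}^{T} p_t \;\approx\; 0.9^{50} \approx 0.5\%,
$$

i.e., even a model that gets each individual step right 90% of the time almost never receives a positive signal on a 50-step chain it has not already mastered.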

Drawing inspiration from educational philosophy: "If one learns from others but does not think, one will be bewildered. If, on the other hand, one thinks but does not learn from others, one will be in peril" (The Analects of Confucius). Current RLVR can be viewed as the latter case: it excels at "thinking" through inward exploitation but is inadequate at "learning" through outward exploration. Conversely, approaches such as Supervised Fine-Tuning (SFT) represent the former case, focusing on imitating solutions rather than acquiring the underlying reasoning.

This motivates us to augment RLVR with effective external learning, but two key challenges must be addressed. First, a distributional mismatch between the model's policy and the external data source is inevitable, and standard importance-sampling corrections from RL are inadequate: treating the external data as on-policy introduces systematic bias, whereas a direct off-policy correction suffers from high variance when the distributions diverge significantly. Second, valuable information must be extracted efficiently from the external data. Models naturally favor high-probability tokens, thereby reinforcing existing knowledge, yet the key to discovering novel reasoning often lies in the low-probability tokens that the model would otherwise ignore.
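
For reference, the standard single-distribution importance-sampling estimator (written here in generic notation, not the paper's) makes the first dilemma explicit:

$$
\hat{J}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} \frac{\pi_\theta(\tau_i)}{\pi_\beta(\tau_i)}\, R(\tau_i), \qquad \tau_i \sim \pi_\beta ,
$$

where $\pi_\theta$ is the current policy and $\pi_\beta$ the distribution that produced the external data. Dropping the ratio $\pi_\theta/\pi_\beta$ (treating the data as on-policy) biases the estimate, while keeping it lets the variance explode, since the trajectory-level ratio is a product of per-token ratios that can grow or shrink exponentially with sequence length.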

In this paper, we propose RL-PLUS, a novel approach designed to synergize external data ("Learning") with internal exploitation ("Thinking") during the reinforcement learning process. RL-PLUS has two core techniques. ❶ To resolve the distributional mismatch, we employ Multiple Importance Sampling, which provides a low-variance, unbiased estimate of rewards by combining information from multiple policies, effectively balancing the trade-off between bias and variance. ❷ To promote the discovery of new knowledge, we introduce an Exploration-Based Advantage Function, which reshapes the learning objective by up-weighting the advantages of reasoning paths that are correct but hard to explore (i.e., low probability) under the current policy. This explicitly incentivizes the model to explore and learn from valuable information it would typically overlook. We also provide a theoretical analysis demonstrating that, when leveraging external data, our approach achieves a better bias-variance trade-off than the existing state-of-the-art (SOTA) RLVR methods.
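
The two ideas can be illustrated with a toy sketch. The balance-heuristic MIS weight and the probability-based advantage multiplier below are generic illustrations of these concepts, not the paper's exact formulation; the function names, the mixing fraction `frac_onpolicy`, and the exponent `gamma` are assumptions introduced only for this example.

```python
import math

def mis_weight(logp_cur: float, logp_ext: float, frac_onpolicy: float = 0.5) -> float:
    """Balance-heuristic multiple-importance-sampling weight for one trajectory.

    logp_cur      : trajectory log-prob under the current policy pi_theta
    logp_ext      : trajectory log-prob under the external data distribution
                    (e.g., an expert/SFT policy; a hypothetical quantity here)
    frac_onpolicy : fraction of the batch sampled from pi_theta

    Unlike the plain off-policy ratio pi_theta / pi_ext, this weight is bounded
    by 1 / frac_onpolicy, which keeps the variance low even when the two
    distributions diverge sharply.
    """
    p_cur = math.exp(logp_cur)
    p_ext = math.exp(logp_ext)
    mixture = frac_onpolicy * p_cur + (1.0 - frac_onpolicy) * p_ext
    return p_cur / max(mixture, 1e-300)

def exploration_advantage(advantage: float, avg_token_logp: float, gamma: float = 0.5) -> float:
    """Generic advantage reshaping: correct trajectories (positive advantage) that
    the current policy assigns low probability keep almost their full advantage,
    while already-likely trajectories are down-weighted, nudging the model toward
    under-explored reasoning paths."""
    if advantage <= 0.0:
        return advantage
    prob = math.exp(avg_token_logp)  # geometric-mean per-token probability
    return advantage * (1.0 - prob) ** gamma

# Toy comparison: for a trajectory likely under pi_theta but very unlikely under
# the external distribution, the plain IS ratio would be exp(55) ~ 7.7e23, while
# the MIS weight stays bounded at 1 / frac_onpolicy = 2.
print(mis_weight(logp_cur=-5.0, logp_ext=-60.0))                             # ~2.0
print(exploration_advantage(advantage=1.0, avg_token_logp=math.log(0.9)))    # down-weighted
print(exploration_advantage(advantage=1.0, avg_token_logp=math.log(0.05)))   # nearly preserved
```

The design intent mirrors the two components above: the mixture denominator tames the variance of off-policy corrections on external data, and the multiplier shifts learning signal toward correct paths the current policy rarely produces.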