The Invisible Leash: Why RLVR May Not Escape Its Origin
Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI capabilities, particularly for complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoning boundary or merely sharpens precision by amplifying high-reward outputs that the base model already knows. This study presents a theoretical and empirical investigation into the potential limits of RLVR. First, we offer a new theoretical perspective: RLVR is constrained by the base model's support (it cannot sample solutions to which the base model assigns zero probability) and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy–reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and overlook correct yet underrepresented solutions. Extensive experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs its expansion under larger sampling budgets, so the trained model fails to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, producing greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
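To make the support argument concrete, the following is a minimal formalization under the reweighting view sketched above; the notation (\(\pi_0\) for the base policy, \(\pi_{\mathrm{RL}}\) for the RLVR-trained policy, \(w \ge 0\) for a reward-dependent weight, and \(Z\) for the normalizer) is illustrative rather than a fixed definition:
\[
\pi_{\mathrm{RL}}(y \mid x) \;=\; \frac{\pi_0(y \mid x)\, w(x, y)}{Z(x)}, \qquad Z(x) \;=\; \sum_{y'} \pi_0(y' \mid x)\, w(x, y').
\]
Since \(\pi_0(y \mid x) = 0\) implies \(\pi_{\mathrm{RL}}(y \mid x) = 0\), we have \(\operatorname{supp}\bigl(\pi_{\mathrm{RL}}(\cdot \mid x)\bigr) \subseteq \operatorname{supp}\bigl(\pi_0(\cdot \mid x)\bigr)\): reweighting can redistribute probability mass over solutions the base model can already produce, but it cannot create mass on solutions outside the base model's support.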
Recent studies offer divergent perspectives on this question. On the one hand, several works (Yue et al., 2025a; Zhao et al., 2025b; Shah et al., 2025; Ma et al., 2025; He et al., 2025) highlight a paradoxical failure mode: while RLVR-trained models outperform their base models on pass@k at low sampling budgets (e.g., k = 1), base models achieve higher pass@k scores when k is large, suggesting a narrowing of the reasoning horizon after RLVR training. Some even report that RLVR-trained models benefit from seemingly random or spurious reward signals (Shao et al., 2025), raising questions about whether the observed improvements genuinely reflect enhanced reasoning. On the other hand, Liu et al. (2025) report that previous studies focused primarily on specific domains such as math, where base models may have been over-trained, which can lead to premature termination of RLVR unless the level of entropy is carefully controlled. They demonstrate that RLVR can expand the reasoning horizon considerably on domains where the base models struggle, such as Reasoning Gym, with marked improvements in pass@k at large k.
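As a concrete reference for the pass@k metric used in these comparisons, the sketch below gives the commonly used unbiased estimator 1 - C(n-c, k)/C(n, k), computed from n sampled completions of which c are verified correct; the function name and the example numbers are ours for illustration, not a claim about the exact evaluation protocol of the works cited above.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions drawn
    (without replacement) from n samples, c of which are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example (hypothetical numbers): with 64 samples and only 4 correct,
# pass@1 is low but pass@32 is already high, illustrating how a base model
# with broad but imprecise coverage can overtake an RLVR model at large k.
print(pass_at_k(64, 4, 1))   # 0.0625
print(pass_at_k(64, 4, 32))  # ~0.94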