Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
A prevailing view in Reinforcement Learning with Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that the perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that, at the hidden-state level, exploration and exploitation can be decoupled (Sec. 4). This finding exposes an opportunity to enhance both capacities simultaneously, and it motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to a 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.
A dominant narrative emerging from these recent works (Chen et al., 2025b; Yue et al., 2025; Deng et al., 2025) interprets this progress through the lens of balancing exploration (searching for diverse reasoning paths) against exploitation (refining the most promising known strategies). However, this paradigm is almost exclusively rooted in token-level analysis, where exploration is captured by high-entropy token distributions and exploitation by high-confidence, low-entropy ones. This framing has inevitably led to the widespread assumption of an inherent trade-off between the two, since a model's output distribution cannot be simultaneously uniform and sharp (Agarwal et al., 2025).
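To make the token-level framing concrete, below is a minimal sketch, written by us for illustration (not drawn from the cited works), of the per-step entropy metric on which this paradigm rests:

```python
import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of each next-token distribution.

    logits: (seq_len, vocab_size) pre-softmax scores from the policy.
    Returns: (seq_len,) per-step entropies.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)
```

Under this lens, high entropy is read as "exploration" and low entropy as "exploitation"; since a single distribution cannot be both uniform (maximal entropy) and sharp (minimal entropy), the trade-off appears inescapable at this measurement level.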
This token-centric viewpoint, while convenient, introduces significant limitations. Equating exploration with token-level entropy faces an intrinsic dilemma (Fu et al., 2025; Qiao et al., 2025; Agarwal et al., 2025): excessively high entropy risks generating incoherent noise, while overly low entropy stifles the very exploration it aims to encourage. Similarly, defining exploitation via hand-crafted heuristic rewards (Chen et al., 2025a; Li et al., 2025a; Bensal et al., 2025) produces brittle models with poor generalizability, as they simply learn to chase surface-level proxies. We discuss further related work in Sec. 7. While many works do consider both exploration and exploitation (Fig. 1a), their continued reliance on token-level metrics invariably traps them in a cycle of "balancing" the trade-off rather than questioning its existence. This raises a critical question: Is the exploration-exploitation trade-off intrinsic to reasoning, or merely an artifact of token-level measurement?
To answer this, we move beyond token-level statistics and investigate exploration and exploitation at the semantically richer hidden-state level. To analyze these dynamics, we are the first to apply Effective Rank (ER) in an RL context, using it to quantify exploration by measuring the semantic diversity of hidden-state representations. To capture exploitation, which we define as the efficient gain of information along a reasoning path, we propose two novel derivatives of ER: Effective Rank Velocity (ERV), the first-order change, measures the velocity of this information gain, while Effective Rank Acceleration (ERA), the second-order change, captures the trend of that velocity, indicating whether reasoning is accelerating or saturating. Equipped with these tools, we uncover a striking result: at the hidden-state level, exploration and exploitation show near-zero correlation (Fig. 1b, bottom). This contrast provides strong evidence that the trade-off is not an inherent property of RLVR for reasoning but an artifact of biased token-level measurement. It reveals that these two capacities are not antagonistic but can, in fact, be decoupled and enhanced simultaneously (Fig. 1c). Furthermore, by grouping questions by correctness score, Fig. 2 shows that the relationship between exploration and exploitation exhibits a consistent pattern regardless of modeling granularity (token level vs. hidden-state level). This result underscores the effectiveness of our proposed metrics for modeling exploration and exploitation.
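For concreteness, the following is a minimal sketch of these three metrics. It assumes the standard Roy-Vetterli effective rank (the exponential of the Shannon entropy of the normalized singular values) and simple finite differences over reasoning segments; the formal definitions appear in Sec. 3, and details such as segmentation and normalization may differ there.

```python
import torch

def effective_rank(H: torch.Tensor) -> torch.Tensor:
    """Effective Rank (ER) of a hidden-state matrix.

    H: (num_tokens, hidden_dim) hidden states of one reasoning segment.
    Returns exp(entropy of normalized singular values),
    a scalar in [1, min(H.shape)].
    """
    s = torch.linalg.svdvals(H.float())           # singular values
    p = s / s.sum()                               # normalize to a distribution
    entropy = -(p * torch.log(p + 1e-12)).sum()   # entropy of the spectrum
    return torch.exp(entropy)

def er_dynamics(segments: list[torch.Tensor]):
    """ER per segment, plus ERV (first difference) and ERA (second difference)."""
    er = torch.stack([effective_rank(H) for H in segments])
    erv = er[1:] - er[:-1]    # velocity of information gain along the path
    era = erv[1:] - erv[:-1]  # accelerating (> 0) vs. saturating (< 0)
    return er, erv, era
```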
Building on this core insight, we propose Velocity-Exploiting Rank-Learning (VERL), a method that moves beyond the trade-off between the two capacities by directly shaping the RL advantage function using ER and ERV. Rather than acting as a switch between the two capacities in a low-dimensional token space, VERL functions as a tuner that synergistically enhances both capacities in the high-dimensional hidden-state space. Its key innovation is leveraging ERA as a meta-control variable, a choice justified by our theoretical proof of its remarkable O(1) growth stability (Sec. 3). Specifically, VERL uses ERA as a dynamic signal to create a synergistic, dual-channel incentive structure: instead of switching between modes, it prospectively shapes the reward to encourage exploration (via ER) to preempt overconfidence, while reinforcing exploitative gains (via ERV) to consolidate the reasoning path. ERA's unique stability makes it a robust signal to guide training, allowing VERL to encourage exploration from productive-potential states while preventing overfitting to local optima. As a result, VERL delivers significant performance gains across diverse models and tasks, achieving up to a 21.4% absolute accuracy improvement on the challenging Gaokao 2024 benchmark.
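As an illustration only, the hypothetical sketch below shows one way ERA could gate a dual-channel advantage bonus; the sigmoid gate and the coefficients alpha and beta are our illustrative assumptions, not VERL's exact formulation (given in Sec. 3).

```python
import torch

def shaped_advantage(adv: torch.Tensor,
                     er: torch.Tensor, erv: torch.Tensor, era: torch.Tensor,
                     alpha: float = 0.1, beta: float = 0.1) -> torch.Tensor:
    """Add ERA-gated exploration (ER) and exploitation (ERV) bonuses to a
    baseline advantage estimate; all inputs broadcast to adv's shape.
    """
    gate = torch.sigmoid(-era)            # ERA < 0 (saturating reasoning)
                                          # pushes weight toward exploration
    explore = alpha * gate * er           # reward representational diversity
    exploit = beta * (1.0 - gate) * erv   # reinforce genuine information gain
    return adv + explore + exploit
```

The intent of the gating is that both channels remain active at every step: ERA only redistributes emphasis between them, so neither capacity is ever switched off, which matches the tuner-not-switch behavior described above.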