Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Paper · arXiv 2506.01939 · Published June 2, 2025
Tags: RLVR · MechInterp · Reinforcement Learning

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), yet its mechanisms are not well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding that goes even beyond the 80/20 rule: using only 20% of the tokens maintains performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpasses full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and to optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.
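To make the core recipe concrete, here is a minimal sketch of what "restricting policy gradient updates to forking tokens" can look like: a REINFORCE-style surrogate loss masked so that only the top-20% highest-entropy tokens receive gradients. This is our own PyTorch illustration, not the paper's implementation; the function name `forking_token_pg_loss`, the batch-wide thresholding, and the per-token advantage shape are all assumptions.

```python
# Illustrative sketch (assumed names/shapes), not the paper's code.
import torch
import torch.nn.functional as F

def forking_token_pg_loss(logits, actions, advantages, top_rho=0.2):
    """Policy-gradient loss restricted to high-entropy ("forking") tokens.

    logits:     (batch, seq_len, vocab) pre-softmax policy outputs.
    actions:    (batch, seq_len) sampled token ids.
    advantages: (batch, seq_len) per-token advantage estimates.
    top_rho:    fraction of tokens (ranked by entropy) that get gradients.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Per-token entropy H_t = -sum_v p_t(v) log p_t(v); detached so the
    # mask itself does not receive gradients.
    entropy = -(probs * log_probs).sum(dim=-1).detach()     # (batch, seq_len)

    # Keep only the top-rho fraction of tokens by entropy across the batch.
    k = max(1, int(top_rho * entropy.numel()))
    threshold = entropy.flatten().topk(k).values.min()
    mask = (entropy >= threshold).float()                   # 1 at forking tokens

    # Standard REINFORCE-style surrogate, zeroed on low-entropy tokens.
    chosen_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    loss = -(mask * advantages * chosen_logp).sum() / mask.sum().clamp(min=1)
    return loss
```

A real RLVR pipeline would also handle padding masks and compute advantages from verifiable rewards; the point here is only the entropy-based gradient mask.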

In this paper, we analyze the underlying mechanisms of RLVR through the innovative lens of token entropy patterns, investigating how tokens with varying entropy impact reasoning performance. We first point out that in the Chain-of-Thought (CoT) processes of LLMs, the entropy distribution exhibits a distinct pattern: the majority of tokens are generated with low entropy, while a critical minority of tokens emerge with high entropy. By comparing the textual meanings of these two groups of tokens, we observe that the tokens with the lowest average entropy primarily complete ongoing linguistic structures, while the tokens with the highest average entropy function as pivotal decision points that determine the trajectory of reasoning among multiple potential pathways (referred to as forks), as depicted in Figure 1(a). In addition to this qualitative analysis, we conduct controlled experiments that manually modulate the entropy of forking tokens during decoding. Quantitative results reveal that moderately increasing the entropy of these high-entropy forking tokens leads to measurable improvements in reasoning performance, while artificially reducing their entropy degrades performance, confirming both the importance of keeping these tokens at high entropy and their role as "forks". Furthermore, by analyzing the evolution of token entropy during RLVR training, we find that the reasoning model largely retains the entropy patterns of the base model, exhibiting only gradual and relatively minor changes as training progresses. Moreover, RLVR primarily changes the entropy of high-entropy tokens, while the entropy of low-entropy tokens varies only within a small range. These observations highlight the critical role that high-entropy minority tokens may play in CoTs and RLVR training.
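The decoding-time intervention can be pictured as follows: at each step, measure the entropy of the next-token distribution and, whenever it crosses a threshold (a candidate fork), sample with an adjusted temperature so that entropy is raised or lowered only at those positions. The sketch below is a hedged illustration under stated assumptions: it presumes a HuggingFace-style causal LM whose forward pass returns `.logits`, and `entropy_threshold` and `fork_temp` are placeholder values, not the paper's settings.

```python
# Illustrative sketch; assumes a HuggingFace-style causal LM interface.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_with_fork_temperature(model, input_ids, max_new_tokens=256,
                                 entropy_threshold=1.0, fork_temp=1.2):
    """Entropy-aware sampling: positions whose next-token entropy exceeds
    `entropy_threshold` are sampled at `fork_temp` (>1 raises entropy,
    <1 lowers it); all other positions use temperature 1.0."""
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]          # (batch, vocab)
        log_p = F.log_softmax(logits, dim=-1)
        entropy = -(log_p.exp() * log_p).sum(dim=-1)        # (batch,)

        # Adjust temperature only at high-entropy (forking) positions.
        temp = torch.where(entropy > entropy_threshold,
                           torch.full_like(entropy, fork_temp),
                           torch.ones_like(entropy))
        probs = F.softmax(logits / temp.unsqueeze(-1), dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)  # (batch, 1)
        input_ids = torch.cat([input_ids, next_tok], dim=-1)
    return input_ids
```

Setting `fork_temp` above 1 mimics the "moderately increase entropy at forks" condition, while setting it below 1 mimics the entropy-reduction condition that degrades performance.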