Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Do only 20 percent of tokens actually matter for reasoning?

Chain-of-thought reasoning might depend on a small minority of high-entropy tokens that act as decision points. If true, could training that focuses only on these critical tokens match or exceed full-gradient updates?

Note · 2026-02-22 · sourced from RLVR
How should researchers navigate LLM reasoning research? What does reward learning actually do to model reasoning?

In Chain-of-Thought reasoning, token entropy follows a distinct pattern: the vast majority of tokens are generated with low entropy (completing ongoing linguistic structures), while a critical minority emerges with high entropy (functioning as pivotal decision points that determine the trajectory among multiple possible pathways). These high-entropy "forking tokens" are where the model actually decides between reasoning directions.
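A minimal sketch of how such tokens could be identified, assuming access to the model's per-step logits (the function names and the 20% cutoff parameterization are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of the next-token distribution at each step.

    logits: (seq_len, vocab_size) raw model scores for one rollout.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def forking_token_mask(logits: torch.Tensor, top_frac: float = 0.2) -> torch.Tensor:
    """Boolean mask over positions: True for the top_frac highest-entropy tokens."""
    ent = token_entropies(logits)
    k = max(1, int(top_frac * ent.numel()))
    threshold = ent.topk(k).values.min()  # entropy of the k-th most uncertain position
    return ent >= threshold
```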

Three converging findings establish their primacy:

Causal role confirmed by intervention. Moderately increasing the entropy of forking tokens during decoding measurably improves reasoning performance; artificially reducing their entropy degrades it. The tokens are not just correlated with reasoning quality — they causally determine it.
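One way to picture this intervention is temperature modulation at decode time: sample high-entropy steps slightly hotter and leave the rest alone. The sketch below illustrates the idea but is not the paper's exact protocol; the threshold and temperatures are made-up values, a Hugging-Face-style `model(input_ids).logits` interface is assumed, and `token_entropies` comes from the sketch above.

```python
@torch.no_grad()
def decode_with_forking_boost(model, input_ids, max_new_tokens=64,
                              entropy_threshold=1.5, boost_temp=1.4, base_temp=1.0):
    """Sampling loop that raises temperature only at high-entropy (forking) steps."""
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[0, -1]            # (vocab_size,)
        ent = token_entropies(logits.unsqueeze(0)).item()
        # Hotter sampling only where the model is genuinely uncertain.
        temp = boost_temp if ent > entropy_threshold else base_temp
        probs = torch.softmax(logits / temp, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
    return input_ids
```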

RLVR primarily operates on forking tokens. Analysis of entropy evolution during RLVR training shows the reasoning model largely retains the base model's entropy patterns, with only gradual changes. Critically, RLVR mainly adjusts the entropy of the already-high-entropy tokens, while low-entropy tokens barely move. The training signal is concentrated where it matters.
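This claim can be checked on any base/RLVR model pair by scoring the same trace under both models and looking at where the entropy deltas concentrate. A sketch, assuming the two models share a tokenizer and expose `.logits`:

```python
@torch.no_grad()
def entropy_shift(base_model, rl_model, input_ids):
    """Per-position entropy delta between an RLVR-tuned model and its base.

    If the finding holds, |delta| should concentrate at positions where the
    base model's entropy is already high, and stay near zero elsewhere.
    """
    ent_base = token_entropies(base_model(input_ids).logits[0])
    ent_rl = token_entropies(rl_model(input_ids).logits[0])
    return ent_base, ent_rl - ent_base
```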

Sparse training matches or exceeds full training. Restricting policy-gradient updates to the 20% highest-entropy tokens matches full-gradient performance on Qwen3-8B and significantly surpasses it on Qwen3-32B (+11.04 on AIME'25) and Qwen3-14B (+4.79 on AIME'25). Training on the 80% lowest-entropy tokens instead leads to a marked decline. This "beyond the 80/20 rule" result shows the minority carries the learning signal.
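In training terms, the restriction amounts to masking the policy-gradient loss so that only forking-token positions contribute gradient. The sketch below uses a simplified REINFORCE-style surrogate; the paper's actual objective is a clipped PPO/GRPO-family loss with batch-level entropy thresholds, so treat this as the shape of the idea rather than the implementation. It reuses `forking_token_mask` from the first sketch.

```python
def sparse_pg_loss(logits, actions, advantages, top_frac=0.2):
    """Policy-gradient loss restricted to the highest-entropy fraction of tokens.

    logits:     (seq_len, vocab_size) policy logits for one sampled rollout
    actions:    (seq_len,) sampled token ids
    advantages: (seq_len,) advantages derived from the verifiable reward
    """
    log_probs = F.log_softmax(logits, dim=-1)
    act_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # Mask computed on detached logits: entropy selects positions but
    # receives no gradient itself.
    mask = forking_token_mask(logits.detach(), top_frac=top_frac).float()
    return -(mask * advantages * act_logp).sum() / mask.sum().clamp(min=1.0)
```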

Connecting to Does reinforcement learning update only a small fraction of parameters?, there is a striking parallel: RL operates on sparse critical subsets at both the parameter level (5-30% of parameters) and the token level (20% of tokens). The sparsity is not a limitation but a feature: it concentrates the learning signal where it has leverage.

Relative to Which sentences actually steer a reasoning trace?, forking tokens are the token-level mechanistic correlate of thought anchors. Both identify critical decision points in reasoning, but at different granularities: thought anchors at the sentence level, forking tokens at the individual token level.


Source: RLVR


High-entropy minority tokens are the critical forking points that drive RLVR effectiveness — restricting gradient updates to 20 percent of tokens matches or exceeds full updates.