Why do high-entropy tokens carry most of the learning signal in RL?

This explores why, when reinforcement learning trains a reasoning model, the gradient seems to concentrate on the small set of tokens where the model is most uncertain — and what that tells us about where 'reasoning' actually lives.

The cleanest answer in the corpus is that only about 20% of tokens are high-entropy 'forking points', moments where the model is genuinely deciding which way the reasoning goes, and RLVR mostly adjusts exactly those. Training on that 20% alone matches or beats updating on every token (see: Do high-entropy tokens drive reasoning model improvements?). The intuition: most tokens are low-entropy filler the model would emit no matter what, so there is nothing to learn there. The branch points are where a different choice changes the outcome, so that's where reward has leverage.
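To make the 'train on the forking 20%' idea concrete, here is a minimal sketch in plain Python. It assumes per-position logits are available, ranks positions by Shannon entropy within a single sequence, and averages a REINFORCE-style loss over the kept positions only. The function names, the per-sequence ranking, and the toy logits are illustrative assumptions, not the procedure used in the cited work.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]      # shift by max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_mask(entropies, keep_fraction=0.2):
    """Mark the top `keep_fraction` of positions by entropy (ties: earlier position first)."""
    k = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]

def masked_pg_loss(log_probs, advantages, mask):
    """REINFORCE-style loss averaged over the kept positions only."""
    kept = [-(lp * adv) for lp, adv, m in zip(log_probs, advantages, mask) if m]
    return sum(kept) / len(kept)

# Toy sequence (made up): 10 positions over a 4-token vocabulary. Positions 3
# and 7 are near-uniform (forks); the rest are sharply peaked (filler).
peaked = [10.0, 0.0, 0.0, 0.0]
flat = [0.1, 0.0, 0.0, 0.1]
sequence_logits = [flat if i in (3, 7) else peaked for i in range(10)]
entropies = [token_entropy(l) for l in sequence_logits]
mask = forking_mask(entropies, keep_fraction=0.2)
```

In this toy case the mask keeps only the two near-uniform positions, so the peaked filler tokens contribute nothing to the loss. A real setup might set the entropy threshold over a whole batch rather than per sequence; that choice is left open here.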

There's a deeper, almost paradoxical reason this matters: entropy is also the thing RL tends to destroy. Policy entropy reliably collapses during training, and performance follows a predictable curve that saturates as entropy approaches zero: once the model stops being uncertain, it stops being able to improve (see: Does policy entropy collapse limit reasoning performance in RL?). So high-entropy tokens are doing double duty: they're where the signal is *and* the resource that gets spent down. That reframes a lot of RL tuning (Clip-Cov, KL-Cov and similar) as deliberately protecting entropy at exactly the tokens that carry the learning. Spend it too fast and the model converges to a narrow strategy with nothing left to explore.
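The source note for this claim states the curve as R = -a·exp(H) + b, with R the performance and H the policy entropy. A rough sketch of how such a fit could be used: regress R on exp(H) across checkpoints, then read off the predicted ceiling at H = 0, which is b - a. The checkpoint values below are synthetic, generated from made-up coefficients purely to exercise the code, and the function names are hypothetical.

```python
import math

def fit_entropy_law(entropies, rewards):
    """Least-squares fit of R = -a * exp(H) + b; returns (a, b)."""
    xs = [math.exp(h) for h in entropies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rewards) / n
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rewards))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var                 # slope of R against exp(H), equal to -a
    a = -slope
    b = mean_r - slope * mean_x       # intercept of the fitted line
    return a, b

def predicted_ceiling(a, b):
    """Performance the fitted law predicts at zero entropy, where exp(0) = 1."""
    return b - a

# Synthetic checkpoints generated from a = 0.2, b = 0.9 (illustrative only).
true_a, true_b = 0.2, 0.9
entropies = [1.0, 0.6, 0.3, 0.1, 0.02]
rewards = [-true_a * math.exp(h) + true_b for h in entropies]
a, b = fit_entropy_law(entropies, rewards)
ceiling = predicted_ceiling(a, b)
```

Read this way, the remaining entropy is a budget: under the fitted law, the gap between current performance and b - a is all that further entropy reduction can buy.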

What's striking is that this same 'narrow region does the work' pattern shows up at every level of the model, not just at the token level. Across seven algorithms and ten model families, RL updates only 5–30% of parameters, and those sparse updates are nearly full-rank and nearly identical across random seeds, meaning the network itself has a structural sub-region that RL targets (see: Does reinforcement learning update only a small fraction of parameters?). Models even sparsify their own activations as tasks get harder, as if concentrating compute where the decision is (see: Do language models sparsify their activations under difficult tasks?). High-entropy tokens are the behavioral face of the same phenomenon: learning is concentrated, not diffuse.
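One simple way to check the parameter-sparsity claim on your own checkpoints is to count how many scalar weights actually moved between the pre-RL and post-RL model. The sketch below assumes each checkpoint is a dict mapping parameter names to flat lists of floats. That layout, the `tol` threshold, and the toy values are all assumptions for illustration, not the measurement protocol of the cited work.

```python
def update_fraction(before, after, tol=0.0):
    """Fraction of scalar parameters whose value changed by more than `tol`."""
    changed = 0
    total = 0
    for name, old_values in before.items():
        new_values = after[name]
        for old, new in zip(old_values, new_values):
            total += 1
            if abs(new - old) > tol:
                changed += 1
    return changed / total

# Toy checkpoints (made up): 10 scalar parameters, one of which moves.
before = {
    "layer0.weight": [0.10, -0.20, 0.30, 0.40],
    "layer1.weight": [0.50, 0.60, -0.70, 0.80, 0.90, 1.00],
}
after = {
    "layer0.weight": [0.10, -0.20, 0.30, 0.40],
    "layer1.weight": [0.50, 0.60, -0.65, 0.80, 0.90, 1.00],
}
fraction = update_fraction(before, after)
```

Comparing the set of changed indices across runs with different seeds would be the natural next step for testing the 'nearly identical across random seeds' part of the claim.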

The entropy isn't evenly distributed across *kinds* of tokens either. RL training moves through two phases: first execution correctness drives gains, then strategic planning becomes the bottleneck. Crucially, planning-token entropy rises while execution-token entropy stabilizes, and pushing optimization onto the planning tokens yields the biggest jumps (see: Does RL training follow a predictable two-phase learning sequence?). So 'high-entropy = high signal' isn't a fixed set of tokens; it migrates toward wherever the current decision frontier sits. One provocative line of work even argues the token is the wrong unit entirely: optimizing the model's *attention distribution* (where it's deciding what to look at) beats token-level RL on multimodal reasoning, because attention is where the real fork happens (see: Can optimizing attention patterns improve multimodal RL better than optimizing tokens?).

The thing worth carrying away: concentrating learning on high-entropy tokens is efficient, but it's the same mechanism that quietly squeezes a model's diversity. The exact entropy-collapse dynamic that focuses the signal in reasoning also compresses exploration in search agents, where SFT on varied demonstrations is needed to claw breadth back (see: Does reinforcement learning squeeze exploration diversity in search agents?). So the answer to 'why do high-entropy tokens carry the signal' comes bundled with a warning: the budget of uncertainty they represent is finite, and most of the art in RL for reasoning is deciding how slowly to spend it.

Sources (7 notes)

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can optimizing attention patterns improve multimodal RL better than optimizing tokens?

Reinforced Attention Learning treats attention patterns as the primary policy target rather than token sequences. Direct optimization of information allocation shows stronger gains on visual reasoning than standard RLHF, because attention is where the actual decision happens.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
