Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Do only 20 percent of tokens actually matter for reasoning?

Chain-of-thought reasoning might depend on a small minority of high-entropy tokens that act as decision points. If true, could training that focuses only on these critical tokens match or exceed full-gradient updates?

Note · 2026-02-22 · sourced from RLVR
How should researchers navigate LLM reasoning research? What does reward learning actually do to model reasoning?

In Chain-of-Thought reasoning, token entropy follows a distinct pattern: the vast majority of tokens are generated with low entropy (completing ongoing linguistic structures), while a critical minority emerges with high entropy (functioning as pivotal decision points that determine the trajectory among multiple possible pathways). These high-entropy "forking tokens" are where the model actually decides between reasoning directions.
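A minimal sketch of how such tokens could be identified, assuming access to the model's per-step logits (the function names and the 20% cutoff parameterization are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of the next-token distribution at each step.

    logits: (seq_len, vocab_size) raw model scores for one rollout.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def forking_token_mask(logits: torch.Tensor, top_frac: float = 0.2) -> torch.Tensor:
    """Boolean mask over positions: True for the top_frac highest-entropy tokens."""
    ent = token_entropies(logits)
    k = max(1, int(top_frac * ent.numel()))
    threshold = ent.topk(k).values.min()  # entropy of the k-th most uncertain position
    return ent >= threshold
```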

Three converging findings establish their primacy:

Causal role confirmed by intervention. Moderately increasing the entropy of forking tokens during decoding measurably improves reasoning performance; artificially reducing their entropy degrades it. The tokens are not just correlated with reasoning quality — they causally determine it.
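One way to picture this intervention is temperature modulation at decode time: sample high-entropy steps slightly hotter and leave the rest alone. The sketch below illustrates the idea but is not the paper's exact protocol; the threshold and temperatures are made-up values, a Hugging-Face-style `model(input_ids).logits` interface is assumed, and `token_entropies` comes from the sketch above.

```python
@torch.no_grad()
def decode_with_forking_boost(model, input_ids, max_new_tokens=64,
                              entropy_threshold=1.5, boost_temp=1.4, base_temp=1.0):
    """Sampling loop that raises temperature only at high-entropy (forking) steps."""
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[0, -1]            # (vocab_size,)
        ent = token_entropies(logits.unsqueeze(0)).item()
        # Hotter sampling only where the model is genuinely uncertain.
        temp = boost_temp if ent > entropy_threshold else base_temp
        probs = torch.softmax(logits / temp, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
    return input_ids
```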

RLVR primarily operates on forking tokens. Analysis of entropy evolution during RLVR training shows the reasoning model largely retains the base model's entropy patterns, with only gradual changes. Critically, RLVR mainly adjusts the entropy of the already-high-entropy tokens, while low-entropy tokens barely move. The training signal is concentrated where it matters.
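This claim can be checked on any base/RLVR model pair by scoring the same trace under both models and looking at where the entropy deltas concentrate. A sketch, assuming the two models share a tokenizer and expose `.logits`:

```python
@torch.no_grad()
def entropy_shift(base_model, rl_model, input_ids):
    """Per-position entropy delta between an RLVR-tuned model and its base.

    If the finding holds, |delta| should concentrate at positions where the
    base model's entropy is already high, and stay near zero elsewhere.
    """
    ent_base = token_entropies(base_model(input_ids).logits[0])
    ent_rl = token_entropies(rl_model(input_ids).logits[0])
    return ent_base, ent_rl - ent_base
```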

Sparse training matches or exceeds full training. Restricting policy-gradient updates to the 20% highest-entropy tokens matches full-gradient performance on Qwen3-8B and significantly surpasses it on Qwen3-32B (+11.04 on AIME'25) and Qwen3-14B (+4.79 on AIME'25). Training on the 80% lowest-entropy tokens instead leads to a marked decline. This "beyond the 80/20 rule" result shows the minority carries the learning signal.
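In training terms, the restriction amounts to masking the policy-gradient loss so that only forking-token positions contribute gradient. The sketch below uses a simplified REINFORCE-style surrogate; the paper's actual objective is a clipped PPO/GRPO-family loss with batch-level entropy thresholds, so treat this as the shape of the idea rather than the implementation. It reuses `forking_token_mask` from the first sketch.

```python
def sparse_pg_loss(logits, actions, advantages, top_frac=0.2):
    """Policy-gradient loss restricted to the highest-entropy fraction of tokens.

    logits:     (seq_len, vocab_size) policy logits for one sampled rollout
    actions:    (seq_len,) sampled token ids
    advantages: (seq_len,) advantages derived from the verifiable reward
    """
    log_probs = F.log_softmax(logits, dim=-1)
    act_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # Mask computed on detached logits: entropy selects positions but
    # receives no gradient itself.
    mask = forking_token_mask(logits.detach(), top_frac=top_frac).float()
    return -(mask * advantages * act_logp).sum() / mask.sum().clamp(min=1.0)
```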

Connecting to Does reinforcement learning update only a small fraction of parameters?, there is a striking parallel: RL operates on sparse critical subsets at both the parameter level (5-30% of parameters) and the token level (20% of tokens). The sparsity is not a limitation but a feature: it concentrates the learning signal where it has leverage.

Relative to Which sentences actually steer a reasoning trace?, forking tokens are the token-level mechanistic correlate of thought anchors. Both identify critical decision points in reasoning, but at different granularities: thought anchors at the sentence level, forking tokens at the individual token level.


Source: RLVR


High-entropy minority tokens are the critical forking points that drive RLVR effectiveness — restricting gradient updates to 20 percent of tokens matches or exceeds full updates.