Why does uniform averaging across all tokens dilute the reasoning signal?
This explores why treating every token equally — averaging the learning signal flat across a whole reasoning chain — buries the handful of tokens that actually matter. The short version from the corpus: reasoning isn't spread evenly across tokens, so an even average is the wrong operation. It hands most of its weight to the wrong words.
The sharpest evidence is about entropy. Only about 20% of tokens in a reasoning trace are high-entropy "forking points", moments where the model is genuinely deciding which way to go, and reinforcement learning with verifiable rewards (RLVR) primarily adjusts those. Train on that 20% alone and you match or beat updating on every token (see "Do high-entropy tokens drive reasoning model improvements?"). Uniform averaging dilutes precisely because the other 80% are low-stakes connective tissue; let them vote equally and the decisive tokens get drowned out. A complementary finding comes from pruning: greedy likelihood-preserving pruning keeps symbolic-computation tokens and discards grammar and meta-discourse first, and student models trained on the pruned chains outperform those trained on chains compressed by a frontier model (see "Which tokens in reasoning chains actually matter most?"). The chain has a hierarchy; flat averaging pretends it doesn't.
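As a rough illustration of the entropy argument, the sketch below computes per-token entropy from next-token logits, keeps the top 20% highest-entropy positions, and compares a flat average of per-token loss with an average over only those positions. The function names, the toy logits, and the loss values are all hypothetical and chosen for illustration. This is not the training procedure from the cited note, only a minimal picture of how a flat mean shrinks the contribution of a few decisive tokens.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution at each position.

    logits: array of shape (seq_len, vocab_size).
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def forking_mask(entropy, top_fraction=0.2):
    """Boolean mask over the `top_fraction` highest-entropy positions.

    The 0.2 default mirrors the ~20% figure in the text; the selection rule
    itself (a simple top-k by entropy) is an assumption of this sketch.
    """
    k = max(1, int(round(top_fraction * entropy.shape[0])))
    top_idx = np.argsort(entropy)[-k:]
    mask = np.zeros(entropy.shape[0], dtype=bool)
    mask[top_idx] = True
    return mask

# Toy trace (hypothetical): 10 positions, vocabulary of 5.
seq_len, vocab_size = 10, 5
logits = np.zeros((seq_len, vocab_size))
logits[:, 0] = 10.0          # most positions: one token is near-certain
logits[[3, 7], :] = 0.0      # positions 3 and 7: flat distribution, a "fork"

# Made-up per-token losses: tiny on connective tokens, large at the forks.
per_token_loss = np.full(seq_len, 0.01)
per_token_loss[[3, 7]] = 2.0

entropy = token_entropy(logits)
mask = forking_mask(entropy)

flat_mean = per_token_loss.mean()           # every token votes equally
forking_mean = per_token_loss[mask].mean()  # only high-entropy tokens vote

print("forking positions:", np.flatnonzero(mask).tolist())
print("flat mean:", flat_mean, "forking-only mean:", forking_mean)
```

In this toy setup the two fork positions carry almost all of the loss, yet the flat mean reports a value roughly five times smaller than the forking-only mean, because eight low-stakes tokens share the denominator.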
The same logic shows up when you move from training to selecting traces. Global confidence averaging masks reasoning breakdowns, because one bad step gets smoothed over by many fine ones, while step-level confidence catches the breakdown and even lets you stop early. Local beats global because a single failure point determines whether the whole trace is sound, and averaging hides it (see "Does step-level confidence outperform global averaging for trace filtering?"). That's the dilution problem in miniature: the average is dominated by the bulk, and the bulk is not where the answer turns.
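A small sketch can make the global-versus-local contrast concrete. It assumes each token has a scalar confidence in [0, 1] (for example, the probability of the chosen token) and approximates "step-level" confidence with a sliding-window mean. The window size, the threshold, the function names, and both toy traces are assumptions of this sketch, not details taken from the cited note.

```python
import numpy as np

def global_confidence(token_conf):
    """Mean confidence over the whole trace."""
    return float(np.mean(token_conf))

def window_means(token_conf, window=4):
    """Mean confidence of every contiguous window of `window` tokens."""
    conf = np.asarray(token_conf, dtype=float)
    if conf.size < window:
        return np.array([conf.mean()])
    kernel = np.ones(window) / window
    return np.convolve(conf, kernel, mode="valid")

def min_window_confidence(token_conf, window=4):
    """Confidence of the weakest local window in the trace."""
    return float(window_means(token_conf, window).min())

def first_breakdown(token_conf, window=4, threshold=0.5):
    """Index of the last token in the first window whose mean confidence
    drops below `threshold`, or None if no window does.

    Generation could be stopped at this index instead of finishing the trace.
    """
    means = window_means(token_conf, window)
    below = np.flatnonzero(means < threshold)
    if below.size == 0:
        return None
    return int(below[0]) + min(window, len(token_conf)) - 1

# Two hypothetical 20-token traces with the same global mean (0.8).
trace_a = [0.8] * 20                            # uniformly decent
trace_b = [0.95] * 8 + [0.2] * 4 + [0.95] * 8   # one sharp breakdown

for name, trace in [("A", trace_a), ("B", trace_b)]:
    print(name,
          "global:", round(global_confidence(trace), 3),
          "weakest window:", round(min_window_confidence(trace), 3),
          "stop at:", first_breakdown(trace))
```

Both traces look identical to a global average, while the weakest-window score separates them and flags trace B partway through, before the remaining tokens are generated.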
Here's the less obvious part: a lot of those tokens may not be reasoning at all. Models trained with hidden chain-of-thought tokens have been shown to compute the correct answer in their first few layers and then actively suppress it in the final layers to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). And models trained on deliberately corrupted, irrelevant traces perform about as well as those trained on correct ones, which suggests much of a trace is computational scaffolding rather than meaningful steps (see "Do reasoning traces need to be semantically correct?"). If most tokens are scaffolding or suppressed filler, averaging over all of them isn't just diluting the signal; it's averaging signal together with noise that was never load-bearing.
This also reframes the "more thinking is better" instinct. Accuracy peaks and then declines as thinking tokens pile up (see "Does more thinking time always improve reasoning accuracy?"), and optimal chain length follows an inverted U, with stronger models gravitating toward shorter chains (see "Why does chain of thought accuracy eventually decline with length?"). More tokens means a bigger denominator for your average and a thinner share for each genuinely pivotal one. The throughline across all of these: the reasoning signal is concentrated, and any method (training, filtering, or length-scaling) that spreads attention uniformly is fighting against where the model actually does its work.
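The "bigger denominator" point is simple arithmetic, sketched below. The trace lengths echo the ~1,100 and ~16,000 token figures from the sources, but the count of pivotal tokens (220, i.e. 20% of the shorter trace) and the premise that this count stays fixed as the chain grows are assumptions made purely for illustration. The helper name `pivotal_share` is hypothetical.

```python
def pivotal_share(num_pivotal, total_tokens):
    """Fraction of a uniform average's total weight that lands on pivotal tokens.

    Under a flat mean every token gets weight 1 / total_tokens, so the pivotal
    tokens together receive num_pivotal / total_tokens.
    """
    if total_tokens <= 0 or not 0 <= num_pivotal <= total_tokens:
        raise ValueError("need 0 <= num_pivotal <= total_tokens and total_tokens > 0")
    return num_pivotal / total_tokens

# Assumption: the number of genuinely pivotal tokens stays at 220
# while the chain grows from 1,100 to 16,000 tokens.
NUM_PIVOTAL = 220
share_short = pivotal_share(NUM_PIVOTAL, 1_100)
share_long = pivotal_share(NUM_PIVOTAL, 16_000)

print(f"short chain: {share_short:.2%} of the average's weight")
print(f"long chain:  {share_long:.2%} of the average's weight")
```

Under that assumption, lengthening the chain cuts the pivotal tokens' combined weight from a fifth of the average to under two percent, without any change in what those tokens say.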
Sources (7 notes)
"Do high-entropy tokens drive reasoning model improvements?" Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, indicating that this minority carries the learning signal.
"Which tokens in reasoning chains actually matter most?" Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
"Does step-level confidence outperform global averaging for trace filtering?" Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.
"Do transformers hide reasoning before producing filler tokens?" Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
"Do reasoning traces need to be semantically correct?" Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
"Does more thinking time always improve reasoning accuracy?" Increasing thinking tokens from ~1,100 to ~16,000 reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
"Why does chain of thought accuracy eventually decline with length?" Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, suggesting that simplicity emerges from reward signals rather than explicit training.