Why does representation recycling of MI-peak tokens improve reasoning accuracy?
This explores why a specific trick — taking the handful of high-information 'thinking' tokens (like 'Wait' and 'Therefore') and reusing their internal representations — boosts reasoning accuracy, and what the corpus says about why a tiny minority of tokens carries the reasoning signal.
The starting point is the finding that not all tokens in a reasoning chain pull equal weight: a small set of reflection and transition tokens ('Wait,' 'Therefore,' 'So') spike in mutual information with the correct answer, and suppressing exactly those tokens damages reasoning while suppressing the same number of random tokens does not (Do reflection tokens carry more information about correct answers?). Recycling their representations (feeding them back into computation) plausibly works because these tokens are where the model's decision pivots: reuse amplifies the moment that mattered instead of diluting it across filler.
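To make "spikes in mutual information" concrete, here is a minimal sketch under stated assumptions. It uses a plug-in estimate on discretized toy features, which is a stand-in and not necessarily the estimator behind the finding. The `recycle` function is one possible reading of "feeding them back into computation" (applying a layer a second time at peak positions); the note does not specify the mechanism, so the function names, the `ratio` threshold, and the toy layer are all hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) simplifies to count * n / (px[x] * py[y])
        mi += p_xy * math.log(count * n / (px[x] * py[y]))
    return mi

def mi_peaks(mi_per_step, ratio=2.0):
    """Indices whose MI exceeds `ratio` times the mean MI (threshold is an assumption)."""
    mean = sum(mi_per_step) / len(mi_per_step)
    return [i for i, v in enumerate(mi_per_step) if v > ratio * mean]

def recycle(hidden_states, layer, peaks):
    """Apply `layer` once everywhere, and a second time at peak positions only."""
    out = [layer(h) for h in hidden_states]
    for i in peaks:
        out[i] = layer(out[i])
    return out

# Toy data: 8 sampled chains, a binary correctness label per chain,
# and one discretized feature per reasoning step per chain.
correct = [1, 1, 1, 1, 0, 0, 0, 0]
step_features = [
    [0, 1, 0, 1, 0, 1, 0, 1],  # step 0: independent of correctness
    [1, 1, 1, 1, 0, 0, 0, 0],  # step 1: fully predictive (the "Wait" step)
    [0, 0, 1, 1, 0, 0, 1, 1],  # step 2: independent of correctness
]
mi = [mutual_information(f, correct) for f in step_features]
peaks = mi_peaks(mi)

# Hypothetical layer: doubles every coordinate.
toy_layer = lambda v: [2.0 * x for x in v]
recycled = recycle([[1.0], [2.0], [3.0]], toy_layer, peaks)
```

In this toy setup only step 1 is selected, so only its representation gets the extra pass; the other positions receive the ordinary single application.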
The corpus tells the same story from several independent angles, which is the real payoff here. Reinforcement learning research finds that only about 20% of tokens are high-entropy 'forking points,' and training exclusively on those matches or exceeds full-gradient updates (Do high-entropy tokens drive reasoning model improvements?). A separate line of work uses likelihood-preserving pruning to show that models implicitly rank tokens by functional role: symbolic computation tokens are preserved while grammar and meta-discourse get pruned first, and students trained on these pruned chains outperform those trained on chains compressed by a frontier model (Which tokens in reasoning chains actually matter most?). Three findings, three methods (information theory, RL gradients, likelihood pruning), all converging on the same claim: the reasoning signal lives in a sparse minority, so anything that concentrates compute there should pay off.
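The "forking point" idea can be sketched as an entropy filter over next-token distributions. This is a minimal illustration, not the training recipe from the cited work: the function names are hypothetical, and `keep_fraction=0.2` simply mirrors the ~20% figure quoted above.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_mask(distributions, keep_fraction=0.2):
    """0/1 mask marking the highest-entropy positions (fraction is an assumption)."""
    entropies = [token_entropy(d) for d in distributions]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(entropies))]

def masked_loss(per_token_losses, mask):
    """Average the loss over masked-in positions only, ignoring the rest."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept)

# Toy chain: five positions over a 3-token vocabulary.
# Only position 1 is genuinely uncertain.
dists = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],
    [0.90, 0.05, 0.05],
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
]
mask = forking_mask(dists)
loss = masked_loss([1.0, 2.0, 3.0, 4.0, 5.0], mask)
```

With five positions and a 20% budget, exactly one position survives the mask, so the update would be driven entirely by the single high-entropy token.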
What makes this genuinely strange is the flip side: much of the rest of the chain is scaffolding, not thought. Models trained on systematically irrelevant reasoning traces stay just as accurate, and sometimes generalize better out of distribution, suggesting the visible text is computational structure rather than meaningful steps (Do reasoning traces need to be semantically correct?). And models trained with hidden chain-of-thought tokens compute the correct answer in their earliest layers (layers 1-3), then actively suppress it in the final layers in favor of format-compliant filler (Do transformers hide reasoning before producing filler tokens?). If the real reasoning is hidden and most surface tokens are filler, then recycling the rare information-dense tokens is a way of reaching back toward the computation the model already did and would otherwise bury.
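The layer-by-layer reading described above is usually done with a logit lens: project each layer's hidden state through the unembedding matrix and see which tokens rank highest. The sketch below is a toy version with made-up numbers, and it omits the final layer normalization that real implementations typically apply; the vocabulary, matrix, and hidden states are all hypothetical.

```python
def logit_lens(hidden_by_layer, unembed, vocab):
    """For each layer, rank vocabulary tokens by the logit of unembed @ hidden."""
    rankings = []
    for h in hidden_by_layer:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
        rankings.append([vocab[i] for i in order])
    return rankings

# Hypothetical 3-token vocabulary: an answer token, a filler token, a distractor.
vocab = ["7", "...", "the"]
unembed = [
    [1.0, 0.0],    # row for "7"
    [0.0, 1.0],    # row for "..."
    [-1.0, -1.0],  # row for "the"
]
# Made-up hidden states for an early layer and the final layer.
hidden_by_layer = [
    [2.0, 0.5],
    [1.0, 3.0],
]
rankings = logit_lens(hidden_by_layer, unembed, vocab)
```

In this toy example the answer token ranks first at the early layer and drops to second at the final layer, behind the filler token, which is the shape of the pattern the cited note reports: the answer remains recoverable from lower-ranked predictions.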
This also reframes a paradox the corpus keeps surfacing: more thinking is not better. On one benchmark, accuracy fell from 87.3% to 70.3% as thinking tokens grew from ~1,100 to ~16K (Does more thinking time always improve reasoning accuracy?), and reasoning degrades sharply with input length even far below the context limit (Does reasoning ability actually degrade with longer inputs?). Length itself is a liability; signal density is the asset. That fits with the finding that brevity can be steered into a model as a single activation direction without hurting accuracy (Can we steer reasoning toward brevity without retraining?), and with models that drop visible thinking entirely and reason in latent space (Can models reason without generating visible thinking tokens?).
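A "single activation direction" can be illustrated with a difference-of-means steering vector, a common way to build such directions from paired examples. Whether the cited method extracts its vector exactly this way is an assumption; the function names, the `alpha` scale, and the two-dimensional toy activations are hypothetical.

```python
def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(concise_acts, verbose_acts):
    """Difference of mean activations: concise minus verbose."""
    mean_concise = mean_vector(concise_acts)
    mean_verbose = mean_vector(verbose_acts)
    return [c - v for c, v in zip(mean_concise, mean_verbose)]

def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the steering direction by a factor of alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy paired activations (the cited note mentions 50 pairs; two are used here).
concise = [[1.0, 0.0], [3.0, 0.0]]
verbose = [[0.0, 1.0], [0.0, 3.0]]
direction = steering_vector(concise, verbose)
steered = steer([0.0, 0.0], direction, alpha=0.5)
```

Because the vector is computed once from stored activations and added at inference time, nothing in this sketch requires a gradient step, which matches the training-free framing of the cited note.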
The thing you didn't know you wanted to know: 'reasoning quality' may be less about how much a model thinks and more about whether it returns to the few moments where thinking actually turned. Recycling MI-peak tokens isn't adding intelligence — it's refusing to let the model overwrite the parts that already had it.
Sources (9 notes)
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.