Why does representation recycling of MI-peak tokens improve reasoning accuracy?

This explores why a specific trick — taking the handful of high-information 'thinking' tokens (like 'Wait' and 'Therefore') and reusing their internal representations — boosts reasoning accuracy, and what the corpus says about why a tiny minority of tokens carries the reasoning signal.

The starting point is the finding that not all tokens in a reasoning chain pull equal weight: a small set of reflection and transition tokens ('Wait,' 'Therefore,' 'So') spike in mutual information with the correct answer, and suppressing exactly those tokens damages reasoning while suppressing the same number of random tokens does not (see: Do reflection tokens carry more information about correct answers?). Recycling their representations (feeding them back into computation) works because these tokens are where the model's decision actually pivots: reuse amplifies the moment that mattered instead of diluting it across filler.
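To make "MI-peak token" concrete, here is a minimal sketch of how spikes could be flagged once per-token mutual-information scores exist. It does not estimate mutual information itself: the scores, the `find_mi_peaks` helper, and the mean-plus-`k`-standard-deviations threshold are all illustrative assumptions, not the method of the cited work.

```python
from statistics import mean, pstdev

def find_mi_peaks(tokens, mi_scores, k=1.0):
    """Return (index, token) pairs whose score is a local maximum and
    lies more than k standard deviations above the mean score.

    mi_scores are assumed to be precomputed estimates of the mutual
    information between each token's representation and the answer.
    """
    if len(tokens) != len(mi_scores):
        raise ValueError("tokens and mi_scores must have the same length")
    threshold = mean(mi_scores) + k * pstdev(mi_scores)
    peaks = []
    for i, score in enumerate(mi_scores):
        left = mi_scores[i - 1] if i > 0 else float("-inf")
        right = mi_scores[i + 1] if i < len(mi_scores) - 1 else float("-inf")
        if score > threshold and score >= left and score >= right:
            peaks.append((i, tokens[i]))
    return peaks

# Made-up scores for illustration only.
tokens = ["The", "sum", "is", "12", "Wait", "carry", "the", "one", "Therefore", "13"]
mi_scores = [0.05, 0.10, 0.04, 0.20, 0.90, 0.15, 0.05, 0.12, 0.85, 0.30]
peaks = find_mi_peaks(tokens, mi_scores)
print(peaks)
```

With these invented scores, only 'Wait' and 'Therefore' clear the threshold. The indices returned are the positions whose representations a recycling scheme would reuse, and the same indices define the "suppress these versus suppress random tokens" comparison.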

The corpus tells the same story from several independent angles, which is the real payoff here. Reinforcement learning research finds that only about 20% of tokens are high-entropy 'forking points,' and training exclusively on those matches or beats full-gradient updates (see: Do high-entropy tokens drive reasoning model improvements?). A separate line shows models internally rank tokens by functional role: symbolic computation tokens are preserved while grammar and meta-discourse get pruned first, and students trained on these pruned chains outperform those trained on chains compressed by a frontier model (see: Which tokens in reasoning chains actually matter most?). Three findings, three methods (information theory, RL gradients, likelihood pruning), all converging on the same claim: the reasoning signal lives in a sparse minority, so anything that concentrates compute there pays off.
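The "train only on forking points" idea can be sketched as an entropy mask over token positions. This is a toy illustration under stated assumptions: the helper names, the top-fraction selection rule, and the example distributions and losses are hypothetical, and a real RL setup would apply the mask to policy-gradient terms rather than to a plain list of losses.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_mask(distributions, fraction=0.2):
    """Mark the top `fraction` of positions by entropy as forking tokens."""
    entropies = [token_entropy(d) for d in distributions]
    n_keep = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]

def masked_mean_loss(losses, mask):
    """Average the per-token losses over the selected positions only."""
    selected = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(selected) / len(selected)

# Five positions; only the second is a genuinely uncertain choice.
distributions = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.70, 0.10, 0.10, 0.10],
    [0.99, 0.005, 0.003, 0.002],
]
losses = [0.1, 2.0, 0.2, 0.5, 0.05]
mask = forking_mask(distributions, fraction=0.2)
loss = masked_mean_loss(losses, mask)
print(mask, loss)
```

The design choice is that the other 80% of positions contribute nothing to the update, which is the sense in which the minority "carries the learning signal."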

What makes this genuinely strange is the flip side: much of the rest of the chain is scaffolding, not thought. Models trained on systematically irrelevant reasoning traces stay just as accurate, suggesting the visible text is computational structure rather than meaningful steps (see: Do reasoning traces need to be semantically correct?). And transformers trained with hidden chain-of-thought tokens compute the answer in their earliest layers, then actively suppress it in the final layers in favor of format-compliant filler (see: Do transformers hide reasoning before producing filler tokens?). If the real reasoning is hidden and most surface tokens are filler, then recycling the rare information-dense tokens is a way of reaching back toward the computation the model already did and would otherwise bury.

This also reframes a paradox the corpus keeps surfacing: more thinking is not better. Accuracy peaks then declines as thinking tokens grow from ~1,100 to ~16K (see: Does more thinking time always improve reasoning accuracy?), and reasoning degrades sharply with input length even far below the context limit (see: Does reasoning ability actually degrade with longer inputs?). Length itself is a liability; signal density is the asset. That's why brevity can be steered into a model as a single activation direction without hurting accuracy (see: Can we steer reasoning toward brevity without retraining?), and why some models drop visible thinking entirely and reason in latent space (see: Can models reason without generating visible thinking tokens?).
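One common way to obtain "a single activation direction" is a difference of mean activations between paired examples. The sketch below assumes that recipe. The notes here do not say how the brevity vector is actually extracted, so the function names, the difference-of-means rule, and the `alpha` scale are illustrative assumptions, and the two-dimensional activations are made up.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(verbose_acts, concise_acts):
    """Direction pointing from the verbose mean toward the concise mean."""
    verbose_mean = mean_vector(verbose_acts)
    concise_mean = mean_vector(concise_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]

def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state along the steering direction by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy activations from two paired examples (verbose vs. concise answers).
verbose_acts = [[1.0, 0.0], [3.0, 2.0]]
concise_acts = [[0.0, 1.0], [2.0, 5.0]]
direction = steering_vector(verbose_acts, concise_acts)
steered = steer([0.5, 0.5], direction, alpha=0.5)
print(direction, steered)
```

In a real model the vectors would be residual-stream activations at a chosen layer, and the shift would be applied at inference time with no weight updates, which is what makes such a method training-free.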

The thing you didn't know you wanted to know: 'reasoning quality' may be less about how much a model thinks and more about whether it returns to the few moments where thinking actually turned. Recycling MI-peak tokens isn't adding intelligence — it's refusing to let the model overwrite the parts that already had it.

Sources 9 notes

Do reflection tokens carry more information about correct answers?

Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning systems analyst. The question remains open: Why does representation recycling of MI-peak tokens improve reasoning accuracy, and does this mechanism still hold as model scale, inference budgets, and training paradigms shift?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat these as perishable snapshots.
• A sparse minority of tokens ('Wait,' 'Therefore,' 'So') spike in mutual information with correct answers; recycling their representations amplifies decision pivots (2025-06, arXiv:2506.02867).
• Only ~20% of tokens are high-entropy 'forking points' that drive RL updates; training exclusively on these matches or beats full-gradient baselines (2025-06, arXiv:2506.01939).
• Models internally rank tokens by functional role; reasoning chains survive pruning of grammar/meta-discourse, and students trained on pruned chains outperform students trained on frontier-model-compressed chains (2026-01, arXiv:2601.03066).
• Reasoning accuracy peaks then degrades as thinking tokens grow from ~1,100 to ~16K; length is a liability, signal density the asset (2025-06, arXiv:2506.04210).
• In models trained with hidden CoT tokens, correct answers are computed in early transformer layers (1-3); the final layers then actively suppress them in favor of format-compliant filler output (2024-12, arXiv:2412.04537).

Anchor papers (verify; mind their dates):
• arXiv:2506.02867 (2025-06): Mutual Information peak tokens
• arXiv:2506.01939 (2025-06): High-entropy minority tokens in RL
• arXiv:2601.03066 (2026-01): Functional importance encoding
• arXiv:2506.04210 (2025-06): Test-time scaling saturation

Your task:
(1) RE-TEST EACH CONSTRAINT. For recycling mechanisms: probe whether newer inference-time interventions (e.g., speculative decoding, adaptive compute routing, multi-head routing to sparse expert layers), or post-training methods (DPO, IPO, synthetic preference sets) have since RELAXED the reliance on MI-peak identification or enabled cheaper identification. Judge whether the core claim—that *reuse* of rare high-signal tokens outperforms reuse of random tokens—still separates durable insight from perishable limitation.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. In particular, flag any findings that: (a) show MI-peak recycling offers no advantage over simpler token filtering, (b) demonstrate that length penalties vanish under specific training regimes, or (c) claim reasoning quality is *not* sparse.
(3) Propose 2 research questions that ASSUME the regime may have moved: one addressing whether MI-peak detection itself is now automated/learned rather than post-hoc, and one testing whether recycling generalizes across reasoning domains (math, code, open-ended) or breaks down in low-structure tasks.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
