What makes thinking tokens carry more information than other tokens?
This explores why a small subset of tokens in a reasoning chain — words like 'Wait' or 'Therefore' — seem to carry disproportionate weight for getting the right answer, and what 'information' even means in that claim.
The corpus has a surprisingly sharp answer: certain tokens spike in *mutual information* with the correct answer, meaning their presence measurably shifts the model toward solving the problem. Suppress these reflection and transition tokens and accuracy drops; suppress an equal number of random tokens and there is no comparable effect (Do reflection tokens carry more information about correct answers?). So the information isn't spread evenly across the chain; it's concentrated in a few load-bearing pivots.
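To make "mutual information with the correct answer" concrete, here is a minimal sketch that estimates it from counts. It is a deliberate simplification, not the method of the cited note: it treats "the chain contains a given token" and "the final answer is correct" as two binary variables, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information_bits(pairs):
    """Empirical mutual information I(X; Y) in bits from a list of (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Compare the joint probability with what independence would predict.
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Hypothetical toy data: (chain contains "Wait", final answer is correct).
informative = (
    [(True, True)] * 40 + [(False, False)] * 40
    + [(True, False)] * 10 + [(False, True)] * 10
)
# Presence of the token tells us nothing about correctness here.
uninformative = (
    [(True, True)] * 25 + [(True, False)] * 25
    + [(False, True)] * 25 + [(False, False)] * 25
)

mi_informative = mutual_information_bits(informative)
mi_uninformative = mutual_information_bits(uninformative)
```

In the first toy dataset the token's presence agrees with correctness 80% of the time, so the estimate is clearly positive; in the second the two variables are independent and the estimate is zero. A nonzero value only says the two co-vary, which is a point the skeptical notes further down lean on.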
Several notes converge on this 'minority carries the signal' picture from different angles. One finds that only ~20% of tokens are high-entropy 'forking points' where the model genuinely decides between paths, and that RLVR primarily adjusts these: train on just the forking 20% and you match or exceed full-gradient performance (Do high-entropy tokens drive reasoning model improvements?). Another shows you can identify the reasoning-bearing tokens without any labels: they're the ones whose certainty swings sharply depending on which chain of thought preceded them, while most tokens stay stable regardless (Can we identify which tokens actually matter for reasoning?). A third finds that greedy likelihood-preserving pruning keeps symbolic-computation tokens while dropping grammar and meta-discourse first, and that students trained on the pruned chains actually do better (Which tokens in reasoning chains actually matter most?). Three different measurement strategies, same conclusion: information lives at the decision points, not the connective tissue.
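The first of these strategies, picking out high-entropy forking tokens, can be sketched in a few lines. This is an illustration under stated assumptions, not the cited note's implementation: it assumes you already have the model's next-token probability distribution at each position, and the function names, the 20% default, and the toy distributions are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy in bits of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def forking_positions(distributions, fraction=0.2):
    """Return indices of the highest-entropy positions, keeping the top `fraction`."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical next-token distributions for a five-token chain.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # near-certain continuation
    [0.25, 0.25, 0.25, 0.25],  # genuine fork: four equally likely paths
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.01, 0.00, 0.00],
    [0.85, 0.05, 0.05, 0.05],
]

forks = forking_positions(dists)
```

With five positions and a 20% cutoff, only the uniform distribution at index 1 is selected. A training setup along the lines the note describes might use such indices as a mask so that only forking positions contribute to the update, though how the original work builds that mask is not specified here.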
Here's the twist worth sitting with. A skeptical strand of the corpus argues these tokens may be *correlated* with correctness rather than *causing* it. One note shows that R1's intermediate tokens are generated identically to any other output and carry no special execution semantics, and that invalid reasoning traces frequently still produce correct answers; on that reading the trace is learned formatting, not functional computation (Do reasoning traces actually cause correct answers?). The broader finding that chain-of-thought is 'pattern-guided generation, not formal logic', where training format shapes reasoning strategy 7.5× more than domain and invalid CoT prompts work as well as valid ones, pushes the same way (What makes chain-of-thought reasoning actually work?). So 'carries more information' may mean these tokens are reliable *signals of* a good reasoning state, not levers that *create* one. The mutual-information spike is real either way; what's contested is the direction of the arrow.
The practical payoff is that information density and token *quantity* are not the same thing, and past a point they pull in opposite directions. More thinking doesn't mean more information: benchmark accuracy fell from 87.3% to 70.3% as chains ballooned from ~1,100 to ~16K tokens (Does more thinking time always improve reasoning accuracy?). Correct traces are consistently *shorter* than incorrect ones, because long traces accumulate self-revisions that compound errors (Why do correct reasoning traces contain fewer tokens?). And the optimal cutoff is invisible until you cross it (How can we predict the optimal thinking token threshold?).
If you want the strangest doorway: some architectures scale reasoning entirely in *latent space*, with no verbalized tokens at all (Can models reason without generating visible thinking tokens?). That hints the 'information' in thinking tokens might be a readout of an internal computation rather than the computation itself: the spoken word is a window onto the work, and a few words happen to be the clearest panes of glass.
Sources (10 notes)
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing equal random tokens does not, and representation recycling improves accuracy 20%.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
A small subset of tokens in reference answers change their certainty sharply depending on which chain of thought precedes them, while most tokens remain stable. This variance pattern, computable from the model's own samples, identifies reasoning-bearing tokens without supervision.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Across QwQ, DeepSeek-R1, and LIMO, correct solutions average fewer tokens than incorrect ones. Longer traces correlate with more self-revisions, which introduce and compound errors rather than improve reasoning quality.
The overthinking threshold depends on task difficulty, model training, and domain, but remains invisible until crossed. Recent work suggests difficulty estimators and runtime confidence signals can detect thresholds dynamically.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.