What makes some tokens carry disproportionate information about answers?
This explores why a small fraction of tokens inside a model's reasoning carry most of the signal about whether the answer is right — and how researchers spot those tokens.
The corpus converges on a striking pattern: information about correct answers isn't spread evenly across a model's output. It clusters. Certain tokens, often connective and reflective ones like "Wait" and "Therefore", spike sharply in mutual information with the correct answer, and suppressing those specific tokens damages reasoning while suppressing the same number of random tokens does almost nothing (see "Do reflection tokens carry more information about correct answers?"). The information isn't in the bulk of the text; it's in the hinges.
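To make "mutual information with the correct answer" concrete, here is a minimal sketch of a plug-in mutual information estimate between two discrete variables. It assumes a simplified setup that is not taken from the notes: each rollout is reduced to a pair (does it contain a given token, was its answer correct). The function name and the toy observations are illustrative only, and the cited work may measure mutual information quite differently.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi

# Toy observations (made up): (rollout contains the token, answer was correct)
informative = [(1, 1)] * 5 + [(0, 0)] * 5       # token presence tracks correctness
uninformative = [(0, 0), (0, 1), (1, 0), (1, 1)]  # token presence tells us nothing

mi_informative = mutual_information(informative)      # 1.0 bit
mi_uninformative = mutual_information(uninformative)  # 0.0 bits
```

A token whose presence tracks correctness scores high under this estimate, while one that appears equally often in right and wrong rollouts scores near zero.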
Why those particular tokens? Several notes triangulate the same underlying mechanism from different angles. One framing is decision uncertainty: only about 20% of tokens show high entropy, and those high-entropy tokens are the "forking points" where reasoning could branch one way or another. Train a reasoning model on just those and you match or exceed full-gradient performance (see "Do high-entropy tokens drive reasoning model improvements?"). A complementary framing is variance across attempts: the tokens that matter are the ones whose certainty swings depending on which chain of thought came before them, while most tokens stay stable no matter what, and you can find them from the model's own samples without any labels (see "Can we identify which tokens actually matter for reasoning?"). High entropy and high cross-rollout variance are two readings of the same thing: a token carries disproportionate information precisely when the answer is still genuinely up for grabs at that point.
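Both framings reduce to simple per-position statistics. The sketch below assumes you already have next-token probability distributions for each position and, for the variance view, the log-probability of each reference token under several rollouts. The function names, the 0.2 default, and the toy numbers are illustrative assumptions, not the method of the cited notes.

```python
import math
from statistics import pvariance

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_entropy_positions(distributions, fraction=0.2):
    """Indices of the highest-entropy positions, keeping roughly `fraction` of them."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

def cross_rollout_variance(logprobs_by_rollout):
    """Per-position variance of a reference token's log-probability across rollouts."""
    return [pvariance(column) for column in zip(*logprobs_by_rollout)]

# Toy distributions (made up): position 1 is a near-uniform "fork", the rest are confident.
distributions = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.10, 0.00, 0.00],
    [1.00, 0.00, 0.00, 0.00],
    [0.99, 0.01, 0.00, 0.00],
]
forking = top_entropy_positions(distributions)  # [1]

# Toy log-probs (made up): 3 rollouts x 2 reference tokens.
# Token 0 is stable across rollouts; token 1 swings with the preceding chain of thought.
variances = cross_rollout_variance([[-0.1, -2.0], [-0.1, -0.2], [-0.1, -3.5]])
```

In this toy setup the fork is the only position selected by entropy, and only the second reference token shows meaningful variance across rollouts.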
There's also a functional-role story underneath the statistical one. When researchers greedily prune reasoning chains by what the model can afford to lose, symbolic computation tokens get preserved first while grammar and meta-commentary get cut: the model is implicitly ranking its own tokens by how much work they do (see "Which tokens in reasoning chains actually matter most?"). So "disproportionate information" maps onto a small set of load-bearing categories, not a random scatter.
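The pruning idea can be sketched as a greedy loop: repeatedly drop whichever token's removal hurts a score the least. In the sketch below, `score` is a hypothetical stand-in for the model's likelihood of the answer given the retained chain, and `toy_score` with its weights is invented purely so the example runs. The actual procedure in the cited note may differ.

```python
from typing import Callable, List, Sequence

def greedy_prune(tokens: Sequence[str],
                 score: Callable[[List[str]], float],
                 keep: int) -> List[str]:
    """Drop tokens one at a time, always removing the one whose absence costs the least."""
    kept = list(tokens)
    while len(kept) > keep:
        # Score the chain with each token removed; keep the best-scoring removal.
        best_i = max(range(len(kept)),
                     key=lambda i: score(kept[:i] + kept[i + 1:]))
        del kept[best_i]
    return kept

# Hypothetical scorer: symbolic tokens get large weights, everything else a small one.
# A real scorer would query a model for the answer's likelihood instead.
TOY_WEIGHTS = {"3": 2.0, "*": 2.0, "4": 2.0, "=": 1.5, "12": 3.0}

def toy_score(tokens: List[str]) -> float:
    return sum(TOY_WEIGHTS.get(t, 0.1) for t in tokens)

chain = ["so", "3", "*", "4", "=", "12", "okay", "then"]
pruned = greedy_prune(chain, toy_score, keep=5)
# pruned == ["3", "*", "4", "=", "12"]
```

With this toy scorer the filler words go first and the arithmetic survives, which mirrors the preservation order described above.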
The most unsettling thread complicates the whole picture: the visible token isn't always where the information lives. In models trained with hidden chain-of-thought, the correct answer is computed in the first few layers (layers 1-3 in the logit lens analysis) and then actively overwritten so the final output is format-compliant filler; the real reasoning is recoverable only from lower-ranked predictions (see "Do transformers hide reasoning before producing filler tokens?"). That means a token can carry enormous information internally while looking inert on the surface, which is the inverse of the "Wait/Therefore" case where the informative token is right there in plain sight.
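As a rough illustration of reading "lower-ranked predictions", the sketch below takes per-position logits and returns the highest-ranked token that is not the filler. The vocabulary, the logits, and the function names are made up for the example. In practice the logits would come from projecting a model's hidden states through its unembedding, which is not shown here.

```python
def ranked_tokens(logits, vocab):
    """Vocabulary tokens ordered from highest to lowest logit."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in order]

def decode_below_filler(logits_per_position, vocab, filler):
    """At each position, return the best-ranked token that is not the filler."""
    return [
        next(tok for tok in ranked_tokens(logits, vocab) if tok != filler)
        for logits in logits_per_position
    ]

# Toy data (made up): the top-1 prediction is always the filler ".",
# while the second-ranked token carries the hidden intermediate value.
vocab = [".", "7", "2", "5"]
logits_per_position = [
    [5.0, 4.0, 0.1, 0.2],
    [6.0, 0.3, 3.5, 0.1],
]

surface = [ranked_tokens(l, vocab)[0] for l in logits_per_position]  # [".", "."]
hidden = decode_below_filler(logits_per_position, vocab, filler=".")  # ["7", "2"]
```

The surface decode shows only filler, while the rank-2 decode exposes the values the toy logits were built to hide.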
If you want to pull the lens back: the same "the minority carries the signal" logic shows up beyond single tokens. Agentic research systems follow a test-time scaling curve with diminishing returns over search iterations, where a few high-value search steps drive answer quality the way pivotal reasoning tokens do (see "Does search budget scale like reasoning tokens for answer quality?"). The recurring lesson across the collection is that information about answers is sparse and concentrated, whether the unit is a token, a reasoning step, or a search query. The practical payoff is that you can find and act on the small fraction that matters (about 20% of tokens in the entropy result) without touching the rest.
Sources (6 notes)
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
A small subset of tokens in reference answers change their certainty sharply depending on which chain of thought precedes them, while most tokens remain stable. This variance pattern, computable from the model's own samples, identifies reasoning-bearing tokens without supervision.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.