Do attention scores predict which tokens will be pruned first?

This note explores whether the attention an LLM assigns to a token decides whether that token gets dropped first under compression or pruning. The corpus suggests attention is a surprisingly weak predictor compared to other importance signals.

The question here is read as: when models or pruning methods decide which tokens to throw away, do attention scores do the deciding? The collection's answer is mostly no. The signals that actually predict prunability concern a token's *function* and *information content*, not how much attention it receives. The most direct evidence comes from work on reasoning chains, where greedy likelihood-preserving pruning reveals that models internally rank tokens by functional category: symbolic computation tokens are preserved first, while grammar and meta-discourse are pruned first (see "Which tokens in reasoning chains actually matter most?"). That ranking reflects what a token *does*, not where attention lands.
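The greedy procedure described above can be pictured with a small sketch. This is a minimal illustration under stated assumptions, not the method from the cited note: `answer_logprob` is a hypothetical stand-in for a model call that scores the final answer given the remaining chain, and the toy scorer with its integer weights is invented purely so the example runs.

```python
from typing import Callable, List

def greedy_prune(tokens: List[str],
                 answer_logprob: Callable[[List[str]], float],
                 budget: int) -> List[str]:
    """Remove up to `budget` tokens, each time dropping the token whose
    removal leaves the answer score highest. Returns the removal order."""
    remaining = list(tokens)
    removed: List[str] = []
    for _ in range(min(budget, len(remaining))):
        # Score every single-token deletion and keep the least harmful one.
        best_i = max(
            range(len(remaining)),
            key=lambda i: answer_logprob(remaining[:i] + remaining[i + 1:]),
        )
        removed.append(remaining.pop(best_i))
    return removed

# Hypothetical scorer: symbolic tokens contribute a lot, filler very little.
# A real setup would query a language model here instead.
TOY_WEIGHTS = {"3": 20, "*": 20, "4": 20, "=": 15, "12": 30}

def toy_answer_logprob(chain: List[str]) -> float:
    return float(sum(TOY_WEIGHTS.get(tok, 1) for tok in chain))

chain = ["So", "we", "compute", "3", "*", "4", "=", "12", "okay"]
pruned_order = greedy_prune(chain, toy_answer_logprob, budget=4)
print(pruned_order)
```

With this toy scorer the filler words go first and the arithmetic tokens survive, which mirrors the functional ranking without consulting attention at all.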

Several other notes converge on better predictors than attention. Some tokens behave as information peaks: words like 'Wait' and 'Therefore' spike in mutual information with the correct answer, and suppressing them harms reasoning while suppressing random tokens doesn't (see "Do reflection tokens carry more information about correct answers?"). Relatedly, only about 20% of tokens are high-entropy 'forking points,' and training on just those tokens matches or exceeds full-gradient training, so the minority carries the signal (see "Do high-entropy tokens drive reasoning model improvements?"). Memory architectures make the same bet from the other direction, prioritizing *surprising* tokens for long-term storage rather than attention-heavy ones (see "Can neural memory modules scale language models beyond attention limits?"). Entropy, surprise, and mutual information keep beating raw attention as the 'keep this' criterion.
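To make the entropy criterion concrete, here is a small sketch of selecting the highest-entropy positions in a sequence. It assumes you already have the model's next-token distribution at each position; the function names, the 20% default, and the toy distributions are illustrative choices, not taken from the cited notes.

```python
import math
from typing import List, Sequence

def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_positions(dists: List[Sequence[float]],
                           keep_frac: float = 0.2) -> List[int]:
    """Indices of the top `keep_frac` positions by entropy, in sequence order."""
    n_keep = max(1, round(keep_frac * len(dists)))
    ranked = sorted(range(len(dists)),
                    key=lambda i: token_entropy(dists[i]),
                    reverse=True)
    return sorted(ranked[:n_keep])

# Made-up distributions over a 4-word vocabulary, one per position.
dists = [
    [0.97, 0.01, 0.01, 0.01],      # confident continuation
    [0.25, 0.25, 0.25, 0.25],      # uniform: a genuine fork
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
forking = high_entropy_positions(dists, keep_frac=0.2)
print(forking)
```

Note that nothing in this selection looks at attention weights: a position qualifies because the model is uncertain there, not because other tokens attend to it.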

There's a deeper reason attention is a poor pruning oracle: it's structurally biased. Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating feedback loops that amplify framing (see "Does transformer attention architecture inherently favor repeated content?"). So if you pruned by lowest attention, you'd risk keeping redundant repeated content and discarding rare-but-pivotal tokens, which is exactly backwards from the functional and entropy-based rankings. This is also why some researchers treat attention as something to *optimize* rather than trust, making attention distributions themselves the policy target instead of a passive readout (see "Can optimizing attention patterns improve multimodal RL better than optimizing tokens?").
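One simple mechanism consistent with this bias can be shown with plain softmax arithmetic: because softmax normalizes over positions, a token that appears several times collects attention mass once per copy. The sketch below is a toy, not the analysis from the cited note; it assumes every token type receives the same raw score, so any difference in total mass comes from repetition alone.

```python
import math
from collections import defaultdict
from typing import Dict, List

def softmax(scores: List[float]) -> List[float]:
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_mass_by_token(tokens: List[str],
                            score_of: Dict[str, float]) -> Dict[str, float]:
    """Total softmax attention mass per token type, summed over positions."""
    weights = softmax([score_of[t] for t in tokens])
    mass: Dict[str, float] = defaultdict(float)
    for tok, w in zip(tokens, weights):
        mass[tok] += w
    return dict(mass)

# Assumption: identical raw relevance scores for every token type.
context = ["great", "great", "great", "refund", "policy"]
mass = attention_mass_by_token(
    context, {"great": 1.0, "refund": 1.0, "policy": 1.0}
)
drop_first = min(mass, key=mass.get)       # what lowest-attention pruning would cut
print(mass, drop_first)
```

Here the repeated word ends up with three times the mass of each singleton, so a lowest-attention rule would cut one of the rare tokens first even though all three were scored as equally relevant.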

Where the question gets interesting is that 'which tokens can be dropped' turns out to be task-dependent, not a fixed property. Single-QA tasks tolerate up to 95% sparsity because reasoning concentrates in a few tokens, while multi-hop and aggregation tasks degrade substantially at 50-67% sparsity because they need attention spread across many regions (see "How much sparsity can different reasoning tasks actually tolerate?"). And sparsity done well isn't even a quality tradeoff: at equal compute, larger sparse-attention models beat smaller dense ones on long context (see "Does sparse attention trade off quality for speed?").
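A rough way to see why concentrated and distributed tasks tolerate different sparsity levels is to measure how much attention mass survives when only the top fraction of positions is kept. This is a toy sketch: the two distributions are invented, and retained mass is only a proxy for task quality, not the metric used in the cited note.

```python
from typing import List

def retained_mass(weights: List[float], sparsity: float) -> float:
    """Attention mass kept when only the top (1 - sparsity) fraction
    of positions survives."""
    n_keep = max(1, round((1.0 - sparsity) * len(weights)))
    return sum(sorted(weights, reverse=True)[:n_keep])

n = 100
# Concentrated pattern: 5 positions hold 90% of the mass (made-up numbers).
peaked = [0.18] * 5 + [0.10 / 95] * 95
# Distributed pattern: mass spread evenly over all positions.
spread = [1.0 / n] * n

peaked_kept = retained_mass(peaked, sparsity=0.95)
spread_kept = retained_mass(spread, sparsity=0.95)
print(round(peaked_kept, 3), round(spread_kept, 3))
```

At 95% sparsity the concentrated pattern keeps about 90% of its mass, while the evenly spread pattern keeps only about 5%, which is the intuition behind single-QA tasks surviving aggressive sparsity and multi-hop tasks not.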

The upshot: the field is quietly moving away from 'low attention means disposable' toward a richer toolkit of functional role, entropy, mutual information, and surprise, precisely because attention's built-in bias toward prominent and repeated content makes it an untrustworthy guide to what to cut first.

Sources (8 notes)

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do reflection tokens carry more information about correct answers?

Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can neural memory modules scale language models beyond attention limits?

The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Does transformer attention architecture inherently favor repeated content?

Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.

Can optimizing attention patterns improve multimodal RL better than optimizing tokens?

Reinforced Attention Learning treats attention patterns as the primary policy target rather than token sequences. Direct optimization of information allocation shows stronger gains on visual reasoning than standard RLHF, because attention is where the actual decision happens.

How much sparsity can different reasoning tasks actually tolerate?

Single-QA tasks tolerate 95% sparsity while multi-hop and aggregation tasks degrade substantially at 50-67% sparsity. This pattern reflects structural differences: single-QA concentrates reasoning in few tokens, while multi-hop and aggregation require distributed attention across multiple regions.

Does sparse attention trade off quality for speed?

The Sparse Frontier benchmark shows that at equivalent compute cost, larger sparse-attention models outperform smaller dense models on long-context tasks. Sparsity lets you train bigger models within the same budget, making it Pareto-improving rather than a pure trade-off.
