LLM Reasoning and Architecture

Which tokens in reasoning chains actually matter most?

Do language models internally rank tokens by functional importance? Greedy pruning experiments explore whether models preserve symbolic computation while discarding linguistic scaffolding, and what this reveals about reasoning architecture.

Note · 2026-04-18 · sourced from Reasoning Architectures
What makes chain-of-thought reasoning actually work, and how should reasoning systems be architected?

Reasoning chains are not homogeneous sequences where every token contributes equally. Greedy pruning — iteratively deleting the token whose removal least changes the model's output likelihood — reveals that models internally rank tokens by functional importance. Six distinct functional categories emerge from the pruning order: SYMBMATH (symbolic computation), METADISC (meta-discourse like "let's think"), COREF (coreference), ENTNAME (entity names), VERBALMATH (verbalized math reasoning), and GRAMMAR (grammatical connectives).
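The pruning procedure described above can be sketched in a few lines. This is my reconstruction, not the paper's code: `toy_loglik` is a stand-in for the model's log-likelihood of the answer given the (pruned) chain, with hand-picked weights chosen so that symbolic tokens are load-bearing and scaffolding words are nearly free to delete.

```python
# Sketch of greedy token pruning: repeatedly delete the token whose
# removal least changes the (stand-in) answer log-likelihood.
from typing import Callable, List, Tuple


def greedy_prune(tokens: List[str],
                 loglik: Callable[[List[str]], float]) -> List[Tuple[str, float]]:
    """Return tokens in pruning order with the likelihood change each
    removal caused; tokens pruned early are the least load-bearing."""
    remaining = list(tokens)
    order = []
    base = loglik(remaining)
    while remaining:
        # Try deleting each remaining token; keep the cheapest deletion.
        best_i, best_ll = min(
            ((i, loglik(remaining[:i] + remaining[i + 1:]))
             for i in range(len(remaining))),
            key=lambda t: abs(base - t[1]),
        )
        order.append((remaining[best_i], base - best_ll))
        del remaining[best_i]
        base = best_ll
    return order


# Hypothetical scorer: symbolic tokens carry weight, scaffolding does not.
WEIGHTS = {"2": 5.0, "+": 5.0, "3": 5.0, "=": 4.0, "5": 5.0,
           "let's": 0.2, "think": 0.2, "so": 0.1}


def toy_loglik(tokens: List[str]) -> float:
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)


if __name__ == "__main__":
    chain = ["let's", "think", "2", "+", "3", "=", "5", "so"]
    for tok, drop in greedy_prune(chain, toy_loglik):
        print(f"{tok!r} pruned (Δloglik = {drop:.1f})")
```

Under this toy scorer the scaffolding ("so", "let's", "think") is pruned first and the symbolic tokens survive longest, mirroring the hierarchy the paper reports.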

The pruning hierarchy is consistent: symbolic computation tokens are preferentially preserved while linguistic scaffolding — grammar, meta-discourse, verbal math narration — is pruned first. This means the model "knows" which tokens are load-bearing for the answer and which are stylistic packaging.

Two implications sharpen existing findings:

First, this provides a mechanistic complement to "Do reflection tokens carry more information about correct answers?". Mutual-information (MI) peaks identify important tokens via information theory; greedy pruning identifies them via likelihood preservation. The convergence across methods strengthens the sparse-pivot structure claim — but with a twist: MI peaks highlight reflection tokens ("Wait," "Hmm") while functional importance highlights symbolic computation tokens. Reflection tokens may be important for the reasoning process while symbolic tokens are important for the reasoning answer — a process-vs-product distinction within the same trace.
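One minimal way to make "convergence across methods" concrete is to rank-correlate the two per-token importance scores. The scores below are hypothetical, and this hand-rolled Spearman assumes no tied values:

```python
# Spearman rank correlation between two per-token importance scores
# (e.g. MI-peak scores vs. greedy-pruning survival), tie-free case.
from typing import List


def spearman_rho(xs: List[float], ys: List[float]) -> float:
    def ranks(vs: List[float]) -> List[int]:
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    # Classic tie-free formula: 1 - 6*sum(d^2) / (n*(n^2 - 1)).
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical scores for an 8-token chain:
mi_scores    = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.05, 0.6]
prune_scores = [0.2, 0.8, 0.1, 0.9, 0.4, 0.6, 0.05, 0.7]
print(spearman_rho(mi_scores, prune_scores))
```

A rho near 1 would indicate the two methods agree on which tokens matter; the process-vs-product split above predicts agreement should be high overall but break down exactly at reflection tokens.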

Second, the finding that student models trained on greedy-pruned chains outperform those trained on frontier-model-supervised compression is striking. The model's own internal importance ranking produces better training signal than an external teacher's judgment about what to keep. This extends the logic of "Which sentences actually steer a reasoning trace?" from analysis to training: the structural hierarchy within reasoning traces is not just observable but exploitable for more efficient distillation.
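The distillation data-prep step this implies can be sketched as follows. The keep ratio and scoring are assumptions for illustration; `importance` would come from the model's own pruning ranks (higher = survives pruning longer):

```python
# Compress a reasoning chain for distillation by keeping only the
# top fraction of tokens under the model's own importance ranking,
# preserving original token order.
from typing import List


def compress_chain(tokens: List[str], importance: List[float],
                   keep_ratio: float = 0.5) -> List[str]:
    k = max(1, round(len(tokens) * keep_ratio))
    # Indices of the k highest-importance tokens, restored to chain order.
    top = sorted(range(len(tokens)), key=lambda i: -importance[i])[:k]
    return [tokens[i] for i in sorted(top)]


chain = ["let's", "think", "2", "+", "3", "=", "5"]
scores = [0.1, 0.2, 0.9, 0.8, 0.9, 0.6, 0.95]  # hypothetical ranks
print(compress_chain(chain, scores, keep_ratio=0.5))
```

The student then trains on the compressed chain paired with the original question and answer; under the paper's finding, this self-ranked compression beats a teacher model's judgment about what to keep.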

The finding that attention scores predict pruning ranks suggests the model's attention mechanism already implements a form of importance weighting, one that could enable training-free chain compression at inference time.
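A training-free variant of that idea might look like the sketch below, which is an assumption on my part rather than the paper's method: average the attention each key position receives across all query positions, then drop tokens below a threshold.

```python
# Training-free chain compression: use attention received per token
# as a proxy for functional importance (rows of `attn` are query
# distributions over key positions and each sums to 1).
from typing import List


def attention_received(attn: List[List[float]]) -> List[float]:
    """Average attention each key position receives across queries."""
    n = len(attn)
    return [sum(row[k] for row in attn) / n for k in range(n)]


def drop_low_attention(tokens: List[str], attn: List[List[float]],
                       threshold: float) -> List[str]:
    scores = attention_received(attn)
    return [t for t, s in zip(tokens, scores) if s >= threshold]


# Toy 4-token example where position 2 ("7") soaks up attention:
tokens = ["so", "we", "7", "+"]
attn = [
    [0.10, 0.10, 0.60, 0.20],
    [0.05, 0.15, 0.55, 0.25],
    [0.05, 0.05, 0.70, 0.20],
    [0.10, 0.10, 0.50, 0.30],
]
print(drop_low_attention(tokens, attn, threshold=0.2))
```

In a real model the matrix would be averaged over heads and layers, and the threshold tuned on held-out accuracy; the point here is only that no extra training pass is required.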


Source: Reasoning Architectures · Paper: "Functional Importance of Reasoning Tokens" (2601.03066)

Original note title

reasoning chains encode token-level functional importance — models internally rank which tokens matter and linguistic scaffolding is pruned first