Do LLMs Encode Functional Importance of Reasoning Tokens?

Paper · arXiv 2601.03066 · Published January 6, 2026
Tags: Reasoning Architectures · MechInterp · Cognitive Models · Latent

Large language models solve complex tasks by generating long reasoning chains, achieving higher accuracy at the cost of greater computation and a reduced ability to isolate functionally relevant reasoning. Prior work on compact reasoning shortens such chains through probabilistic sampling, heuristics, or supervision from frontier models, but offers limited insight into whether models internally encode token-level functional importance for answer generation. We address this gap diagnostically and propose greedy pruning, a likelihood-preserving deletion procedure that iteratively removes reasoning tokens whose removal minimally degrades model likelihood under a specified objective, yielding length-controlled reasoning chains. We evaluate pruned reasoning in a distillation framework and show that students trained on pruned chains outperform a frontier-model–supervised compression baseline at matched reasoning lengths. Finally, our analysis reveals systematic pruning patterns and shows that attention scores can predict greedy pruning ranks, further suggesting that models encode a nontrivial functional importance structure over reasoning tokens.

Large Language Models (LLMs) rely on long, explicit reasoning chains to solve complex tasks in mathematics, science, and multi-step decision making (Wei et al., 2022; El-Kishky, 2024; Guo et al., 2025a). While long reasoning chains improve performance, their length incurs substantial costs, including higher inference latency, increased training and memory requirements, and greater difficulty in isolating the parts of the reasoning that are functionally responsible for the final answer. This trade-off has motivated a growing line of work on producing more efficient and compact reasoning chains that preserve task performance (Liu et al., 2024; Kang et al., 2025; Aggarwal and Welleck, 2025; Feng et al., 2025). Most existing approaches to compact reasoning achieve compression by explicitly exploring the space of reasoning chains and selecting shorter alternatives. One prominent mechanism is temperature sampling, in which methods generate multiple candidate chains and train models on the shorter ones among them (Hassid et al., 2025; Aggarwal and Welleck, 2025; Luo et al., 2025; Hou et al., 2025). Other mechanisms include LLM probability-based heuristics or post-processing rules for removing reasoning segments, as well as the use of frontier LLMs to generate candidate compact reasoning chains (Kang et al., 2025; Xia et al., 2025; Qiao et al., 2025). Although effective, these approaches provide limited insight into whether the reasoning LLM internally encodes the token-level functional importance of its reasoning for answer generation.

Compared to prior work, we take a different perspective and ask a more fundamental question: Do LLMs internally encode the functional importance of reasoning tokens for answer generation? Rather than proposing another method for generating compact reasoning, we study whether a model’s own likelihood and attention patterns can serve as signals for ranking reasoning tokens by their functional importance. This framing treats reasoning compression as a diagnostic problem aimed at revealing internally encoded token-level functional structure in the model’s reasoning.

To this end, we introduce greedy pruning, a likelihood-preserving deletion procedure inspired by perturbation-based attribution methods (Zeiler and Fergus, 2014; Zintgraf et al., 2017) and greedy decoding. Perturbation-based attribution methods assess the importance of input parts by perturbing or removing them and observing the resulting changes in the model’s output, while greedy decoding in LLMs incrementally adds tokens to maximize likelihood. Building on these ideas, greedy pruning starts from a complete reasoning chain and incrementally removes tokens whose deletion minimally degrades the model’s likelihood under a specific objective (described in Section 3.2). This process produces a ranking over reasoning tokens and a set of length-controlled reasoning chains, where pruning ranks reflect functional importance under the model’s own distribution. Under this interpretation, greedy pruning serves as a probe for identifying reasoning-token subsequences important for maintaining predictive behavior.
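The iterative deletion loop above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `log_likelihood` is a hypothetical stand-in for the pruning objective of Section 3.2 (in practice it would score candidate chains under the model and batch evaluations for efficiency).

```python
import math
from typing import Callable, List, Tuple


def greedy_prune(
    tokens: List[str],
    log_likelihood: Callable[[List[str]], float],
) -> List[Tuple[int, str]]:
    """Iteratively delete the token whose removal least degrades
    the chain's log-likelihood under the given objective.

    Returns (original_position, token) pairs in removal order:
    tokens removed later are more functionally important, so the
    removal order induces a ranking over reasoning tokens.
    """
    remaining = list(enumerate(tokens))  # keep original positions
    removal_order: List[Tuple[int, str]] = []
    while remaining:
        # Score every single-token deletion; keep the least harmful one.
        best_i, best_ll = 0, -math.inf
        for i in range(len(remaining)):
            candidate = [t for j, (_, t) in enumerate(remaining) if j != i]
            ll = log_likelihood(candidate)
            if ll > best_ll:
                best_i, best_ll = i, ll
        removal_order.append(remaining.pop(best_i))
    return removal_order
```

Truncating the removal order at any point yields the length-controlled chains described above: keeping the last k removed tokens gives the length-k chain the procedure considers most likelihood-preserving.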

To test whether greedily pruned reasoning chains indeed preserve functionally important tokens, we evaluate them using a teacher–pruner–student distillation framework across multiple reasoning benchmarks, including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and the multi-domain benchmark MMLU-Pro (Wang et al., 2024). Quantitatively, we show that students distilled on greedily pruned reasoning chains outperform multiple pruning baselines, including TokenSkip (Xia et al., 2025), which relies on token importance labels from frontier models. Qualitatively, we analyze the functional structure of pruned reasoning chains, revealing consistent patterns in which types of tokens are preserved or removed at different pruning stages. Finally, we examine whether pruning ranks can be predicted from attention scores, providing evidence that importance signals revealed by greedy pruning are accessible from the model’s own internals.
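A minimal way to quantify whether attention scores track pruning ranks is a rank correlation between the two. The sketch below is an illustrative check, not the paper's predictive model; it assumes per-token attention scores and greedy pruning ranks are available as equal-length lists of distinct values (ties are not handled).

```python
import math
from typing import List, Sequence


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation for lists of distinct values:
    Pearson correlation computed on the ranks."""

    def ranks(v: Sequence[float]) -> List[float]:
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

A correlation near 1.0 between mean attention received by a token and its pruning rank would indicate that the importance signal exposed by greedy pruning is also readable from attention patterns.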

Overall, our results suggest that LLMs encode a nontrivial token-level functional structure within reasoning, which can be revealed through likelihood-preserving pruning and corroborated by attention-based signals. Beyond reasoning compression, greedy pruning offers a principled tool for probing the internal organization of LLM-generated reasoning.

5.1 Functional Structure Under Pruning

To analyze which reasoning tokens are preferentially preserved or removed under greedy pruning, we annotate each reasoning token by its functional role in reasoning, complementing prior reasoning-step–level analyses of reasoning structure (Marjanović et al., 2025) with a token-level functional characterization. This annotation allows us to track how different functional components evolve across pruning stages and to characterize the emergent structure of pruned reasoning chains. We conduct this analysis on 1,000 randomly sampled GSM8K examples, as GSM8K offers particularly interpretable token-level functional categories; extending this analysis to additional datasets is left for future work. Specifically, we define six coarse, interpretable functional categories: SYMBMATH (explicit equations and mathematical symbols), METADISC (reasoning narration and structuring), COREF (pronouns and referential expressions), ENTNAME (concrete entities), VERBALMATH (natural-language arithmetic relations), and GRAMMAR (grammatical fillers). Annotation is performed using gpt-5-mini and validated through manual inspection and stability checks; details are provided in Appendix D.1 and D.2.
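Tracking how functional components evolve across pruning stages reduces to a simple survival computation per category. The sketch below is an illustrative assumption about the data layout (one category label and one pruning rank per token, with higher rank meaning pruned later), not the paper's analysis code.

```python
from collections import Counter
from typing import Dict, List


def survival_by_category(
    categories: List[str],
    prune_ranks: List[int],
    keep_ratio: float,
) -> Dict[str, float]:
    """Fraction of each functional category that survives when only
    the top keep_ratio most important tokens are retained.

    Tokens with the highest pruning rank (pruned last) are kept.
    """
    n = len(categories)
    k = int(n * keep_ratio)
    kept_idx = sorted(range(n), key=lambda i: prune_ranks[i], reverse=True)[:k]
    total = Counter(categories)
    kept = Counter(categories[i] for i in kept_idx)
    return {c: kept[c] / total[c] for c in total}
```

Under the paper's findings, SYMBMATH survival should stay near 1.0 at aggressive compression while GRAMMAR survival drops early.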

Using greedy pruning, a likelihood-preserving deletion procedure, we show that models can systematically compress reasoning while retaining information critical for answer generation, and that students trained on such pruned reasoning outperform multiple baselines at matched reasoning lengths. Our analyses reveal a stable and interpretable functional structure under pruning, with symbolic computation preferentially preserved and supporting linguistic scaffolding pruned earlier, and further show that pruning behavior depends on the pruning objective and pruner strength. We also demonstrate that pruning ranks are dynamically reshaped as context contracts and are strongly predictable from attention patterns, indicating that models encode internal signals of reasoning-token importance. Together, these findings establish greedy pruning as a principled diagnostic method for exposing token-level importance structure in model-generated reasoning.