Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties

Paper · arXiv 2506.05744 · Published June 6, 2025
Reasoning Architectures · Knowledge Graphs · Cognitive Models · LatentMechInterp

Recent large-scale reasoning models have achieved state-of-the-art performance on challenging mathematical benchmarks, yet the internal mechanisms underlying their success remain poorly understood. In this work, we introduce the notion of a reasoning graph, extracted by clustering hidden-state representations at each reasoning step, and systematically analyze three key graph-theoretic properties: cyclicity, diameter, and small-world index, across multiple tasks (GSM8K, MATH500, AIME 2024). Our findings reveal that distilled reasoning models (e.g., DeepSeek-R1-Distill-Qwen-32B) exhibit significantly more recurrent cycles (about 5 per sample), substantially larger graph diameters, and pronounced small-world characteristics (about 6x) compared to their base counterparts. Notably, these structural advantages grow with task difficulty and model capacity, with cycle detection peaking at the 14B scale and exploration diameter maximized in the 32B variant, correlating positively with accuracy. Furthermore, we show that supervised fine-tuning on an improved dataset systematically expands reasoning graph diameters in tandem with performance gains, offering concrete guidelines for dataset design aimed at boosting reasoning capabilities. By bridging theoretical insights into reasoning graph structures with practical recommendations for data construction, our work advances both the interpretability and the efficacy of large reasoning models.
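As a rough illustration of the pipeline the abstract describes, a reasoning graph can be built by clustering per-step hidden states with K-means and linking consecutive clusters, after which cyclicity and diameter fall out of standard graph routines. The function names, cluster count, and toy data below are illustrative assumptions, not the paper's actual extraction code:

```python
# Sketch of reasoning-graph construction (illustrative, not the paper's code):
# cluster per-step hidden states, link consecutive clusters, then inspect
# cyclicity and diameter of the resulting directed graph.
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans

def build_reasoning_graph(hidden_states: np.ndarray, k: int = 4) -> nx.DiGraph:
    """Cluster hidden states of shape (num_steps, dim) and add an edge
    for every transition between distinct consecutive clusters."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(hidden_states)
    g = nx.DiGraph()
    g.add_nodes_from(range(k))
    for a, b in zip(labels, labels[1:]):
        if a != b:  # ignore self-transitions within the same cluster
            g.add_edge(int(a), int(b))
    return g

def graph_properties(g: nx.DiGraph) -> dict:
    """Cyclicity and (undirected) diameter of a reasoning graph."""
    has_cycle = not nx.is_directed_acyclic_graph(g)
    und = g.to_undirected()
    diameter = nx.diameter(und) if nx.is_connected(und) else 0
    return {"has_cycle": has_cycle, "diameter": diameter}

rng = np.random.default_rng(0)
states = rng.normal(size=(30, 16))  # 30 reasoning steps, 16-dim states (toy data)
g = build_reasoning_graph(states, k=4)
print(graph_properties(g))
```

In this sketch a cycle means the chain of reasoning steps revisits an earlier cluster of hidden states, which is the structural signature the paper associates with re-examination behavior.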

The base model demonstrates relatively simple and predominantly acyclic reasoning graphs. In contrast, the large reasoning model exhibits more complex structures, characterized by frequent cyclic patterns and broader node coverage.

To quantitatively validate these qualitative observations, we employed the cycle detection method introduced in Section 3.2. Figure 4 shows cycle detection rates for the GSM8K, MATH500, and AIME 2024 datasets, comparing the large reasoning model (blue) with the base model (orange). The horizontal axis denotes relative depths of the hidden layers (0.1, 0.3, 0.5, 0.7, and 0.9), corresponding respectively to layers 6, 19, 32, 45, and 58 of the 64-layer Qwen2.5-32B. Across all layers, the large reasoning model consistently exhibited a notably higher frequency of cyclic reasoning graphs than the base model. We also observed higher cycle detection rates at the earlier and later layers, with lower rates in the intermediate layers. This pattern suggests that intermediate layers compress token representations, making cycle detection more difficult, whereas layers closer to the input or output exhibit clearer cyclic behavior. Results for varying the K-means hyperparameter k are provided in Appendix D and show consistent trends across all tested values of k. Furthermore, cycle detection ratios consistently increase with task complexity, progressing from GSM8K through MATH500 to AIME 2024. These findings reinforce the hypothesis that cycles within reasoning graphs contribute to the enhanced reasoning capabilities observed in large reasoning models.
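The per-dataset measurement above can be sketched as a cycle-detection rate over samples, together with the relative-depth-to-layer mapping for a 64-layer model. This is a hedged reconstruction, not the paper's Section 3.2 code; the synthetic samples stand in for real hidden states:

```python
# Illustrative sketch (not the paper's implementation): cycle-detection rate
# over a set of samples, plus the relative-depth -> layer-index mapping that
# yields layers 6, 19, 32, 45, 58 for the 64-layer Qwen2.5-32B.
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans

def layer_at_depth(depth: float, num_layers: int = 64) -> int:
    """Map a relative depth (e.g. 0.3) to an absolute layer index."""
    return round(depth * num_layers)

def has_cycle(step_states: np.ndarray, k: int = 4) -> bool:
    """True if one sample's reasoning graph contains a directed cycle."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(step_states)
    g = nx.DiGraph()
    g.add_edges_from((int(a), int(b)) for a, b in zip(labels, labels[1:]) if a != b)
    return not nx.is_directed_acyclic_graph(g)

def cycle_detection_rate(samples, k: int = 4) -> float:
    """Fraction of samples whose reasoning graph is cyclic."""
    return sum(has_cycle(s, k) for s in samples) / len(samples)

print([layer_at_depth(d) for d in (0.1, 0.3, 0.5, 0.7, 0.9)])  # [6, 19, 32, 45, 58]

rng = np.random.default_rng(0)
samples = [rng.normal(size=(rng.integers(10, 40), 16)) for _ in range(8)]
print(f"cycle detection rate: {cycle_detection_rate(samples):.2f}")
```

Sweeping `k` as in Appendix D would simply repeat `cycle_detection_rate` for each cluster count and compare the resulting rates.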

In this work, we conducted an extensive analysis of reasoning graphs derived from large reasoning models, uncovering key structural properties that correlate with their enhanced performance. Our main findings highlight that large reasoning models consistently exhibit (1) greater cyclicity, (2) broader exploratory behaviors (larger diameters), and (3) pronounced small-world characteristics compared to base models. These insights suggest sophisticated structures in reasoning graphs as a critical factor driving reasoning improvements. Our results connect several observed behaviors in large reasoning models and offer implications for constructing more effective training datasets.
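The small-world index in finding (3) is conventionally computed as sigma = (C / C_rand) / (L / L_rand): clustering relative to a size-matched random graph, divided by average path length relative to the same, with sigma well above 1 indicating small-world structure. A minimal sketch under that standard definition (the random reference and toy graph here are illustrative, not the paper's setup):

```python
# Standard small-world index sigma = (C / C_rand) / (L / L_rand), sketched
# against a size-matched connected random reference graph. Toy graph only.
import networkx as nx

def small_world_index(g: nx.Graph, seed: int = 0) -> float:
    """sigma >> 1 suggests high clustering with short paths (small-world)."""
    n, m = g.number_of_nodes(), g.number_of_edges()
    # Resample until the random reference is connected and has some clustering,
    # so both ratios below are well defined.
    while True:
        r = nx.gnm_random_graph(n, m, seed=seed)
        if nx.is_connected(r) and nx.average_clustering(r) > 0:
            break
        seed += 1
    C, L = nx.average_clustering(g), nx.average_shortest_path_length(g)
    Cr, Lr = nx.average_clustering(r), nx.average_shortest_path_length(r)
    return (C / Cr) / (L / Lr)

# A Watts-Strogatz ring lattice with light rewiring is the canonical
# small-world graph, used here purely as a stand-in for a reasoning graph.
g = nx.connected_watts_strogatz_graph(n=30, k=4, p=0.1, seed=0)
print(f"sigma = {small_world_index(g):.2f}")
```

Applied to reasoning graphs, a larger sigma for the distilled model than for its base counterpart would quantify the "about 6x" small-world gap reported in the abstract.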

Aha Moment Models trained via RL have been reported to exhibit an intriguing phenomenon known as the “aha moment,” in which the model reconsiders its intermediate answers during reasoning [9, 52]. From the perspective of our reasoning graph analysis, this phenomenon is consistent with the observed cyclic structures (as illustrated in Figure 1). Although the “aha moment” was originally identified at the level of generated tokens, our study quantitatively measures this behavior through the cycle properties of reasoning graphs, thereby contributing to a deeper mechanistic understanding of the “aha moment” from the internal states of LLMs.

Overthinking and Underthinking Recent studies have highlighted specific reasoning inefficiencies in large reasoning models. Overthinking, characterized by redundant or excessively long reasoning processes, has been frequently observed, particularly in agent-based tasks [27, 42, 7, 13]. Conversely, models in the o1 family display underthinking, rapidly switching thoughts without adequately exploring potentially valuable reasoning paths [48]. These phenomena align closely with the graph properties we have analyzed: redundant cyclic structures (discussed in Section 4.4) explain overthinking, while overly extensive exploratory behaviors (reflected in larger graph diameters, discussed in Section 4.2) may account for underthinking. Thus, our research clarifies these unique behaviors of large reasoning models through the lens of reasoning graph characteristics.