The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning by imitating humans or non-Long-CoT LLMs. To understand why, we propose that effective and learnable Long CoT trajectories exhibit stable molecular-like structures under a unified view, formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals that these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides the synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.
Recently, large language models (LLMs) have excelled on diverse reasoning tasks via explicit chain-of-thought (CoT) rationales [1–4]. Yet they struggle to cold-start from instruction-tuned or base models into Long CoT models capable of extended multi-step reasoning [5, 6]. Notably, Du et al. [7] show that humans generate Long CoT rationales without imitating DeepSeek-R1 [8]. Our preliminary studies reveal that standard supervised fine-tuning and distillation from human or instruction-LLM rationales (using randomly sampled Long CoT examples) fail to reliably instill these skills: models often lose coherence over long trajectories or fail to transfer patterns to novel tasks. This prompts a key question:
How do Large Language Models learn and represent effective Long Chain-of-Thought?
To explain this, we posit that models acquire the organization of reasoning trajectories. As shown in Figure 2, prior studies model these trajectories as logic nodes in sequences or trees of steps. Yet our analysis of Long CoT across strong reasoning models reveals a stable distribution of three core behaviors across tasks and architectures: Deep-Reasoning, Self-Reflection, and Self-Exploration [5], which node-centric views fail to capture. This finding motivates a molecular-inspired, distributional view: we model behavior-labeled logic edges as interaction bonds and examine how their global molecular-like structure ensures long-horizon reasoning stability. Specifically, Deep-Reasoning forms dense local clusters of coupled deductions, like covalent bonds; Self-Reflection creates long-range corrective links to prior steps, like hydrogen bonds; and Self-Exploration forges weak bridges between distant clusters, like van der Waals forces. Thus, high-quality Long CoT arises from the stable composition and arrangement of these bond types, guiding effective learning.
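To make this view concrete, the following sketch encodes a trajectory as behavior-typed logic edges over step nodes. The CoTGraph class, the toy trajectory, and all identifiers are illustrative assumptions for exposition, not the paper's implementation; only the three bond-type names follow the taxonomy above.

```python
from dataclasses import dataclass, field

# Behavior labels for logic edges; the names follow the paper's taxonomy,
# everything else in this sketch is an illustrative assumption.
BOND_TYPES = ("deep_reasoning",    # covalent-like: dense local deductions
              "self_reflection",   # hydrogen-bond-like: long-range corrections
              "self_exploration")  # van-der-Waals-like: weak bridges between clusters

@dataclass
class CoTGraph:
    steps: list[str]  # reasoning steps (nodes)
    bonds: list[tuple[int, int, str]] = field(default_factory=list)  # (src, dst, type)

    def add_bond(self, src: int, dst: int, bond_type: str) -> None:
        assert bond_type in BOND_TYPES
        self.bonds.append((src, dst, bond_type))

    def bond_distribution(self) -> dict[str, float]:
        """Fraction of each bond type -- the trajectory's behavior distribution."""
        total = max(len(self.bonds), 1)
        return {t: sum(1 for _, _, bt in self.bonds if bt == t) / total
                for t in BOND_TYPES}

# Toy usage: step 2 deduces from step 1 (covalent-like), step 4 revisits
# step 0 (hydrogen-bond-like), step 5 bridges two distant clusters.
g = CoTGraph(steps=["premise", "derive", "derive", "branch", "check", "bridge"])
g.add_bond(1, 2, "deep_reasoning")
g.add_bond(4, 0, "self_reflection")
g.add_bond(5, 3, "self_exploration")
print(g.bond_distribution())
```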
In this framework, we define semantic isomers as Long CoT trajectories that solve the same tasks and visit similar semantic regions but differ in behavior distributions and transitions. We demonstrate that multiple near-optimal semantic isomers exist per task family, but mixing stable isomers from different strong teachers destabilizes learning, degrading performance and behavior distributions despite matched token statistics. This structurally explains why combining heterogeneous Long CoT traces often fails, beyond token-level distillation. Building on this perspective, we propose Mole-Syn, a structure-aware synthesis framework that first estimates a behavior transition graph from strong reasoning models and then transfers only this behavioral structure to cheaper instruction LLMs via controlled trajectory synthesis, instead of directly copying teacher outputs. This decouples structural transfer from model-specific surface form, enables the generation of Long CoT data that match target behavior distributions from scratch, and yields consistent gains in both Long CoT performance and RL stability across six benchmarks.
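As a rough illustration of the distribution-transfer idea, the sketch below estimates a first-order behavior transition matrix from behavior-labeled teacher traces and samples a behavior schedule that could condition a cheaper instruction LLM. The label set, the add-one smoothing, and all function names are our assumptions; the actual Mole-Syn pipeline may differ.

```python
import numpy as np

BEHAVIORS = ["deep_reasoning", "self_reflection", "self_exploration", "normal_operation"]
IDX = {b: i for i, b in enumerate(BEHAVIORS)}

def estimate_transition_graph(labeled_traces: list[list[str]]) -> np.ndarray:
    """Estimate a first-order behavior transition matrix from teacher traces.

    labeled_traces: each trace is a sequence of behavior labels, e.g.
    ["deep_reasoning", "deep_reasoning", "self_reflection", ...].
    """
    counts = np.ones((len(BEHAVIORS), len(BEHAVIORS)))  # add-one smoothing
    for trace in labeled_traces:
        for a, b in zip(trace, trace[1:]):
            counts[IDX[a], IDX[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)   # row-normalize

def sample_behavior_schedule(T: np.ndarray, length: int,
                             start: str = "deep_reasoning",
                             seed: int = 0) -> list[str]:
    """Sample a behavior sequence to guide controlled trajectory synthesis,
    transferring structure without copying teacher surface text."""
    rng = np.random.default_rng(seed)
    schedule, state = [start], IDX[start]
    for _ in range(length - 1):
        state = rng.choice(len(BEHAVIORS), p=T[state])
        schedule.append(BEHAVIORS[state])
    return schedule
```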
We then analyze the role each bond type plays in shaping Long CoT structure. Deep-Reasoning bonds encode the core logical flow, Self-Reflection bonds support folding pathways back to previous steps, and Self-Exploration bonds reinforce long-range consistency checks, enabling targeted control of bond distributions. Moreover, we discuss why a deteriorated molecular structure is hard to restore, which helps explain how proprietary LLMs can protect Long CoT structures from distillation-based imitation: methods such as summarization and reasoning compression disrupt Long CoT structure, limiting unauthorized replication of internal reasoning processes.
In summary, our contributions are as follows:
• We model Long CoT as a molecular structure with three bond types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like), to explain its effective learning.
• We identify Effective Semantic Isomers for Long CoT learning, where only entropy-convergent bonds enable stable learning, while competing structures destabilize training.
• We introduce Mole-Syn, which uses distribution-transfer graphs to synthesize these structures, improving Long CoT performance and stabilizing RL across six benchmarks.
Only distillation from strong reasoning LLMs works. To identify effective data sources, we curated a synthetic set of reasoning traces from three sources. As shown in Figure 3, only distillation from strong reasoning LLMs enables target models to learn and retain Long CoT structure, improving performance on benchmarks requiring extended reasoning. These results indicate that only high-quality traces reliably support both learning and use of Long CoT structures.
Even human-annotated Long-CoT-like traces fail. Inspired by Du et al. [7], we test whether human step-by-step solutions can induce Long CoT. We collect human solutions for complex reasoning tasks and fine-tune LLMs on them. Figure 4 shows that training on human traces does not reproduce the Long CoT gains obtained from distilling strong reasoning models, suggesting that human solutions aid local problem solving but may not reliably encode the abstractions needed for long-horizon reasoning distributions.
Deep Reasoning as Covalent Bonds Deep reasoning forms the backbone of the thought process, analogous to the covalent bonds defining a molecule’s primary chain. It encodes strong logical dependencies (Step A must justify Step B), maintaining direction and continuity; breaking this backbone undermines the subsequent steps and destabilizes the answer. By contrast, “Normal Operation” corresponds to stable local bonds within each step, capturing routine computation and direct semantic expression.
Self-Reflection as Hydrogen Bonds Reflection is a key stabilizer. Just as proteins gain stability when their chains fold and form intra-chain hydrogen bonds, reasoning stabilizes when later steps (e.g., Step 100) test, revise, or reinforce earlier premises (e.g., Step 10). These long-range links constrain drift and hallucination, turning a long sequence into a more self-consistent structure. If later checks fail to align with earlier commitments, the reasoning cannot “fold,” indicating a structural logical error.
To clarify the nature of semantic isomers, we examine which bond structures yield effective reasoning configurations. We hypothesize that functional viability depends on specific bond distributions: despite sharing identical conceptual nodes, incompatible configurations disrupt information exchange. For instance, excessive exploration bonds cause fragmented reasoning, whereas overemphasized deep reasoning bonds create rigid chains unable to adapt to new inputs. Details are provided in Appendix D.3.
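As a hypothetical way to quantify how far an isomer's bond distribution drifts from an effective configuration, one could use a simple divergence measure over bond-type frequencies. The reference and isomer distributions below are placeholders for exposition, not measured values from our experiments.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
    """KL(p || q) between two bond-type distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Order: (deep_reasoning, self_reflection, self_exploration).
reference  = np.array([0.60, 0.30, 0.10])  # placeholder "effective" configuration
fragmented = np.array([0.20, 0.10, 0.70])  # exploration-heavy isomer
rigid      = np.array([0.95, 0.04, 0.01])  # deduction-only isomer

for name, dist in [("fragmented", fragmented), ("rigid", rigid)]:
    print(name, kl_divergence(dist, reference))
```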
The distribution of effective reasoning bonds influences the speed of information divergence in reasoning dynamics. To assess this, we compared the reasoning dynamics of R1 models with human cognition in an information phase space [10]. Mechanistically, LLMs update by maximizing reward and reducing entropy, whereas human reasoning is additionally constrained by semantic coherence and social feedback. Consequently, machine reasoning converges through accumulated gradient updates, whereas human reasoning stabilizes through iterative self-monitoring and social calibration. As shown in Figure 10, we tracked how reasoning unfolds over extended chains in logical deduction tasks. Humans typically exhibit nearly uniform forward information gains (81.3% of cases show changes < 0.1), corresponding to a near-zero slope in phase space. In contrast, R1 models display accelerating informativeness (76.1% of cases show absolute changes > 0.1), progressing from low entropy to rapid convergence. These patterns indicate a fundamental difference in how R1 models and humans integrate information over time.
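A minimal sketch of the phase-space statistic described above, assuming a per-step information measure is available (e.g., per-step negative log-likelihood under a scoring model, which we do not specify here). The 0.1 cutoff mirrors the text; the function names are illustrative.

```python
import numpy as np

def forward_information_gains(info: np.ndarray) -> np.ndarray:
    """First differences of a per-step information measure along a chain."""
    return np.diff(info)

def classify_dynamics(info: np.ndarray, threshold: float = 0.1) -> str:
    """Human-like if step-to-step changes in gain stay small (|delta| < threshold);
    R1-like if gains accelerate, i.e. the phase-space slope is clearly nonzero."""
    gains = forward_information_gains(info)
    if np.all(np.abs(np.diff(gains)) < threshold):
        return "near-uniform (human-like, near-zero slope)"
    slope = np.polyfit(np.arange(len(gains)), gains, 1)[0]  # linear trend of gains
    return f"accelerating (R1-like, slope={slope:.3f})"
```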
Understanding how distinct reasoning structures interact reveals fundamental limits of complex cognitive systems. As shown in Fig. 11 (a–c), forcibly fusing stable molecular isomers disrupts their backbone; analogously, combining incompatible reasoning frameworks breaks global logical coherence.
Learning two heterogeneous stable structures simultaneously leads to structural chaos in the model. As shown in Fig. 11 (d), we test this by jointly activating two highly correlated (r ≈ 0.9) reasoning chains from DeepSeek-R1 and OpenAI-OSS. Despite their similarity, co-activation prevents the model from converging to a single stable behavioral mode: it produces molecular bond distributions that fluctuate across samples and deviate from those characteristic of either OSS or R1. Consistent with this instability, the self-correlation of the jointly activated model never exceeds 0.8.
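One plausible reading of this self-correlation statistic is the mean pairwise correlation of per-sample bond distributions; the sketch below assumes those distributions have already been extracted per sample and is not necessarily the paper's exact procedure.

```python
import numpy as np

def self_correlation(bond_dists: np.ndarray) -> float:
    """Mean pairwise Pearson correlation of per-sample bond distributions.

    bond_dists: array of shape (n_samples, n_bond_types). A single stable
    behavioral mode yields highly similar rows (mean correlation near 1);
    the text reports values below 0.8 after joint activation.
    """
    corr = np.corrcoef(bond_dists)            # (n_samples, n_samples)
    n = corr.shape[0]
    off_diag = corr[~np.eye(n, dtype=bool)]   # drop self-pairs on the diagonal
    return float(off_diag.mean())
```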
This structural chaos significantly degrades model performance. As shown in Fig. 11 (e), joint activation also causes a marked performance drop relative to either chain alone. This seemingly paradoxical effect indicates that structural compatibility, rather than mere statistical correlation, governs whether reasoning systems can coexist. The interference pattern suggests that the underlying cognitive architecture is rigid: without careful alignment, attempts to merge such systems yield fragmented, low-utility outputs instead of enhanced capability.