Why does reasoning graph topology evolve differently across training phases?
This explores why the shape of a model's reasoning — how thoughts branch, connect, and consolidate — changes from phase to phase as it learns, rather than settling into one fixed structure.
The corpus doesn't treat reasoning structure as one frozen thing; it treats topology as something that organizes itself over time. The most direct answer comes from work showing that reasoning graphs drift toward a particular stable regime. When an agent reasons by iteratively building a graph, the structure evolves toward a 'critical state' in which semantic entropy persistently dominates structural entropy: roughly 12% of edges stay semantically surprising even after they're wired in, which is what keeps the system discovering rather than collapsing into repetition (see: Why do reasoning systems keep discovering new connections?). Early phases look different from late phases because the graph is still moving toward that balance point, not sitting on it.
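To make the "surprising edge" idea concrete, here is a minimal sketch of how such a fraction could be measured. It is not the method used in the cited work: the definition of "surprising" (endpoint embeddings with cosine similarity below a cutoff), the function names, and the `threshold` value are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def surprising_edge_fraction(edges, embeddings, threshold=0.5):
    """Fraction of edges whose endpoints are semantically dissimilar.

    ASSUMPTION: an edge counts as "surprising" when the cosine similarity of
    its endpoint embeddings falls below `threshold`. The source does not
    define the measure; this is a stand-in for illustration only.
    """
    if not edges:
        return 0.0
    surprising = sum(
        1 for a, b in edges if cosine(embeddings[a], embeddings[b]) < threshold
    )
    return surprising / len(edges)


# Toy example: "a" and "b" are near-synonyms, "c" is unrelated to "a".
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
edges = [("a", "b"), ("a", "c")]
fraction = surprising_edge_fraction(edges, embeddings)
```

Tracking a quantity like this across graph-building iterations is one way to see whether a structure is still approaching the balance point or has settled near it.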
A second reason the phases differ is that different training stages do fundamentally different jobs. One striking line of evidence argues that base pretraining already installs the *capability* to reason in latent form, and that RL post-training mostly teaches *when* to deploy it rather than *how* to do it: hybrid models recover 91% of the performance gains by routing tokens alone, and the activation vectors for reasoning strategies exist before any RL touches them (see: Does RL post-training create reasoning or just deploy it?). If early and late training operate on different variables (forming the latent structure vs. tuning its triggering), you'd expect the observable topology to change qualitatively, not just get 'more of the same.'
It helps to remember that topology here isn't a metaphor; it's the actual computational structure. Chain-of-thought is a path graph, tree-of-thought is a tree, and graph-of-thought allows in-degree greater than one, which is what lets it synthesize multiple sub-results in a way trees can't express (see: Can reasoning topologies be formally classified as graph types?). So when topology 'evolves,' the model is gaining or losing the ability to do things like divide-and-conquer merging. That reframes the question: phases differ because the model is acquiring structural moves, not just better answers.
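The graph-type distinction can be checked mechanically. The sketch below classifies a directed edge list as path-like (CoT), tree-like (ToT), or a general graph (GoT) using only in-degree, out-degree, and reachability. The function name and the three return labels are illustrative choices, not taken from the cited work.

```python
from collections import defaultdict


def classify_topology(edges):
    """Classify a directed reasoning graph as 'path', 'tree', or 'graph'.

    'path'  : one root, every node has at most one parent and one child (CoT-like)
    'tree'  : one root, every node has at most one parent, branching allowed (ToT-like)
    'graph' : anything else, e.g. a node with in-degree > 1 or a cycle (GoT-like)
    """
    edges = set(edges)  # ignore duplicate edges
    if not edges:
        return "path"  # treat an empty or single-thought graph as a trivial chain

    nodes = set()
    children = defaultdict(list)
    in_deg = defaultdict(int)
    for src, dst in edges:
        nodes.update((src, dst))
        children[src].append(dst)
        in_deg[dst] += 1

    # Merging sub-results requires a node with more than one parent.
    if any(in_deg[n] > 1 for n in nodes):
        return "graph"

    roots = [n for n in nodes if in_deg[n] == 0]
    if len(roots) != 1:
        return "graph"  # no root means a cycle; several roots means a forest

    # Every node must be reachable from the single root.
    seen, stack = set(), [roots[0]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    if seen != nodes:
        return "graph"

    is_chain = all(len(children[n]) <= 1 for n in nodes)
    return "path" if is_chain else "tree"


chain = [("a", "b"), ("b", "c")]
branching = [("r", "a"), ("r", "b"), ("a", "c")]
merging = [("r", "a"), ("r", "b"), ("a", "m"), ("b", "m")]
```

Here `merging` is the divide-and-conquer case: node `m` has two parents, which neither a path nor a tree can represent.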
The corpus also offers a cautionary counter-thread worth sitting with: structure may matter more than the *content* flowing through it. Models trained on deliberately corrupted or irrelevant reasoning traces perform comparably to those trained on correct ones, suggesting traces act as computational scaffolding rather than meaningful steps (see: Do reasoning traces need to be semantically correct?), and a parallel line argues that CoT is constrained imitation of reasoning's *form* rather than genuine inference (see: What makes chain-of-thought reasoning actually work?). If that's right, what evolves across phases is the scaffold's shape and the model's discipline in using it. That would explain why interventions like penalizing premature thought-switching can sharpen accuracy purely at decoding time, with no retraining at all (see: Do reasoning models switch between ideas too frequently?; Why do reasoning models abandon promising solution paths?).
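A decoding-time thought-switching penalty can be sketched in a few lines. This is a simplified illustration of the general idea, not the published TIP implementation: the function names, the `penalty` and `window` parameters, their default values, and the choice of which token ids count as "switch" tokens are all assumptions.

```python
def penalize_thought_switch(logits, switch_token_ids, steps_in_thought,
                            penalty=3.0, window=600):
    """Return a copy of `logits` with thought-switch tokens made less likely.

    ASSUMPTIONS (hypothetical, for illustration):
      - `switch_token_ids` holds ids of tokens that open a new line of
        thought (for example, a token like "Alternatively").
      - The penalty applies only while the current thought is younger than
        `window` decoding steps; after that, switching is unpenalized.
    """
    adjusted = list(logits)
    if steps_in_thought < window:
        for token_id in switch_token_ids:
            adjusted[token_id] -= penalty
    return adjusted


def greedy_pick(logits):
    """Index of the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy vocabulary of three tokens; token 1 is the "switch" token.
logits = [1.0, 2.5, 2.0]
early = greedy_pick(penalize_thought_switch(logits, {1}, steps_in_thought=10))
late = greedy_pick(penalize_thought_switch(logits, {1}, steps_in_thought=1000))
```

Early in a thought the switch token loses to token 2; once the window has passed, the model is free to switch. Nothing about the model's weights changes, which is the point the paragraph above makes.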
The upshot: 'topology evolving across phases' isn't one phenomenon but at least three layered ones. There is a self-organizing drift toward a discovery-sustaining critical point, a handoff between training stages that build capability vs. tune its deployment, and a gradual unlocking of structural operations (branching, merging) that simpler topologies couldn't perform. The open and slightly unsettling question underneath all of it is whether the evolving structure is carrying real reasoning or just an increasingly well-shaped imitation of it.
Sources (7 notes)
Analysis shows iterative graph reasoning evolves toward a stable phase where semantic entropy persistently dominates structural entropy, with ~12% of edges remaining semantically surprising despite structural connection, fueling ongoing discovery.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL is applied.
CoT, ToT, and GoT map precisely to path graphs, trees, and arbitrary directed graphs respectively. The topology is not metaphorical but defines actual computational structure—GoT's in-degree > 1 enables divide-and-conquer synthesis that trees cannot express.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.