Why does reasoning graph topology evolve differently across training phases?

This explores why the shape of a model's reasoning — how thoughts branch, connect, and consolidate — changes from phase to phase as it learns, rather than settling into one fixed structure.

The corpus doesn't treat reasoning structure as one frozen thing; it treats topology as something that organizes itself over time. The most direct answer comes from work showing that reasoning graphs drift toward a particular stable regime. When an agent reasons by iteratively building a graph, the structure evolves toward a 'critical state' in which semantic entropy persistently exceeds structural entropy: roughly 12% of edges stay semantically surprising even after they're wired in, which is what keeps the system discovering rather than collapsing into repetition (see: Why do reasoning systems keep discovering new connections?). Early phases look different from late phases because the graph is moving toward that balance point, not sitting on it.
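To make the "surprising edge" idea concrete, here is a toy sketch. It assumes, purely for illustration, that an edge counts as semantically surprising when the cosine similarity between its endpoint embeddings falls below a threshold. The function names, the threshold, and this definition are all assumptions; the notes do not say how the underlying work measures surprise.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def surprising_edge_fraction(edges, embeddings, threshold=0.5):
    """Fraction of edges whose endpoints are semantically distant.

    edges: list of (node_a, node_b) pairs already present in the graph.
    embeddings: dict mapping node -> vector (hypothetical semantic embedding).
    threshold: assumed cutoff; similarity below it counts as "surprising".
    """
    if not edges:
        return 0.0
    surprising = sum(
        1 for a, b in edges
        if cosine(embeddings[a], embeddings[b]) < threshold
    )
    return surprising / len(edges)

# Tiny made-up example: one close pair, one distant pair.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
edges = [("a", "b"), ("a", "c")]
fraction = surprising_edge_fraction(edges, embeddings)
```

Tracking a quantity like this after each graph-building iteration is one way to see whether it settles near a stable value rather than decaying toward zero.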

A second reason the phases differ is that different training stages do fundamentally different jobs. One striking line of evidence argues that base pretraining already installs the *capability* to reason in latent form, and that RL post-training mostly teaches *when* to deploy it rather than *how* to do it: hybrid models recover 91% of the performance gains through token routing alone, and the activation vectors for reasoning strategies exist before any RL touches them (see: Does RL post-training create reasoning or just deploy it?). If early and late training operate on different variables (forming the latent structure versus tuning when it is triggered), you'd expect the observable topology to change qualitatively, not just get 'more of the same.'

It helps to remember that topology here isn't a metaphor; it's the actual computational structure. Chain-of-thought is a path graph, tree-of-thought is a tree, and graph-of-thought is an arbitrary directed graph whose nodes can have in-degree greater than one, which is what lets it synthesize multiple sub-results in a way trees can't express (see: Can reasoning topologies be formally classified as graph types?). So when topology 'evolves,' the model is gaining or losing the ability to do things like divide-and-conquer merging. That reframes the question: phases differ because the model is acquiring structural moves, not just better answers.
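The graph-type distinction can be checked mechanically. The sketch below is a minimal illustration, assuming a reasoning trace has already been extracted as a connected set of (parent, child) edges; the function name and input format are hypothetical. It labels a structure as a path, a tree, or a general graph using only in-degree and out-degree.

```python
from collections import defaultdict

def classify_topology(edges):
    """Classify a directed reasoning graph as 'path', 'tree', or 'graph'.

    edges: iterable of (parent, child) pairs, meaning "child builds on parent".
    Assumes the graph is weakly connected and has at least one edge.
    """
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    nodes = set()
    for parent, child in set(edges):  # ignore duplicate edges
        out_deg[parent] += 1
        in_deg[child] += 1
        nodes.update((parent, child))

    # A thought with two or more parents merges sub-results,
    # which neither a path nor a tree can express.
    if any(in_deg[n] > 1 for n in nodes):
        return "graph"

    # A path or tree needs exactly one root; otherwise there is a cycle.
    roots = [n for n in nodes if in_deg[n] == 0]
    if len(roots) != 1:
        return "graph"

    # No branching at all: a single chain of thoughts.
    if all(out_deg[n] <= 1 for n in nodes):
        return "path"
    return "tree"

chain = [("q", "s1"), ("s1", "s2"), ("s2", "answer")]
branching = [("q", "a"), ("q", "b")]
merging = [("q", "a"), ("q", "b"), ("a", "answer"), ("b", "answer")]
labels = [classify_topology(g) for g in (chain, branching, merging)]
# labels == ["path", "tree", "graph"]
```

The `merging` example is the divide-and-conquer case: the `answer` node has in-degree two, so it combines two sub-results.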

The corpus also offers a cautionary counter-thread worth sitting with: structure may matter more than the *content* flowing through it. Models trained on deliberately corrupted or irrelevant reasoning traces perform comparably to those trained on correct ones, suggesting that traces act as computational scaffolding rather than meaningful steps (see: Do reasoning traces need to be semantically correct?). A parallel line argues that CoT is a constrained imitation of reasoning's *form* rather than genuine inference (see: What makes chain-of-thought reasoning actually work?). If that's right, what evolves across phases is the scaffold's shape and the model's discipline in using it, which is why interventions like penalizing premature thought-switching can sharpen accuracy purely at decoding time, with no retraining at all (see: Do reasoning models switch between ideas too frequently?; Why do reasoning models abandon promising solution paths?).
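A decoding-time penalty on thought-switching can be sketched in a few lines. This is a minimal illustration of the general idea, not the cited method's implementation: the function name, the `penalty` and `min_steps` values, and the representation of logits as a plain list are all assumptions made for the example.

```python
def penalize_thought_switch(logits, switch_token_ids, steps_in_thought,
                            penalty=3.0, min_steps=100):
    """Return a copy of next-token logits with switch tokens discouraged.

    logits: list of floats, one score per vocabulary id.
    switch_token_ids: ids of tokens that open a new line of thought
        (hypothetical; e.g. the token for "Alternatively").
    steps_in_thought: tokens generated since the current thought began.
    penalty, min_steps: illustrative values, not taken from the source.
    """
    adjusted = list(logits)
    # Only discourage switching while the current thought is still young.
    if steps_in_thought < min_steps:
        for token_id in switch_token_ids:
            adjusted[token_id] -= penalty
    return adjusted

def greedy_pick(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [1.0, 2.5, 2.0]   # made-up scores; id 1 is the switch token
early = greedy_pick(penalize_thought_switch(logits, {1}, steps_in_thought=5))
late = greedy_pick(penalize_thought_switch(logits, {1}, steps_in_thought=150))
# early == 2 (switch suppressed), late == 1 (switch allowed)
```

Because the adjustment touches only the logits at sampling time, the model's weights stay unchanged, which is the sense in which such an intervention needs no retraining.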

The thing you might not have known you wanted: 'topology evolving across phases' isn't one phenomenon but at least three layered ones — a self-organizing drift toward a discovery-sustaining critical point, a handoff between training stages that build capability vs. tune its deployment, and a gradual unlocking of structural operations (branching, merging) that simpler topologies couldn't perform. The open and slightly unsettling question underneath all of it is whether the evolving structure is carrying real reasoning or just an increasingly well-shaped imitation of it.

Sources (7 notes)

Why do reasoning systems keep discovering new connections?

Analysis shows iterative graph reasoning evolves toward a stable phase where semantic entropy persistently dominates structural entropy, with ~12% of edges remaining semantically surprising despite structural connection, fueling ongoing discovery.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL is applied.

Can reasoning topologies be formally classified as graph types?

CoT, ToT, and GoT map precisely to path graphs, trees, and arbitrary directed graphs respectively. The topology is not metaphorical but defines actual computational structure—GoT's in-degree > 1 enables divide-and-conquer synthesis that trees cannot express.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining why reasoning graph topology evolves across training phases in LLMs. The question remains open; treat the findings below as dated claims to be stress-tested against newer models and methods.

What a curated library found — and when (spanning Jan 2024–Aug 2025):
• Reasoning graphs drift toward a 'critical state' where ~12% of edges remain semantically surprising, sustaining discovery over repetition (2025-03).
• Base pretraining installs reasoning *capability* latently; RL post-training teaches *when* to deploy it, not *how* — hybrid routing recovers 91% of gains without retraining (implied by 2025-05 findings).
• Chain-of-thought, tree-of-thought, and graph-of-thought are formal graph types; topology evolution means acquiring structural moves like divide-and-conquer merging (2024-01).
• Reasoning traces act as computational scaffolding, not meaningful inference: models trained on deliberately corrupted traces perform comparably to those trained on correct ones (2025-06, 2025-08).
• Penalizing premature thought-switching sharpens accuracy at decoding time with no retraining (2025-01, 2025-05).

Anchor papers (verify; mind their dates):
• arXiv:2401.14295 (2024-01): Demystifying Chains, Trees, and Graphs of Thoughts
• arXiv:2503.18852 (2025-03): Self-Organizing Graph Reasoning Evolves into a Critical State
• arXiv:2506.02878 (2025-06): CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate
• arXiv:2508.01191 (2025-08): Is Chain-of-Thought Reasoning of LLMs a Mirage?

Your task:
(1) RE-TEST EACH CONSTRAINT. For the critical-state claim: have scaling or new training methods since Q3 2025 overturned or refined the ~12% surprising-edge figure? For the capability/deployment split: do recent o1-class or reasoning-specialized models (post-2025-06) show capability *acquisition* during RL, contradicting the 91% routing thesis? For scaffolding vs. inference: have newer mechanistic studies isolated reasoning circuits that survive when traces are removed?
(2) Surface the strongest DISAGREEMENT. The corpus bifurcates on whether topology evolution reflects genuine reasoning discovery or increasingly disciplined imitation. Which papers since July 2025 (if any) directly pit these interpretations against each other? Where does the fault line hold?
(3) Propose 2 research questions that assume the regime may have shifted:
   – If reasoning capability *is* largely preinstalled, what does late-stage RL optimization actually change in the graph — and can we measure it without looking at outputs?
   – If traces are scaffolding, not inference, why does topology *shape* (e.g., critical-state balance) predict downstream generalization better than content?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
