LLM Reasoning and Architecture · Reinforcement Learning for LLMs

How does multi-hop reasoning develop during transformer training?

Does implicit multi-hop reasoning emerge gradually through distinct phases? This note explores whether transformers move from memorization to compositional generalization, and what internal mechanisms enable that shift.

Note · 2026-02-22 · sourced from Reasoning Logic Internal Rules
What makes chain-of-thought reasoning actually work? How do LLMs fail to know what they seem to understand? How should researchers navigate LLM reasoning research?

Training transformers from scratch in a controlled symbolic environment reveals that implicit multi-hop reasoning — answering compositional queries without verbalizing intermediate steps — emerges through three distinct developmental stages:

Phase I: Memorization. The model fits training data (atomic facts and 2-hop compositions) quickly. Generalization to unseen queries remains minimal.

Phase II: In-Distribution Generalization. After memorization saturates, the model begins generalizing to unseen ID-ID compositions — a shift from memorization to compositional reasoning within the training distribution. This resembles grokking: generalization emerges well after memorization converges.

Phase III: Cross-Distribution Reasoning. The model learns to compose OOD triples in the first hop with ID triples in the second. This transition unfolds more slowly than the shift into Phase II. Crucially, generalization consistently fails whenever the second hop draws on OOD triples, revealing a stronger bottleneck in the second relational step.
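The controlled symbolic environment behind these phases can be sketched concretely. Below is a hypothetical reconstruction (entity counts, split ratio, and naming are assumptions, not the original setup): relations are random mappings over entities, atomic facts are split into ID and OOD pools, and 2-hop queries are bucketed by which pool each hop draws on.

```python
import random

random.seed(0)

# Hypothetical reconstruction of the controlled symbolic environment:
# entities are integers and each relation is a random entity -> entity map.
N_ENT, N_REL = 100, 10
entities = list(range(N_ENT))
relations = {r: {e: random.randrange(N_ENT) for e in entities}
             for r in range(N_REL)}

# Atomic facts are (relation, head) pairs; split them into an
# in-distribution (ID) pool and an out-of-distribution (OOD) pool.
atomic = [(r, e) for r in relations for e in entities]
random.shuffle(atomic)
cut = int(0.8 * len(atomic))
id_facts = set(atomic[:cut])

def two_hop(r1, e, r2):
    """Compose two relations: return (bridge entity, final answer)."""
    mid = relations[r1][e]
    return mid, relations[r2][mid]

# Bucket every 2-hop query by which pool each hop's fact falls in.
buckets = {"ID-ID": [], "ID-OOD": [], "OOD-ID": [], "OOD-OOD": []}
for r1, e in atomic:
    for r2 in relations:
        mid, ans = two_hop(r1, e, r2)
        first = "ID" if (r1, e) in id_facts else "OOD"
        second = "ID" if (r2, mid) in id_facts else "OOD"
        buckets[first + "-" + second].append(((r1, e, r2), ans))
```

In this framing, the ID-ID bucket is what Phase II unlocks, OOD-ID is what Phase III unlocks, and the ID-OOD and OOD-OOD buckets are where generalization consistently fails.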

Two mechanistic findings deepen the picture:

Cosine clustering as signature. Successful reasoning correlates with consistent clustering of intermediate entity representations within cosine similarity space. Models that reason well show intermediate representations that cluster by entity identity across diverse queries. This clustering provides a geometric explanation for when reasoning works and when it fails.
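A minimal way to quantify this signature, assuming access to hidden states at the intermediate-entity position (the helper name and the toy data below are illustrative, not from the original work): compare mean within-entity cosine similarity against mean between-entity similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_clustering_score(reps, entity_ids):
    """Mean within-entity cosine similarity minus mean between-entity
    similarity over intermediate-step representations.  Higher scores
    mean representations cluster by the identity of the bridge entity."""
    reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sims = reps @ reps.T
    same = entity_ids[:, None] == entity_ids[None, :]
    off_diag = ~np.eye(len(reps), dtype=bool)
    return sims[same & off_diag].mean() - sims[~same].mean()

# Toy stand-in for hidden states at the intermediate-entity position:
# 5 bridge entities, 20 queries each, in a 64-dim space.
centers = rng.normal(size=(5, 64))
ids = np.repeat(np.arange(5), 20)
clustered = centers[ids] + 0.1 * rng.normal(size=(100, 64))  # "reasoning works"
scattered = rng.normal(size=(100, 64))                       # "reasoning fails"
```

The clustered representations score near 1, the scattered ones near 0, matching the claim that clustering separates queries the model reasons through from queries it does not.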

Query-level exposure is required. Second-hop generalization fails unless the model encounters the exact compositional structure during training. Single-hop knowledge does not automatically compose into multi-hop capability, a finding that bears directly on Do language models actually use their encoded knowledge?: encoding facts individually doesn't guarantee they compose.
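A query-level split that isolates this finding might look like the following sketch (relation counts and the number of held-out pairs are arbitrary choices): every atomic fact appears among the single-hop training items, yet selected relation pairs never appear composed.

```python
import random

random.seed(1)

N_ENT, N_REL = 50, 6
relations = {r: {e: random.randrange(N_ENT) for e in range(N_ENT)}
             for r in range(N_REL)}

# Hold out relation pairs at the query level: both hops of a held-out
# pair still appear as single-hop training facts, but the composed
# query r2(r1(e)) is never shown for that pair.
pairs = [(r1, r2) for r1 in relations for r2 in relations]
random.shuffle(pairs)
held_out = set(pairs[:6])

train_queries = [(r1, e, r2) for (r1, r2) in pairs
                 if (r1, r2) not in held_out for e in range(N_ENT)]
test_queries = [(r1, e, r2) for (r1, r2) in held_out for e in range(N_ENT)]

# Every atomic fact is available as a single-hop training item, so any
# failure on test_queries is a failure to compose, not a missing fact.
single_hop_train = {(r, e) for r in relations for e in range(N_ENT)}
```

The finding predicts near-chance accuracy on `test_queries` despite every fact they rely on being present in `single_hop_train`.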

Grokking provides parallel three-phase evidence. The "Progress Measures for Grokking via Mechanistic Interpretability" paper reverse-engineers the grokking phenomenon in transformers trained on modular addition, revealing three continuous phases that closely parallel the three developmental stages above: (1) memorization — the model fits training data quickly, (2) circuit formation — structured mechanisms gradually amplify in the weights (the generalizing circuit emerges), and (3) cleanup — memorizing components are removed. The parallel between memorization → ID generalization → cross-distribution reasoning and memorization → circuit formation → cleanup suggests a shared underlying dynamic: generalization requires extended training well beyond the point of memorization, and proceeds through the gradual formation of structured internal mechanisms. The grokking paper confirms this with a mechanistic explanation: the generalizing circuit uses discrete Fourier transforms and trigonometric identities. See What happens inside models when they suddenly generalize?.
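That mechanism can be checked outside any network: a single Fourier frequency plus the product-to-sum trig identities already computes modular addition exactly. The sketch below assumes one frequency (k = 5, p = 97, both chosen for the demo); the trained networks in the paper sum contributions from several key frequencies.

```python
import numpy as np

p = 97
w = 2 * np.pi * 5 / p   # one key frequency, k = 5 (assumed for the demo)

def mod_add_via_fourier(a, b):
    """Return (a + b) mod p using only single-argument trig values,
    mirroring the generalizing circuit: the logit for candidate c is
    cos(w * (a + b - c)), which peaks exactly at c == (a + b) mod p."""
    c = np.arange(p)
    # Product-to-sum expansion: the circuit multiplies embedding terms
    # like cos(w*a) and cos(w*b); it never computes a + b directly.
    cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
    sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
    logits = cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(logits))
```

With p prime and gcd(k, p) = 1 the argmax is unique for every input pair; summing several frequencies, as the trained networks do, sharpens the peak further.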

The three-stage trajectory has implications for understanding RL-trained reasoning models. If, as Do base models already contain hidden reasoning ability? suggests, the capability is largely latent in the base model, the question becomes: which stage does RL training target? If RL primarily accelerates Phase II (ID generalization), that would bear on Does the choice of RL algorithm actually matter for reasoning?: different algorithms may simply trigger the same phase transition.



Original note title: implicit multi-hop reasoning in transformers emerges through three developmental stages with cosine clustering as the mechanistic signature