LLM Reasoning and Architecture · Reinforcement Learning for LLMs

How does multi-hop reasoning develop during transformer training?

Does implicit multi-hop reasoning emerge gradually through distinct phases? This note explores whether transformers move from memorization to compositional generalization, and what internal mechanisms enable that shift.

Note · 2026-02-22 · sourced from Reasoning Logic Internal Rules
What makes chain-of-thought reasoning actually work? How do LLMs fail to know what they seem to understand? How should researchers navigate LLM reasoning research?

Training transformers from scratch in a controlled symbolic environment reveals that implicit multi-hop reasoning — answering compositional queries without verbalizing intermediate steps — emerges through three distinct developmental stages:

Phase I: Memorization. The model fits training data (atomic facts and 2-hop compositions) quickly. Generalization to unseen queries remains minimal.

Phase II: In-Distribution Generalization. After memorization saturates, the model begins generalizing to unseen ID-ID compositions — a shift from memorization to compositional reasoning within the training distribution. This resembles grokking: generalization emerges well after memorization converges.

Phase III: Cross-Distribution Reasoning. The model learns to compose OOD triples in the first hop with ID triples in the second. This transition unfolds more slowly than the shift into Phase II. Crucially, generalization consistently fails whenever the second hop draws on OOD triples, revealing a stronger bottleneck in the second relational step.
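The controlled symbolic environment behind these phases can be sketched concretely. Below is a hypothetical reconstruction (entity counts, split ratio, and naming are assumptions, not the original setup): relations are random mappings over entities, atomic facts are split into ID and OOD pools, and 2-hop queries are bucketed by which pool each hop draws on.

```python
import random

random.seed(0)

# Hypothetical reconstruction of the controlled symbolic environment:
# entities are integers and each relation is a random entity -> entity map.
N_ENT, N_REL = 100, 10
entities = list(range(N_ENT))
relations = {r: {e: random.randrange(N_ENT) for e in entities}
             for r in range(N_REL)}

# Atomic facts are (relation, head) pairs; split them into an
# in-distribution (ID) pool and an out-of-distribution (OOD) pool.
atomic = [(r, e) for r in relations for e in entities]
random.shuffle(atomic)
cut = int(0.8 * len(atomic))
id_facts = set(atomic[:cut])

def two_hop(r1, e, r2):
    """Compose two relations: return (bridge entity, final answer)."""
    mid = relations[r1][e]
    return mid, relations[r2][mid]

# Bucket every 2-hop query by which pool each hop's fact falls in.
buckets = {"ID-ID": [], "ID-OOD": [], "OOD-ID": [], "OOD-OOD": []}
for r1, e in atomic:
    for r2 in relations:
        mid, ans = two_hop(r1, e, r2)
        first = "ID" if (r1, e) in id_facts else "OOD"
        second = "ID" if (r2, mid) in id_facts else "OOD"
        buckets[first + "-" + second].append(((r1, e, r2), ans))
```

In this framing, the ID-ID bucket is what Phase II unlocks, OOD-ID is what Phase III unlocks, and the ID-OOD and OOD-OOD buckets are where generalization consistently fails.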

Two mechanistic findings deepen the picture:

Cosine clustering as signature. Successful reasoning correlates with consistent clustering of intermediate entity representations within cosine similarity space. Models that reason well show intermediate representations that cluster by entity identity across diverse queries. This clustering provides a geometric explanation for when reasoning works and when it fails.
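A minimal way to quantify this signature, assuming access to hidden states at the intermediate-entity position (the helper name and the toy data below are illustrative, not from the original work): compare mean within-entity cosine similarity against mean between-entity similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_clustering_score(reps, entity_ids):
    """Mean within-entity cosine similarity minus mean between-entity
    similarity over intermediate-step representations.  Higher scores
    mean representations cluster by the identity of the bridge entity."""
    reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sims = reps @ reps.T
    same = entity_ids[:, None] == entity_ids[None, :]
    off_diag = ~np.eye(len(reps), dtype=bool)
    return sims[same & off_diag].mean() - sims[~same].mean()

# Toy stand-in for hidden states at the intermediate-entity position:
# 5 bridge entities, 20 queries each, in a 64-dim space.
centers = rng.normal(size=(5, 64))
ids = np.repeat(np.arange(5), 20)
clustered = centers[ids] + 0.1 * rng.normal(size=(100, 64))  # "reasoning works"
scattered = rng.normal(size=(100, 64))                       # "reasoning fails"
```

The clustered representations score near 1, the scattered ones near 0, matching the claim that clustering separates queries the model reasons through from queries it does not.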

Query-level exposure is required. Second-hop generalization fails unless the model encounters the exact compositional structure during training. Single-hop knowledge does not automatically compose into multi-hop capability, a finding that bears directly on Do language models actually use their encoded knowledge?: encoding facts individually doesn't guarantee they compose.
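A query-level split that isolates this finding might look like the following sketch (relation counts and the number of held-out pairs are arbitrary choices): every atomic fact appears among the single-hop training items, yet selected relation pairs never appear composed.

```python
import random

random.seed(1)

N_ENT, N_REL = 50, 6
relations = {r: {e: random.randrange(N_ENT) for e in range(N_ENT)}
             for r in range(N_REL)}

# Hold out relation pairs at the query level: both hops of a held-out
# pair still appear as single-hop training facts, but the composed
# query r2(r1(e)) is never shown for that pair.
pairs = [(r1, r2) for r1 in relations for r2 in relations]
random.shuffle(pairs)
held_out = set(pairs[:6])

train_queries = [(r1, e, r2) for (r1, r2) in pairs
                 if (r1, r2) not in held_out for e in range(N_ENT)]
test_queries = [(r1, e, r2) for (r1, r2) in held_out for e in range(N_ENT)]

# Every atomic fact is available as a single-hop training item, so any
# failure on test_queries is a failure to compose, not a missing fact.
single_hop_train = {(r, e) for r in relations for e in range(N_ENT)}
```

The finding predicts near-chance accuracy on `test_queries` despite every fact they rely on being present in `single_hop_train`.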

Grokking provides parallel three-phase evidence. The "Progress Measures for Grokking via Mechanistic Interpretability" paper reverse-engineers the grokking phenomenon in transformers trained on modular addition, revealing three continuous phases that closely parallel the three developmental stages above: (1) memorization — the model fits training data quickly, (2) circuit formation — structured mechanisms gradually amplify in the weights (the generalizing circuit emerges), and (3) cleanup — memorizing components are removed. The parallel between memorization → ID generalization → cross-distribution reasoning and memorization → circuit formation → cleanup suggests a shared underlying dynamic: generalization requires extended training well beyond the point of memorization, and proceeds through the gradual formation of structured internal mechanisms. The grokking paper confirms this with a mechanistic explanation: the generalizing circuit uses discrete Fourier transforms and trigonometric identities. See What happens inside models when they suddenly generalize?.
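That mechanism can be checked outside any network: a single Fourier frequency plus the product-to-sum trig identities already computes modular addition exactly. The sketch below assumes one frequency (k = 5, p = 97, both chosen for the demo); the trained networks in the paper sum contributions from several key frequencies.

```python
import numpy as np

p = 97
w = 2 * np.pi * 5 / p   # one key frequency, k = 5 (assumed for the demo)

def mod_add_via_fourier(a, b):
    """Return (a + b) mod p using only single-argument trig values,
    mirroring the generalizing circuit: the logit for candidate c is
    cos(w * (a + b - c)), which peaks exactly at c == (a + b) mod p."""
    c = np.arange(p)
    # Product-to-sum expansion: the circuit multiplies embedding terms
    # like cos(w*a) and cos(w*b); it never computes a + b directly.
    cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
    sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
    logits = cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(logits))
```

With p prime and gcd(k, p) = 1 the argmax is unique for every input pair; summing several frequencies, as the trained networks do, sharpens the peak further.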

The three-stage trajectory has implications for understanding RL-trained reasoning models. If, as Do base models already contain hidden reasoning ability? suggests, the capability is largely latent in the base model, the question becomes: which stage does RL training target? If RL primarily accelerates Phase II (ID generalization), that would bear on Does the choice of RL algorithm actually matter for reasoning?: different algorithms may simply trigger the same phase transition.



Original note title: implicit multi-hop reasoning in transformers emerges through three developmental stages with cosine clustering as the mechanistic signature