How do Transformers Learn Implicit Reasoning?
Recent work suggests that large language models (LLMs) can perform multi-hop reasoning implicitly—producing correct answers without explicitly verbalizing intermediate steps—but the underlying mechanisms remain poorly understood. In this paper, we study how such implicit reasoning emerges by training transformers from scratch in a controlled symbolic environment. Our analysis reveals a three-stage developmental trajectory: early memorization, followed by in-distribution generalization, and eventually cross-distribution generalization. We find that training on in-distribution atomic triples is not strictly necessary for in-distribution generalization but substantially accelerates it, and that second-hop generalization relies on query-level exposure to specific compositional structures. To interpret these behaviors, we introduce two diagnostic tools: cross-query semantic patching, which identifies semantically reusable intermediate representations, and a cosine-based representational lens, which reveals that successful reasoning correlates with consistent clustering of intermediate representations in cosine-similarity space. This clustering phenomenon in turn provides a coherent explanation for the behavioral dynamics observed across training, linking representational structure to reasoning capability. These findings provide new insights into the interpretability of implicit multi-hop reasoning in LLMs, helping to clarify how complex reasoning processes unfold internally and offering pathways to enhance the transparency of such models.
Large language models (LLMs) demonstrate strong performance on complex, multi-step reasoning tasks [9, 7, 29, 1, 32, 19, 14]. Typically, these reasoning abilities are elicited using chain-of-thought (CoT) prompting, which encourages models to explicitly articulate intermediate reasoning steps [28, 35, 6, 33, 30]. Beyond CoT, recent studies indicate that LLMs can also engage in implicit reasoning [37, 5, 11, 15], producing correct answers without verbalizing intermediate steps.
While implicit reasoning is widely acknowledged, the internal mechanisms that underlie this ability remain unclear. In this paper, we aim to uncover the internal processes of implicit reasoning by examining a concrete, structured scenario: multi-hop implicit reasoning, where the model must answer compositional queries (e.g., (e1, r1, r2) → e3) by implicitly traversing an intermediate entity e2, without explicitly verbalizing it. A fundamental question in this scenario is: does the model genuinely conduct step-by-step reasoning internally, or is it merely recalling the answer from memorized knowledge? Although both behaviors can produce correct outcomes, they reflect fundamentally distinct cognitive processes. This observation motivates our central research question:
How do LLMs acquire and perform implicit reasoning during training and inference?
Existing studies that investigate implicit reasoning often rely on pretrained LLMs whose training data lacks precise experimental control, making it challenging to conclusively determine whether models have genuinely learned implicit multi-step reasoning or instead rely on their prior knowledge or shortcut solutions [12, 34, 5]. Symbolic datasets [25, 27, 26] partially alleviate this concern by training models from scratch, yet they still lack the fine-grained experimental control and behavioral granularity necessary for deeper analysis. To address these limitations, we construct an extended symbolic environment, featuring targeted omissions and query-level variations, to precisely identify whether implicit reasoning and generalization truly emerge.
To facilitate the analysis under our symbolic environment, we introduce two diagnostic tools that overcome specific limitations of prior methods: (1) cross-query semantic patching, which enhances causal interpretability by locating intermediate entity representations based on their semantic transferability across queries rather than solely their impact on final outputs; and (2) a cosine-based representational lens, which avoids assumptions inherent in decoding-based probing by examining structural consistency of internal representations across reasoning contexts. Together, these tools enable precise examination of the internal processes driving implicit reasoning.
Our empirical analysis begins with a behavioral study conducted under fine-grained experimental control (Section 2). Under a complete training configuration, we observe that multi-hop implicit reasoning emerges in three distinct stages: memorization, in-distribution generalization, and finally cross-distribution generalization. Through ablation studies, we further demonstrate that while exposure to in-distribution (ID) triples is not strictly necessary for achieving in-distribution generalization, its absence significantly delays the onset of this behavior. Additionally, we find that generalization to second-hop queries fails unless the model encounters exact compositional structures during training, revealing a strong dependency on query-level exposure.
These behavioral insights reveal previously unreported patterns, motivating us to revisit and probe the internal mechanisms of implicit reasoning. In Section 3, we first use cross-query semantic patching to localize intermediate entity representations, typically identifying them within the middle layers corresponding to the r1 tokens. We then test the common assumption that intermediate entities are explicitly decodable from internal states and find this assumption inconsistent with our observed reasoning behavior. This disconnect leads us to adopt a geometric perspective, wherein successful reasoning strongly correlates with consistent clustering of intermediate representations within cosine similarity space (Figure 1).
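As a concrete illustration, the sketch below shows one way cross-query semantic patching can be implemented on a small decoder-only transformer of the kind trained in our setup. The architecture, module names, chosen layer, and patched position (the r1 token) are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Illustrative sketch of cross-query semantic patching (hypothetical module
# names and dimensions; the patched layer and r1 token position are assumptions).
import torch
import torch.nn as nn

class TinyReasoner(nn.Module):
    """A minimal transformer stack standing in for the model trained from scratch."""
    def __init__(self, vocab_size=2000, d_model=128, n_layers=8, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers, enable_nested_tensor=False)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # A causal attention mask is used in the actual setup; omitted here for brevity.
        return self.lm_head(self.blocks(self.embed(tokens)))

def cross_query_patch(model, src_query, tgt_query, layer_idx, r1_pos):
    """Capture the hidden state at the r1 position of src_query after layer_idx,
    inject it at the same position while running tgt_query, and return the
    patched prediction for the answer token."""
    cache = {}
    layer = model.blocks.layers[layer_idx]

    def capture(_module, _inputs, output):
        cache["h"] = output.detach()

    def inject(_module, _inputs, output):
        patched = output.clone()
        patched[:, r1_pos, :] = cache["h"][:, r1_pos, :]
        return patched  # returning a value from a forward hook replaces the output

    handle = layer.register_forward_hook(capture)
    with torch.no_grad():
        model(src_query)                      # donor run
    handle.remove()

    handle = layer.register_forward_hook(inject)
    with torch.no_grad():
        logits = model(tgt_query)             # recipient run with the patched state
    handle.remove()
    return logits[:, -1, :].argmax(dim=-1)
```

In this sketch, patching counts as semantically successful when the recipient query's prediction shifts to the answer implied by the donor query's bridge entity, rather than merely perturbing the final output.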
In Section 4, we close the loop by explicitly connecting these internal representational mechanisms to external behavioral patterns. We demonstrate that successful generalization robustly correlates with the clustering structure of intermediate representations across diverse queries and training distributions. Although in-distribution (ID) triple supervision is not required to induce this clustering, it substantially accelerates its emergence by constraining the representational space early in training. Finally, we identify that what appears to be first-hop generalization to out-of-distribution (OOD) triples is actually an artifact arising from representational alignments induced by ID exposure, highlighting the fragile and data-dependent nature of implicit generalization.
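The correlation between clustering and generalization can be quantified with a simple score. The sketch below assumes we have already collected r1-position hidden states for a batch of queries, grouped by the ground-truth bridge entity; the function and variable names are ours and purely illustrative.

```python
# Hypothetical sketch of the cosine-based representational lens: compare the
# average cosine similarity of r1-position states within a bridge-entity group
# against the similarity across groups.
import torch
import torch.nn.functional as F

def clustering_score(states: torch.Tensor, bridge_ids: torch.Tensor) -> float:
    """states: (num_queries, d_model) hidden states at the r1 position.
    bridge_ids: (num_queries,) ids of the ground-truth bridge entity e2.
    Returns within-group minus between-group mean cosine similarity; higher
    values indicate tighter clustering by bridge entity."""
    sims = F.cosine_similarity(states.unsqueeze(1), states.unsqueeze(0), dim=-1)
    same = bridge_ids.unsqueeze(1) == bridge_ids.unsqueeze(0)
    off_diag = ~torch.eye(len(states), dtype=torch.bool)
    within = sims[same & off_diag].mean()
    between = sims[~same].mean()
    return (within - between).item()
```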
2.1 Data Construction: Fine-Grained Control for Compositional Reasoning
To enable fine-grained behavioral analysis, we extend the symbolic reasoning setup of Wang et al. [25] with finer-grained, query-level control configurations. The data comprises atomic triples and compositional queries:
• Atomic Triples. Each atomic fact is represented as a triple (e1, r1) → e2. This formulation mimics simple factual relations such as (Alice, mother-of) → Beth and (Beth, sister-of) → Carol, serving as the atomic unit of the reasoning environment. The triples are partitioned into two subsets: In-Distribution (ID) Triples are used in both standalone form and as components of multi-hop training queries; Out-of-Distribution (OOD) Triples appear in training data only in standalone form, and are excluded from multi-hop composition, enabling the creation of test queries involving out-of-distribution reasoning. Note that ID and OOD triples share the same set of entities and relations.
• 2-Hop Queries. Each reasoning task takes the form of a compositional chain (e1, r1, r2) → e3, where the model performs implicit reasoning over a bridge entity e2. For instance, the model receives only the compositional query (Alice, mother-of, sister-of) and is expected to predict the correct target Carol, implicitly reasoning through the intermediate entity Beth. We distinguish: Test-OI: test queries where the first hop comes from an OOD triple and the second hop from an ID triple; Train-II: queries with both hops from ID triples used during training. Other query types, such as Test-II, Test-IO, and Test-OO, follow similar definitions. A minimal construction sketch is given below.
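The following sketch illustrates the construction described above; entity and relation counts, split ratios, and variable names are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Minimal sketch of the symbolic data construction: atomic facts with an ID/OOD
# partition, and 2-hop queries labelled by the distribution of each hop.
import random

def build_dataset(num_entities=200, num_relations=20, ood_frac=0.2, seed=0):
    rng = random.Random(seed)
    entities = list(range(num_entities))
    relations = list(range(num_relations))

    # Atomic facts: for each (e1, r1) pair, sample a target entity e2.
    facts = {(e1, r1): rng.choice(entities) for e1 in entities for r1 in relations}

    # Partition atomic triples into ID and OOD; both share entities and relations.
    keys = list(facts)
    rng.shuffle(keys)
    cut = int(len(keys) * (1 - ood_frac))
    id_keys, ood_keys = set(keys[:cut]), set(keys[cut:])

    def hop_type(key):
        return "I" if key in id_keys else "O"

    # 2-hop queries (e1, r1, r2) -> e3, labelled by first/second hop distribution.
    queries = {"II": [], "OI": [], "IO": [], "OO": []}
    for (e1, r1), e2 in facts.items():
        for r2 in relations:
            e3 = facts[(e2, r2)]
            label = hop_type((e1, r1)) + hop_type((e2, r2))
            queries[label].append(((e1, r1, r2), e3))

    # Only II compositions enter training (Train-II); held-out II queries form
    # Test-II, and OOD triples appear in training only as standalone atomic facts.
    rng.shuffle(queries["II"])
    split = int(0.9 * len(queries["II"]))
    return {
        "atomic_id": [(k, facts[k]) for k in id_keys],
        "atomic_ood": [(k, facts[k]) for k in ood_keys],
        "train_ii": queries["II"][:split],
        "test_ii": queries["II"][split:],
        "test_oi": queries["OI"],
        "test_io": queries["IO"],
        "test_oo": queries["OO"],
    }
```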
Phase I: Memorization. The initial stage involves quickly fitting the training data, including atomic facts and 2-hop compositions. The model memorizes these facts, but generalization to unseen queries remains minimal.
Phase II: ID Generalization. After memorization saturates, the model begins to generalize to Test-II queries (unseen ID-ID compositions), marking a shift from memorization to compositional generalization within ID, akin to the grokking phenomenon described by Wang et al. [25].
Phase III: Cross-Distribution Reasoning. The model next learns to generalize across distributions, gradually handling queries whose first hop comes from OOD triples while the second hop remains in-distribution. This transition is slower than Phase II and requires more training. Building on the grokking phenomenon, our analysis uncovers this additional phase of generalization across distributional boundaries. Interestingly, generalization fails consistently when the second hop comes from OOD triples, revealing a stronger bottleneck in the second relational step. These phases show that reasoning develops in structured stages, each with distinct patterns of success and failure, highlighting the need to treat reasoning not as a monolithic ability, but as a set of behaviors with separable developmental conditions.
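In practice, the three phases can be read off by evaluating training checkpoints on each query split and tracking when per-split accuracy rises. The sketch below is a hypothetical tracker; the query encoding and checkpoint format are assumptions for illustration.

```python
# Hypothetical sketch of tracking the three phases: evaluate each checkpoint on
# every split and observe when accuracy on that split begins to rise.
import torch

@torch.no_grad()
def split_accuracy(model, queries, answers):
    """queries: (N, 3) tensor of (e1, r1, r2) token ids; answers: (N,) tensor of e3 ids."""
    preds = model(queries)[:, -1, :].argmax(dim=-1)
    return (preds == answers).float().mean().item()

def track_phases(model, checkpoints, splits):
    """checkpoints: iterable of (step, state_dict); splits: name -> (queries, answers).
    Phase I: only Train-II accuracy is high; Phase II: Test-II rises; Phase III: Test-OI rises."""
    history = []
    for step, state in checkpoints:
        model.load_state_dict(state)
        row = {name: split_accuracy(model, q, a) for name, (q, a) in splits.items()}
        history.append((step, row))
    return history
```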
2.3 ID Triples Are Not Required for ID Generalization—but Accelerate It
Prior work has repeatedly observed that while models can correctly answer individual atomic triples, they often fail to generalize to 2-hop queries constructed by composing those same triples [3, 31, 42].
Our study, conducted in a controlled symbolic environment, reveals key insights into the mechanisms of implicit reasoning in transformers, highlighting specific patterns and behaviors that clarify how multi-hop implicit reasoning emerges. These findings may help answer open questions about the implicit reasoning capabilities of LLMs. For instance, our observation that second-hop generalization requires query-level exposure offers a potential explanation for why knowledge learned from single-hop tasks does not easily transfer to multi-hop reasoning in LLMs [3, 39, 31]. However, it is important to note that LLMs operate with far richer and more complex knowledge bases, and their internal knowledge interaction mechanisms likely differ from those in our controlled environment. Therefore, while our findings offer useful insights, they should be regarded as preliminary guidance rather than a complete explanation of the reasoning dynamics in LLMs.