Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

Paper · arXiv 2604.07822 · Published April 9, 2026
Novel Architectures · Reasoning · o1 · o3 · Search · Inference-time scaling

Large language models (LLMs) (Brown et al., 2020) are known to acquire substantial factual knowledge during pretraining, storing it in their parameters (Geva et al., 2023). However, how effectively this knowledge can be composed for reasoning remains less well understood (Dziri et al., 2023; Press et al., 2023). In particular, recent work shows that transformer-based LLMs struggle with implicit reasoning, i.e., reasoning within a single forward pass without explicit chain-of-thought (CoT) (Wei et al., 2022). Such failures reveal a fundamental limitation of transformers: despite storing rich knowledge, they are often unable to flexibly combine it to solve novel questions. This limitation has important implications for generalization, as many tasks require composing multiple pieces of seen knowledge in novel ways not observed during training (Lake & Baroni, 2018; Berglund et al., 2023).

Why do transformers struggle to combine their parametric knowledge in implicit reasoning? Consider a query such as “The spouse of the performer of Imagine is”. Previous work shows that transformers solve this by chaining two facts: first retrieving that the performer of Imagine is John Lennon in shallow layers, and then that the spouse of John Lennon is Yoko Ono in deeper layers (Biran et al., 2024; Wang et al., 2024a; Yang et al., 2024b). However, because knowledge is distributed across different layers of the transformer, there is no guarantee that the fact a query requires is accessible at the depth where it is needed. For example, if the fact “the spouse of John Lennon is Yoko Ono” is stored only in shallow layers, deeper layers cannot access it, because parameters are not shared across layers. While transformers can be trained to combine such knowledge properly (Wang et al., 2024a; Yao et al., 2025), they fail to generalize compositionally to unfamiliar combinations or to deeper recursive compositions.

To address this limitation, we introduce depth recurrence into transformers, allowing the same set of layers to be applied iteratively: the input sequence is processed multiple times by a shared transformer block, with the output of each iteration serving as the input to the next. In contrast to vanilla transformers, where knowledge is tied to specific layers, recurrence enables more flexible access to and composition of parametric knowledge within a single forward pass. Such models, known as recurrent-depth transformers or looped transformers, have recently gained attention as a promising architecture (Dehghani et al., 2019; Geiping et al., 2025; Zhu et al., 2025). While prior work has shown that recurrent-depth transformers improve length generalization (Bansal et al., 2022; Fan et al., 2025), it remains unclear whether they can overcome these compositional generalization limitations when reasoning over parametric knowledge.
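
To make the looped forward pass concrete, the following is a minimal PyTorch sketch of the architecture described above, under assumptions of our own: the class and hyperparameter names (`RecurrentDepthTransformer`, `n_loops`, `d_model`, and so on) are illustrative rather than the paper's implementation, and design details such as whether the input embedding is re-injected at each iteration vary across the cited works.

```python
import torch
import torch.nn as nn

class RecurrentDepthTransformer(nn.Module):
    """A minimal looped transformer: one weight-shared block applied n_loops times."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_loops=8, max_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        # A single transformer block whose parameters are reused at every
        # iteration, unlike a vanilla transformer with distinct per-layer weights.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.n_loops = n_loops
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, n_loops=None):
        # The loop count can be raised at inference time to spend more compute.
        n_loops = self.n_loops if n_loops is None else n_loops
        T = token_ids.size(1)
        causal = torch.triu(  # standard causal mask for next-token prediction
            torch.full((T, T), float("-inf"), device=token_ids.device), diagonal=1
        )
        h = self.embed(token_ids) + self.pos(torch.arange(T, device=token_ids.device))
        for _ in range(n_loops):
            h = self.shared_block(h, src_mask=causal)  # each output feeds the next iteration
        return self.head(h)
```

Because `shared_block` is the only block, any fact encoded in its weights is reachable at every iteration rather than only at one fixed depth, which is exactly the flexibility that layer-tied knowledge storage in vanilla transformers lacks.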

In this paper, we systematically study whether recurrent-depth transformers can implicitly and compositionally combine their parametric knowledge. By constructing synthetic datasets, we train models to learn implicit reasoning from scratch. Unlike LLMs trained on vast, opaque web-scale corpora, this setup provides full control over the data and mitigates confounding biases introduced during pretraining. Specifically, we characterize two challenges: systematic generalization (composing facts that were never combined with one another during training) and depth extrapolation (e.g., training on 5-hop reasoning and evaluating on 10-hop).
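
As a sketch of what such a controlled setup can look like (the paper's exact data format is not reproduced here, so the entity/relation counts and the integer query encoding below are assumptions), atomic facts can be drawn as random functions over an entity set, and k-hop queries as their compositions:

```python
import random

def build_facts(n_entities=100, n_relations=20, seed=0):
    # Atomic knowledge: facts[r][e] = e2 encodes "the r of e is e2".
    # Vocabulary sizes and the integer encoding are illustrative assumptions.
    rng = random.Random(seed)
    return [
        {e: rng.randrange(n_entities) for e in range(n_entities)}
        for _ in range(n_relations)
    ]

def khop_example(facts, k, rng):
    # Sample a k-hop compositional query, analogous to
    # "the spouse of the performer of Imagine is ...".
    e0 = rng.randrange(len(facts[0]))
    rels = [rng.randrange(len(facts)) for _ in range(k)]
    answer = e0
    for r in rels:  # chain the atomic facts hop by hop
        answer = facts[r][answer]
    # The surface form lists relations outermost-first, mirroring natural
    # language; a real tokenizer would give relations and entities disjoint IDs.
    query = list(reversed(rels)) + [e0]
    return query, answer
```

Under this kind of setup, a systematic-generalization split holds out particular fact combinations from all training chains while keeping every atomic fact seen in isolation, and a depth-extrapolation split trains on small k (e.g., up to 5 hops) and evaluates at larger k (e.g., 10 hops).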

Our main findings are two-fold. First, recurrent-depth transformers exhibit strong systematic generalization, while vanilla transformers fail to do so. We show that this ability emerges through a sharp three-stage grokking process that transitions from memorization to in-distribution generalization, and finally to systematic generalization, and we support this with evidence from the models' internal activations across training stages. Second, recurrent-depth transformers enable depth extrapolation, generalizing to reasoning depths beyond those observed during training as inference-time compute (i.e., the number of recurrent iterations) increases. We further find that the training-time recurrence strategy plays a critical role in extrapolation performance, with dynamic recurrence achieving the strongest generalization. Despite these gains, we identify a key limitation: recurrent-depth transformers suffer from overthinking (Bansal et al., 2022), which degrades performance and limits generalization to extremely deep recursions.
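
The depth-extrapolation and overthinking observations can be probed with a simple sweep over the inference-time loop count. The sketch below assumes the `RecurrentDepthTransformer` interface from the earlier snippet, not the paper's evaluation code:

```python
import torch

@torch.no_grad()
def accuracy_vs_loops(model, queries, answers, loop_counts=(4, 8, 16, 32, 64)):
    # Evaluate a trained looped model while sweeping the recurrence count past
    # the training-time value; `queries` is (batch, seq_len), `answers` is (batch,).
    model.eval()
    results = {}
    for n in loop_counts:
        logits = model(queries, n_loops=n)       # (batch, seq_len, vocab)
        preds = logits[:, -1, :].argmax(dim=-1)  # answer read out at the final position
        results[n] = (preds == answers).float().mean().item()
    return results
```

In such a sweep, depth extrapolation appears as accuracy holding up at loop counts beyond the training value, and overthinking as the drop at very large counts. One natural implementation of the dynamic-recurrence training strategy, though the exact sampling scheme here is our assumption, is to draw `n_loops` at random per training batch rather than fixing it.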