LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Can recurrent hierarchies achieve reasoning that transformers cannot?

Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This note explores whether architectural design, rather than scale, enables true algorithmic reasoning.

Note · 2026-02-23 · sourced from Novel Architectures

The Hierarchical Reasoning Model (HRM) is a recurrent architecture with two coupled modules: a high-level (H) module for slow, abstract planning and a low-level (L) module for fast, detailed computation. The key mechanism is "hierarchical convergence" — the fast L-module completes multiple computational steps and reaches local equilibrium, then the slow H-module advances, and L is reset for a new phase. This avoids the rapid premature convergence that plagues standard recurrent models.
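The two-timescale loop above can be sketched in a few lines. This is an illustrative toy only: the real HRM modules are learned transformer blocks, whereas `f_L` and `f_H` here are hypothetical contractive tanh maps chosen so the fast module visibly settles each phase. All names (`hrm_forward`, `n_phases`, `inner_steps`) are my own, not from the paper.

```python
import numpy as np

# Toy sketch of hierarchical convergence: the fast L-module iterates to a
# local equilibrium, then the slow H-module advances and L is reset.
rng = np.random.default_rng(0)
d = 8
W_L = 0.5 * rng.standard_normal((d, d)) / np.sqrt(d)  # contractive L-dynamics
W_H = 0.5 * rng.standard_normal((d, d)) / np.sqrt(d)  # slow H-dynamics

def f_L(z_L, z_H, x):
    """Fast low-level update, conditioned on the slow state and the input."""
    return np.tanh(W_L @ z_L + z_H + x)

def f_H(z_H, z_L):
    """Slow high-level update, consuming the converged L-state."""
    return np.tanh(W_H @ z_H + z_L)

def hrm_forward(x, n_phases=4, inner_steps=16):
    z_H = np.zeros(d)
    for _ in range(n_phases):
        z_L = np.zeros(d)             # reset L to start a new phase
        for _ in range(inner_steps):  # L runs many steps toward equilibrium
            z_L = f_L(z_L, z_H, x)
        z_H = f_H(z_H, z_L)           # H advances once per phase
    return z_H

out = hrm_forward(rng.standard_normal(d))
print(out.shape)  # (8,)
```

The reset of `z_L` at each phase boundary is the point: each H-step poses a fresh sub-problem for L, so total effective depth is `n_phases × inner_steps` without L collapsing to a single global fixed point.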

The results are striking. With only 27 million parameters and 1,000 training examples (no pre-training, no CoT data), HRM achieves near-perfect accuracy on Sudoku-Extreme Full and optimal pathfinding in 30×30 mazes, tasks where state-of-the-art CoT methods achieve 0% accuracy. It also outperforms much larger models with significantly longer context windows on ARC, a key AGI benchmark.

The architecture is brain-inspired: the human brain organizes computation hierarchically across cortical regions operating at different timescales. Recurrent feedback loops iteratively refine representations — slow higher-level areas guide, fast lower-level circuits execute. The brain achieves this depth without backpropagation through time.

HRM mirrors this with an O(1) memory gradient approximation. Because each recurrent module converges to a fixed point, gradients can be computed at equilibrium in a single step rather than unrolling through time. The gradient path is: output head → final H-state → final L-state → input embedding. No BPTT, no O(T) memory. This aligns with neuroscience evidence that cortical credit assignment uses short-range, temporally local mechanisms.

The deeper implication: standard Transformers are "paradoxically shallow" despite deep learning's founding principle of stacking layers. Their fixed depth places them in AC0/TC0 complexity classes — they are not Turing-complete and cannot execute complex algorithmic reasoning in a purely end-to-end manner. HRM's hierarchical recurrence escapes this constraint by achieving effectively unbounded computational depth.

This extends "Can models reason without generating visible thinking tokens?" with a third distinct architecture beyond depth-recurrent and Heima — one that introduces hierarchical multi-timescale processing rather than uniform recurrence.


hierarchical dual-recurrence achieves effective computational depth that standard transformers cannot — enabling latent reasoning without chain of thought