LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Can recurrent hierarchies achieve reasoning that transformers cannot?

Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This note explores whether architectural design, rather than scale, enables true algorithmic reasoning.

Note · 2026-02-23 · sourced from Novel Architectures

The Hierarchical Reasoning Model (HRM) is a recurrent architecture with two coupled modules: a high-level (H) module for slow, abstract planning and a low-level (L) module for fast, detailed computation. The key mechanism is "hierarchical convergence" — the fast L-module completes multiple computational steps and reaches local equilibrium, then the slow H-module advances, and L is reset for a new phase. This avoids the rapid premature convergence that plagues standard recurrent models.
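The two-timescale loop above can be sketched in a few lines. This is an illustrative toy only: the real HRM modules are learned transformer blocks, whereas `f_L` and `f_H` here are hypothetical contractive tanh maps chosen so the fast module visibly settles each phase. All names (`hrm_forward`, `n_phases`, `inner_steps`) are my own, not from the paper.

```python
import numpy as np

# Toy sketch of hierarchical convergence: the fast L-module iterates to a
# local equilibrium, then the slow H-module advances and L is reset.
rng = np.random.default_rng(0)
d = 8
W_L = 0.5 * rng.standard_normal((d, d)) / np.sqrt(d)  # contractive L-dynamics
W_H = 0.5 * rng.standard_normal((d, d)) / np.sqrt(d)  # slow H-dynamics

def f_L(z_L, z_H, x):
    """Fast low-level update, conditioned on the slow state and the input."""
    return np.tanh(W_L @ z_L + z_H + x)

def f_H(z_H, z_L):
    """Slow high-level update, consuming the converged L-state."""
    return np.tanh(W_H @ z_H + z_L)

def hrm_forward(x, n_phases=4, inner_steps=16):
    z_H = np.zeros(d)
    for _ in range(n_phases):
        z_L = np.zeros(d)             # reset L to start a new phase
        for _ in range(inner_steps):  # L runs many steps toward equilibrium
            z_L = f_L(z_L, z_H, x)
        z_H = f_H(z_H, z_L)           # H advances once per phase
    return z_H

out = hrm_forward(rng.standard_normal(d))
print(out.shape)  # (8,)
```

The reset of `z_L` at each phase boundary is the point: each H-step poses a fresh sub-problem for L, so total effective depth is `n_phases × inner_steps` without L collapsing to a single global fixed point.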

The results are striking. With only 27 million parameters and 1,000 training examples (no pre-training, no CoT data), HRM achieves near-perfect accuracy on Sudoku-Extreme Full and optimal pathfinding in 30×30 mazes, tasks where state-of-the-art CoT methods achieve 0% accuracy. It also outperforms much larger models with significantly longer context windows on ARC, a key AGI benchmark.

The architecture is brain-inspired: the human brain organizes computation hierarchically across cortical regions operating at different timescales. Recurrent feedback loops iteratively refine representations — slow higher-level areas guide, fast lower-level circuits execute. The brain achieves this depth without backpropagation through time.

HRM mirrors this with an O(1) memory gradient approximation. Because each recurrent module converges to a fixed point, gradients can be computed at equilibrium in a single step rather than unrolling through time. The gradient path is: output head → final H-state → final L-state → input embedding. No BPTT, no O(T) memory. This aligns with neuroscience evidence that cortical credit assignment uses short-range, temporally local mechanisms.

The deeper implication: standard Transformers are "paradoxically shallow" despite deep learning's founding principle of stacking layers. Their fixed depth places them in AC0/TC0 complexity classes — they are not Turing-complete and cannot execute complex algorithmic reasoning in a purely end-to-end manner. HRM's hierarchical recurrence escapes this constraint by achieving effectively unbounded computational depth.

This extends "Can models reason without generating visible thinking tokens?" with a third distinct architecture beyond depth-recurrent and Heima — one that introduces hierarchical multi-timescale processing rather than uniform recurrence.


hierarchical dual-recurrence achieves effective computational depth that standard transformers cannot — enabling latent reasoning without chain of thought