LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Can models reason without generating visible thinking tokens?

Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.

Note · 2026-02-22 · sourced from Reasoning Architectures

The mainstream approach to test-time scaling requires the model to verbalize intermediate reasoning steps — producing tokens that represent thoughts before producing an answer. Several architectures challenge this assumption from different angles and converge on the same implication: verbalization is a historical artifact of training constraints, not a necessity for reasoning.

Latent depth-recurrent reasoning: A recurrent block is added to a transformer and iterated at inference time for an arbitrary number of steps. The model "thinks" by updating its hidden state repeatedly before producing any output token. Advantages: (1) no specialized training data required — the model trains with a variable compute budget on standard data; (2) less memory than CoT models, which need long context windows; (3) per-token adaptive compute, where difficult tokens get more recurrent iterations; (4) as model parameter count decreases, FLOPs per parameter increase — enabling high compute utilization on smaller models. The architecture naturally supports early stopping via KL-divergence convergence detection.
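The iterate-until-convergence loop with KL-based early stopping can be sketched in a few lines. Everything here is a toy stand-in — the contraction update, the tiny "vocabulary," and the threshold are illustrative, not the paper's actual recurrent block:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL divergence between two discrete distributions (both strictly positive)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def recurrent_reason(x, max_iters=64, eps=1e-6):
    """Iterate a toy recurrent block on hidden state h; stop early once the
    readout distribution stops changing (KL between consecutive steps < eps)."""
    h = [0.0] * len(x)
    prev = None
    for t in range(1, max_iters + 1):
        # toy recurrent update: a contraction pulling h toward the input "thought"
        h = [0.5 * hi + 0.5 * xi for hi, xi in zip(h, x)]
        p = softmax(h)  # readout distribution over a toy vocabulary
        if prev is not None and kl(p, prev) < eps:
            return p, t  # converged: more iterations would not change the answer
        prev = p
    return p, max_iters
```

The point of the sketch is the control flow: compute is spent per input until the hidden state settles, rather than being fixed at one forward pass per token.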

Heima (Hidden LLaMA): Each intermediate CoT step is compressed into a compact higher-level hidden representation using a single "thinking token." An adaptive decoder reconstructs variable-length textual sequences from the thinking tokens, enabling interpretability without verbosity. The model encodes each CoT step but doesn't need to generate all the intermediate tokens at inference time.
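The compression idea can be illustrated schematically. Mean pooling below is a deliberately crude stand-in for Heima's learned compression (the real model trains a special token whose hidden state summarizes the step); the shapes are what matter — many token embeddings in, one "thinking token" vector out per step:

```python
def pool_step(token_vecs):
    """Compress one CoT step (a list of token embedding vectors) into a single
    'thinking token' vector. Mean pooling is a toy stand-in for the learned
    compression in Heima."""
    d = len(token_vecs[0])
    n = len(token_vecs)
    return [sum(v[i] for v in token_vecs) / n for i in range(d)]

def compress_cot(steps):
    """steps: list of CoT steps, each a list of token embeddings.
    Returns one compact vector per step -- the sequence the model actually
    processes shrinks from sum(len(s) for s in steps) tokens to len(steps)."""
    return [pool_step(s) for s in steps]
```

An adaptive decoder (not sketched here) would map each pooled vector back to a variable-length textual step on demand, which is what preserves interpretability without paying the token cost at inference.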

The synthesis point: both architectures suggest that the requirement that "expensive internal reasoning must always be projected down to a single verbalized next token" appears wasteful (Latent Depth paper). Continuous latent space can explore multiple reasoning directions simultaneously, without the linear sequential structure that token generation imposes.

This challenges "Does more thinking time actually improve LLM reasoning?" from an unexpected direction — the myth assumes verbalized tokens are the unit of thinking; latent reasoning questions whether tokens should be the unit at all.

The connection to human cognition is philosophically interesting: "a substantial amount of thought happens through complex, recurrent firing patterns in the brain, before the first word of an answer is uttered." Latent reasoning may capture facets of human reasoning (spatial thinking, physical intuition) that resist verbalization, which current verbalized CoT approaches cannot access by design.

Coconut (Chain of Continuous Thought): A third approach feeds the last hidden state back as the next input embedding directly in continuous space, bypassing the language model head and embedding layer entirely. Continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform breadth-first search (BFS) naturally — rather than committing to a single deterministic path like CoT. Coconut outperforms CoT on logical reasoning tasks requiring substantial backtracking. The neuroscience grounding is direct: neuroimaging studies consistently show that the language network remains largely inactive during reasoning tasks, and language appears optimized for communication rather than reasoning. This suggests verbalized CoT forces reasoning through a communication channel it was never designed for. The CoT unfaithfulness literature reinforces this: even when models generate explicit reasoning chains, they may use a different latent reasoning process internally.
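The contrast between verbalized and continuous rollouts reduces to where the feedback edge sits. In the toy sketch below (the two-dimensional "transformer" step, three-token vocabulary, and embeddings are all invented for illustration), CoT snaps the state to the nearest token each step, while the Coconut-style loop feeds the hidden state straight back:

```python
VOCAB = ["A", "B", "C"]
EMBED = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.7, 0.7]}

def step(x):
    """Toy 'transformer' update: mix the two state dimensions."""
    return [0.9 * x[0] + 0.1 * x[1], 0.1 * x[0] + 0.9 * x[1]]

def head(h):
    """Toy language-model head: dot-product logits against each embedding."""
    return [sum(hi * ei for hi, ei in zip(h, EMBED[t])) for t in VOCAB]

def cot_rollout(x, n):
    """Verbalized reasoning: each step is discretized to one token and
    re-embedded -- any information outside the argmax token is discarded."""
    for _ in range(n):
        h = step(x)
        logits = head(h)
        tok = VOCAB[max(range(len(VOCAB)), key=lambda i: logits[i])]
        x = EMBED[tok]
    return x

def coconut_rollout(x, n):
    """Continuous thought: the hidden state is fed back directly,
    bypassing the head and the embedding table entirely."""
    for _ in range(n):
        x = step(x)
    return x
```

In this toy, the CoT rollout gets stuck re-emitting the same token because discretization keeps resetting the state, while the continuous rollout keeps accumulating a mixture of both dimensions — a crude analogue of the claim that continuous thought can hold several alternative next steps in superposition.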

Hierarchical Reasoning Model (HRM): A fourth distinct latent reasoning architecture adds brain-inspired multi-timescale processing. HRM couples a slow high-level module (abstract planning) with a fast low-level module (detailed computation) in hierarchical recurrence. The fast module reaches equilibrium, then the slow module advances — "hierarchical convergence" avoids the premature convergence of standard recurrence. With only 27M parameters and 1000 samples (no pretraining, no CoT), HRM achieves near-perfect accuracy on Sudoku-Extreme and 30×30 maze pathfinding — tasks where CoT methods completely fail (0% accuracy). Uses an O(1)-memory gradient approximation at equilibrium, avoiding BPTT entirely. See "Can recurrent hierarchies achieve reasoning that transformers cannot?".
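The two-timescale control flow is the architectural core, and it can be sketched with scalar states (the update rules and coefficients below are invented toys, not HRM's modules; gradients — which HRM approximates only at equilibrium — are omitted entirely):

```python
def hrm(x, cycles=4, max_fast=100, tol=1e-6):
    """Toy hierarchical recurrence: the fast (low-level) state relaxes to a
    fixed point conditioned on the frozen slow (high-level) state; only then
    does the slow state take one step ('hierarchical convergence')."""
    h_hi, h_lo = 0.0, 0.0
    for _ in range(cycles):
        # fast module: iterate to equilibrium under the current high-level plan
        for _ in range(max_fast):
            nxt = 0.5 * h_lo + 0.25 * h_hi + 0.25 * x
            if abs(nxt - h_lo) < tol:
                h_lo = nxt
                break
            h_lo = nxt
        # slow module: one abstract-planning step using the converged fast state
        h_hi = 0.5 * h_hi + 0.5 * h_lo
    return h_hi, h_lo
```

Because the fast state is re-equilibrated after every slow update, the combined system keeps making progress across cycles instead of settling into one global fixed point early — the failure mode "hierarchical convergence" is meant to avoid.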

Theoretical consolidation: These converging architectures now have a formal theoretical framework. Since "Where does LLM reasoning actually happen during generation?", the depth-recurrent, Heima, Coconut, HRM, and energy-based approaches all constitute evidence for H1 (latent-state trajectories as the primary reasoning medium). The framework also clarifies why these approaches work: if reasoning is fundamentally a latent-state process, then architectures that operate directly in latent space are working with the native medium rather than forcing it through the bottleneck of discrete verbalization. Furthermore, since "Can we trigger reasoning without explicit chain-of-thought prompts?", the latent reasoning capability exists even in standard transformer architectures — specialized latent architectures may be optimizing the medium rather than creating a new capability.

Practical constraint on retrofitting: A critical caveat for deployment: "Can continuous reasoning avoid forgetting in instruction-tuned models?" shows that fine-tuning already-capable instruction-tuned models for continuous reasoning via Coconut/CCoT methods causes catastrophic forgetting. This limits the Coconut approach to training-from-scratch scenarios and motivates frozen-backbone alternatives for enhancing existing models.


Source: Reasoning Architectures, Novel Architectures, Cognitive Models Latent

latent reasoning in continuous space scales test-time compute without verbalized tokens or specialized training data