Can models reason without generating visible thinking tokens?
Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
The mainstream approach to test-time scaling requires the model to verbalize intermediate reasoning steps — producing tokens that represent thoughts before producing an answer. Two architectures challenge this assumption from different angles and converge on the same implication: verbalization is a historical artifact of training constraints, not a necessity for reasoning.
Latent depth-recurrent reasoning: A recurrent block is added to a transformer and iterated at inference time for an arbitrary number of steps. The model "thinks" by updating its hidden state repeatedly before producing any output token. Advantages: (1) no specialized training data is required, since the model trains with a variable compute budget on standard data; (2) less memory than CoT models, which need long context windows to hold their verbalized chains; (3) per-token adaptive compute, where difficult tokens receive more recurrent iterations; (4) because the same block is reused across iterations, a small parameter count still yields many FLOPs per parameter, enabling high compute utilization on smaller models. The architecture naturally supports early stopping via KL-divergence convergence detection between successive iterations.
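A minimal sketch of the inference loop follows. It assumes an invented RecurrentBlock, a toy lm_head, and an arbitrary kl_threshold; none of these are the paper's exact components, but the loop shows how variable per-token compute and KL-based early stopping fit together.

```python
# Illustrative sketch only: a tiny depth-recurrent loop with early exit when the
# output distribution stops changing between iterations (KL-divergence test).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentBlock(nn.Module):
    """Toy stand-in for the shared block that is iterated at inference time."""
    def __init__(self, d_model: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, state: torch.Tensor, injected: torch.Tensor) -> torch.Tensor:
        # The token's input embedding is re-injected at every iteration so the
        # latent state keeps "seeing" the problem while it refines its answer.
        return state + self.ff(torch.cat([state, injected], dim=-1))

@torch.no_grad()
def think_in_latent_space(block, lm_head, injected, max_steps=32, kl_threshold=5e-4):
    """Iterate the block on one token's latent state; stop once it converges."""
    state = torch.zeros_like(injected)            # initial latent state
    prev_logp = None
    for step in range(max_steps):                 # variable compute budget
        state = block(state, injected)
        logp = F.log_softmax(lm_head(state), dim=-1)
        if prev_logp is not None:
            kl = F.kl_div(logp, prev_logp, log_target=True, reduction="batchmean")
            if kl < kl_threshold:                 # converged; hard tokens run longer
                break
        prev_logp = logp
    return state, step + 1

# Hypothetical usage with toy dimensions:
# block, lm_head = RecurrentBlock(64), nn.Linear(64, 1000)
# state, n_iters = think_in_latent_space(block, lm_head, torch.randn(1, 64))
```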
Heima (Hidden LLaMA): Each intermediate CoT step is compressed into a compact higher-level hidden representation using a single "thinking token." An adaptive decoder reconstructs variable-length textual sequences from the thinking tokens, enabling interpretability without verbosity. The model encodes each CoT step but doesn't need to generate all the intermediate tokens at inference time.
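A toy sketch of the compression mechanism is below. The real Heima builds on a large multimodal LLM; the GRU encoder and decoder here are invented stand-ins meant only to show one CoT step collapsing into a single hidden vector and being expanded back on demand.

```python
# Toy illustration (not Heima's actual architecture): one verbalized CoT step is
# collapsed into a single hidden vector ("thinking token"), and a small decoder
# can expand it back into a variable-length textual explanation when needed.
import torch
import torch.nn as nn

class ThinkingTokenCompressor(nn.Module):
    def __init__(self, vocab_size: int = 1000, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, cot_step_ids: torch.Tensor) -> torch.Tensor:
        # Encode one CoT step and keep only the final hidden state, which plays
        # the role of that step's single thinking token.
        _, h_n = self.encoder(self.embed(cot_step_ids))
        return h_n[-1]                              # [batch, d_model]

class AdaptiveDecoder(nn.Module):
    """Reconstructs a variable-length token sequence from one thinking token."""
    def __init__(self, vocab_size: int = 1000, d_model: int = 64):
        super().__init__()
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, thinking_token: torch.Tensor, max_len: int = 16):
        state = thinking_token.unsqueeze(0)         # [1, batch, d_model]
        inp = thinking_token.unsqueeze(1)           # seed input: [batch, 1, d_model]
        step_logits = []
        for _ in range(max_len):
            out, state = self.decoder(inp, state)
            step_logits.append(self.out(out))
            inp = out                               # feed the hidden output back in
        return torch.cat(step_logits, dim=1)        # [batch, max_len, vocab_size]

# Hypothetical usage: ids = torch.randint(0, 1000, (1, 12))
# token = ThinkingTokenCompressor()(ids); logits = AdaptiveDecoder()(token)
```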
The synthesis point: both architectures suggest that requiring "expensive internal reasoning" to "always be projected down to a single verbalized next token" appears wasteful (Latent Depth paper). A continuous latent space can explore multiple reasoning directions simultaneously, without the linear, sequential structure that token generation imposes.
This challenges "Does more thinking time actually improve LLM reasoning?" from an unexpected direction: the myth assumes verbalized tokens are the unit of thinking, while latent reasoning questions whether tokens should be the unit at all.
The connection to human cognition is philosophically interesting: "a substantial amount of thought happens through complex, recurrent firing patterns in the brain, before the first word of an answer is uttered." Latent reasoning may capture facets of human reasoning (spatial thinking, physical intuition) that resist verbalization, which current verbalized CoT approaches cannot access by design.
Coconut (Chain of Continuous Thought): A third approach feeds the last hidden state back as the next input embedding directly in continuous space, bypassing the language-model head and the embedding layer entirely. A continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform breadth-first search (BFS) naturally, rather than committing to a single deterministic path as CoT does. Coconut outperforms CoT on logical reasoning tasks that require substantial backtracking. The neuroscience grounding is direct: neuroimaging studies consistently show that the language network remains largely inactive during reasoning tasks, and language appears optimized for communication rather than reasoning. This suggests verbalized CoT forces reasoning through a communication channel it was never designed for. The CoT-unfaithfulness literature reinforces this: even when models generate explicit reasoning chains, they may rely on a different latent reasoning process internally.
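A minimal sketch of the inference-time feedback loop, assuming a HuggingFace-style causal LM whose hidden size matches its embedding size; Coconut also requires a staged training curriculum, which this sketch omits, and num_latent_steps is an arbitrary illustrative value.

```python
# Illustrative sketch of Coconut-style inference: the last hidden state is fed
# back as the next input embedding, so "thoughts" never pass through the LM head.
# Assumes a HuggingFace-style causal LM; Coconut additionally needs special training.
import torch

@torch.no_grad()
def continuous_thought(model, tokenizer, prompt: str, num_latent_steps: int = 4):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    inputs_embeds = model.get_input_embeddings()(ids)
    for _ in range(num_latent_steps):
        out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
        # Take the final layer's hidden state at the last position and append it
        # directly as the next input embedding, skipping vocabulary projection.
        last_hidden = out.hidden_states[-1][:, -1:, :]
        inputs_embeds = torch.cat([inputs_embeds, last_hidden], dim=1)
    # Only after the latent steps do we project to the vocabulary for an answer token.
    next_token_logits = model(inputs_embeds=inputs_embeds).logits[:, -1, :]
    return next_token_logits.argmax(dim=-1)
```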
Hierarchical Reasoning Model (HRM): A fourth distinct latent reasoning architecture adds brain-inspired multi-timescale processing. HRM couples a slow high-level module (abstract planning) with a fast low-level module (detailed computation) in a hierarchical recurrence. The fast module runs to equilibrium, then the slow module advances; this "hierarchical convergence" avoids the premature convergence of standard recurrence. With only 27M parameters and 1,000 training samples (no pretraining, no CoT), HRM achieves near-perfect accuracy on Sudoku-Extreme and 30×30 maze pathfinding, tasks where CoT methods fail completely (0% accuracy). It uses an O(1)-memory gradient approximation at equilibrium, avoiding backpropagation through time (BPTT) entirely. See "Can recurrent hierarchies achieve reasoning that transformers cannot?".
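The nested-loop structure can be made concrete with a toy sketch under loudly stated assumptions: the GRU cells, dimensions, and step counts are invented for illustration, and the real HRM's O(1)-memory equilibrium gradient is only noted in a comment, not implemented.

```python
# Toy sketch of hierarchical (dual-timescale) recurrence. GRU cells and step
# counts are invented; the real HRM uses different modules and a one-step,
# O(1)-memory gradient at equilibrium instead of backpropagating through the loop.
import torch
import torch.nn as nn

class HierarchicalRecurrence(nn.Module):
    def __init__(self, d: int = 128):
        super().__init__()
        self.low = nn.GRUCell(2 * d, d)    # fast module: detailed computation
        self.high = nn.GRUCell(d, d)       # slow module: abstract planning

    def forward(self, x: torch.Tensor, high_steps: int = 8, low_steps: int = 16):
        z_high = torch.zeros(x.size(0), self.high.hidden_size, device=x.device)
        z_low = torch.zeros_like(z_high)
        for _ in range(high_steps):                    # slow timescale
            for _ in range(low_steps):                 # fast timescale
                # The fast state is conditioned on the input and the current plan.
                z_low = self.low(torch.cat([x, z_high], dim=-1), z_low)
            # Only after the fast module settles does the slow module advance;
            # this "hierarchical convergence" keeps the dynamics from freezing early.
            z_high = self.high(z_low, z_high)
        return z_high

# Hypothetical usage: out = HierarchicalRecurrence()(torch.randn(2, 128))
```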
Theoretical consolidation: These converging architectures now have a formal theoretical framework. Under the hypotheses laid out in "Where does LLM reasoning actually happen during generation?", the depth-recurrent, Heima, Coconut, HRM, and energy-based approaches all constitute evidence for H1 (latent-state trajectories as the primary reasoning medium). The framework also clarifies why these approaches work: if reasoning is fundamentally a latent-state process, then architectures that operate directly in latent space work with the native medium rather than forcing it through the bottleneck of discrete verbalization. Furthermore, as "Can we trigger reasoning without explicit chain-of-thought prompts?" shows, the latent reasoning capability exists even in standard transformer architectures; specialized latent architectures may be optimizing the medium rather than creating a new capability.
Practical constraint on retrofitting: a critical deployment caveat comes from "Can continuous reasoning avoid forgetting in instruction-tuned models?", which shows that fine-tuning already-capable instruction-tuned models for continuous reasoning via Coconut/CCoT methods causes catastrophic forgetting. This limits the Coconut approach to training-from-scratch scenarios and motivates frozen-backbone alternatives for enhancing existing models.
Source: Reasoning Architectures, Novel Architectures, Cognitive Models Latent
Related concepts in this collection
- How should we balance parallel versus sequential compute at test time?
Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
latent recurrence is neither: it scales depth per token rather than breadth or chain length
- Does more thinking time actually improve LLM reasoning?
The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
latent reasoning suggests the token-is-thinking assumption embedded in all test-time-scaling benchmarks may be wrong
- Can minimal reasoning chains match full explanations?
Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.
CoD uses fewer tokens; latent reasoning uses zero tokens for intermediate steps; same direction of travel
- Can we allocate inference compute based on prompt difficulty?
Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
latent recurrence with early stopping implements adaptive compute at the token level, not the prompt level
- Can recurrent hierarchies achieve reasoning that transformers cannot?
Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.
fourth latent reasoning architecture: hierarchical multi-timescale recurrence
- Can parallel architectures solve fundamentally sequential problems?
Explores whether pure parallel computation—like Transformers—can tackle problems requiring long chains of dependent reasoning, or if serial depth is theoretically necessary for certain classes of problems.
complexity-theoretic foundation: latent recurrence is necessary for inherently serial problems
- Can we explore multiple reasoning paths without committing to one token?
Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
training-free approach to continuous-space reasoning via probability-weighted token mixture
- Can energy minimization unlock reasoning without domain-specific training?
Can a gradient descent-based architecture achieve system 2 thinking across any modality or problem type using only unsupervised learning, without verifiers or reasoning-specific rewards?
fifth latent reasoning approach: energy minimization as iterative gradient descent at inference time, distinct from depth-recurrent, Heima, Coconut, and HRM; 35% higher scaling rate than Transformer++, modality-agnostic without domain-specific training
- Where does LLM reasoning actually happen during generation?
Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.
provides the theoretical framework (H1/H2/H0) that organizes all these architectures as evidence for H1
- Can we trigger reasoning without explicit chain-of-thought prompts?
This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
mechanistic evidence: latent reasoning is not just architecturally achievable but causally controllable via a single feature
- Can continuous reasoning avoid forgetting in instruction-tuned models?
Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?
validates a practical concern: Coconut-style fine-tuning causes catastrophic forgetting on capable models; SoftCoT provides the retrofit-safe alternative
Original note title: latent reasoning in continuous space scales test-time compute without verbalized tokens or specialized training data