Where does LLM reasoning actually happen during generation?
Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.
The field studies "LLM reasoning" without agreeing on what the primary object of study is. Three views coexist but make incompatible predictions:
H2 (surface CoT): Multi-step reasoning is primarily mediated by explicit surface chain-of-thought. The chain IS the reasoning. This requires surface traces to provide the most stable causal leverage — but ordinary CoT is often useful without being reliably faithful, and its role varies sharply across tasks.
H0 (generic serial compute): Most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. More tokens = more FLOPs, regardless of what those tokens say. This requires matched serial compute to explain most gains — but extra budget alone cannot explain why specific internal states, features, or trajectories can predict or alter reasoning behavior.
H1 (latent-state trajectories): Multi-step reasoning is primarily mediated by latent-state trajectories, with surface CoT serving only as a partial interface. Task-relevant commitment arises in hidden-state dynamics that are only partly verbalized, or not verbalized at all.
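For scale, here is a back-of-envelope version of the H0 claim, using the standard ~2 × n_params FLOPs-per-token approximation for one decoder forward pass; the model size and token counts below are illustrative, not figures from the source:

```python
# Back-of-envelope H0 arithmetic (illustrative 7B model; ~2 * n_params
# FLOPs per token is a standard forward-pass approximation).
n_params = 7e9
flops_per_token = 2 * n_params

direct_tokens = 4        # terse answer alone
cot_tokens = 4 + 256     # same answer preceded by a 256-token chain

extra = (cot_tokens - direct_tokens) * flops_per_token
print(f"extra serial compute from the chain: {extra:.2e} FLOPs")  # ~3.58e12
```

Under H0, this extra serial budget, not anything the 256 tokens say, is what drives the gain.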
The difficulty is that recent methods typically move several factors at once: CoT prompting changes both visible traces and compute allocation; latent reasoning methods change both hidden-state dynamics and compute budget; test-time scaling changes compute and usually changes the output path. Without designs that explicitly disentangle these three factors, experimental results cannot distinguish which hypothesis they support.
The paper argues H1 should be the default working hypothesis — not as a task-independent verdict, but because the strongest evidence currently available points toward latent-state dynamics as having the most stable causal leverage. The recommendation: treat latent-state dynamics as the default object of study and design evaluations that explicitly separate surface traces, latent states, and serial compute.
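As a concrete illustration of that recommendation, the sketch below encodes only the disentangling logic: which pairwise contrasts between experimental conditions are interpretable and which are confounded. The condition names, token budgets, and three-factor encoding are hypothetical choices, not a protocol from the source.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Condition:
    name: str
    visible_trace: bool        # does the prompt elicit explicit CoT text?
    latent_intervention: bool  # do we steer or patch hidden states directly?
    serial_tokens: int         # forward-pass token budget

CONDITIONS = [
    Condition("direct_answer",   False, False, 1),
    Condition("cot",             True,  False, 64),
    Condition("filler_tokens",   False, False, 64),  # compute matched to CoT, no trace
    Condition("latent_steering", False, True,  1),   # intervene without trace or budget
]

def confounded_pairs(conds):
    """Pairs of conditions that differ on more than one factor, so an
    accuracy difference between them cannot be attributed to any single one."""
    factors = lambda c: (c.visible_trace, c.latent_intervention, c.serial_tokens)
    return [(a.name, b.name) for a, b in combinations(conds, 2)
            if sum(x != y for x, y in zip(factors(a), factors(b))) > 1]

print(confounded_pairs(CONDITIONS))
# [('direct_answer', 'cot'), ('cot', 'latent_steering'),
#  ('filler_tokens', 'latent_steering')]
```

Note that the standard CoT-vs-direct comparison is the first confounded pair flagged: it moves trace and compute together, which is exactly the entanglement described above.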
This framework organizes several existing findings. "Do language models actually use their reasoning steps?" empirically weakens H2: if surface traces aren't causally faithful, they cannot be the primary reasoning medium. "Does chain-of-thought reasoning reflect genuine thinking or performance?" shows H2 failing specifically on easy tasks, where the answer is determined before the CoT begins, while H1 and H0 remain viable. And "Can we trigger reasoning without explicit chain-of-thought prompts?" shows that direct latent intervention provides causal evidence for H1 that neither H2 nor H0 can explain.
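One way to make "causally faithful" operational is a step-corruption probe. The sketch below is a generic version of that idea, not the cited note's exact protocol; `generate` and `corrupt` are hypothetical callables:

```python
# Generic step-corruption probe (hypothetical callables, not the cited
# note's protocol): if corrupting steps rarely flips the answer, the
# surface trace is not the causal medium, which weakens H2.
def faithfulness_flip_rate(generate, corrupt, question, chain_steps):
    """Fraction of single-step corruptions that change the final answer."""
    prompt = lambda steps: question + "\n" + "\n".join(steps) + "\nAnswer:"
    baseline = generate(prompt(chain_steps))
    flips = 0
    for i in range(len(chain_steps)):
        mutated = chain_steps[:i] + [corrupt(chain_steps[i])] + chain_steps[i + 1:]
        flips += generate(prompt(mutated)) != baseline
    return flips / max(len(chain_steps), 1)

# Toy stand-in: a model that ignores its own chain entirely (flip rate 0.0).
print(faithfulness_flip_rate(lambda p: "42", lambda s: s[::-1],
                             "Q: ...", ["step 1", "step 2"]))
```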
Additional evidence converges from multiple angles. "Why does reasoning training help math but hurt medical tasks?" gives H1 architectural grounding through its layer separation: reasoning is a latent higher-layer process, not a surface token-generation phenomenon. "Why do language models fail to act on their own reasoning?" shows that even when the surface trace (the rationale) is correct, the latent computation (action selection) diverges, a behavioral signature of the surface-latent disconnect that H1 predicts. And "Can we measure how deeply a model actually reasons?" supplies an H1-native measurement tool: DTR tracks latent computational depth per token rather than surface trace properties, and it outperforms surface-level metrics as an accuracy predictor.
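The source does not give DTR's formula, so the sketch below is only an assumed H1-style stand-in: a logit-lens readout that scores each token by how much the layerwise next-token distribution keeps shifting with depth. The KL-across-layers definition is an illustration, not DTR's actual metric:

```python
import torch
import torch.nn.functional as F

def latent_depth_per_token(hidden_states: torch.Tensor,
                           unembed: torch.Tensor) -> torch.Tensor:
    """hidden_states: (layers, seq, d_model) residual stream after each layer.
    unembed: (d_model, vocab) logit-lens readout matrix.
    Returns (seq,) scores: total KL shift of the next-token distribution
    across consecutive layers (large = the prediction forms late in depth)."""
    logprobs = F.log_softmax(hidden_states @ unembed, dim=-1)           # (L, T, V)
    kl = (logprobs[1:].exp() * (logprobs[1:] - logprobs[:-1])).sum(-1)  # (L-1, T)
    return kl.sum(0)

# Toy usage with random tensors standing in for a real model's activations.
L, T, D, V = 12, 8, 64, 100
scores = latent_depth_per_token(torch.randn(L, T, D), torch.randn(D, V))
print(scores.shape)  # torch.Size([8])
```

The design choice this encodes is the H1 stance itself: the signal is read from hidden-state dynamics per token, never from the surface trace.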
The sharpest implication: the field's default assumption (H2) may be distorting research priorities. If the reasoning object is latent, then benchmarks that evaluate chains, faithfulness metrics that read traces, and interpretability methods that parse CoT are all measuring a secondary phenomenon.
Source: Cognitive Models Latent Paper: LLM Reasoning Is Latent, Not the Chain of Thought
Related concepts in this collection
- Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations. (Role: empirical evidence weakening H2.)
- Does chain-of-thought reasoning reflect genuine thinking or performance? When language models generate step-by-step reasoning, are they actually thinking through problems or just producing text that looks like reasoning? This matters for understanding whether extended reasoning tokens add real computational value. (Role: difficulty-dependent H2 failure.)
- Can models reason without generating visible thinking tokens? Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities. (Role: H1 implementations.)
- Does chain-of-thought reasoning reveal genuine inference or pattern matching? Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly. (Role: theoretical argument against H2.)
- Can we trigger reasoning without explicit chain-of-thought prompts? This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable. (Role: causal evidence for H1.)
- Why does reasoning training help math but hurt medical tasks? Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains. (Role: layer-level mechanistic grounding for H1; reasoning localizes to higher layers as a latent process, not as surface token generation.)
- Can we measure how deeply a model actually reasons? What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching? (Role: an H1-native measurement; DTR measures latent computational depth rather than surface trace properties.)
- Why do language models fail to act on their own reasoning? LLMs generate correct step-by-step reasoning 87% of the time but only follow through with matching actions 64% of the time. What drives this gap between knowing and doing? (Role: behavioral evidence for the latent-surface disconnect; models produce correct surface reasoning but act on latent computations that don't follow it.)
Original note title: LLM reasoning should be studied as latent-state trajectory formation, not as faithful surface chain-of-thought — three competing hypotheses can be empirically separated