LLM Reasoning Is Latent, Not the Chain of Thought

Paper · arXiv 2604.15726 · Published April 17, 2026
Cognitive Models · Latent Reasoning · Methods · CoT · ToT

This position paper argues that large language model (LLM) reasoning should be studied as latent-state trajectory formation rather than as faithful surface chain-of-thought (CoT). This matters because claims about faithfulness, interpretability, reasoning benchmarks, and inference-time intervention all depend on what the field takes the primary object of reasoning to be. We ask what that object should be once three often-confounded factors are separated, and we formalize three competing hypotheses: H1, reasoning is primarily mediated by latent-state trajectories; H2, reasoning is primarily mediated by explicit surface CoT; and H0, most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. Reorganizing recent empirical, mechanistic, and survey work under this framework, and adding compute-audited worked exemplars that factorize surface traces, latent interventions, and matched budget expansions, we find that current evidence most strongly supports H1 as a default working hypothesis rather than as a task-independent verdict. We therefore make two recommendations: the field should treat latent-state dynamics as the default object of study for LLM reasoning, and it should evaluate reasoning with designs that explicitly disentangle surface traces, latent states, and serial compute.

Large language models now solve many arithmetic, symbolic, and planning-like tasks more effectively when they are given extra intermediate computation. In practice, this computation may appear as explicit chains of thought, self-consistency, deliberate search, or other inference-time expansions [1–5]. These gains have made reasoning a central object of current LLM research. They have also made it harder to say what the field is actually studying when it studies reasoning. This matters because claims about faithfulness, interpretability, reasoning benchmarks, and inference-time interventions all depend on what the field takes the primary object of reasoning to be.
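To fix ideas, one of the expansions named above, self-consistency, amounts to sampling several independent reasoning traces and majority-voting their final answers. The sketch below is purely illustrative; `sample_answer` is a hypothetical wrapper around any sampled decoder, not an API from the surveyed work.

```python
# Illustrative sketch of self-consistency as an inference-time expansion.
# Assumption: `sample_answer` is a hypothetical callable that samples one
# reasoning trace for `question` and returns only its final answer string.
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[str], str],
                     question: str, k: int = 16) -> str:
    """Sample k independent traces and return the majority-vote answer."""
    votes = Counter(sample_answer(question) for _ in range(k))
    return votes.most_common(1)[0][0]
```

Note that each of the k samples enlarges both the visible trace and the serial-compute budget at once, which is precisely the entanglement discussed next.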

Current work supports at least three incompatible readings of the same phenomenon. One view treats surface chain-of-thought as the reasoning process itself. A second view treats many reasoning gains as consequences of extra serial compute, regardless of representational form. A third view treats multi-step reasoning as a latent process that can be only partly verbalized, or not verbalized at all [6–10]. The difficulty is that recent methods often move several explanatory factors at once, making experimental results hard to interpret as causal support for any specific view: chain-of-thought prompting changes both visible traces and compute allocation; latent reasoning methods often change both hidden-state dynamics and compute budgets; and test-time scaling changes compute and usually the output path as well [4, 5, 9].
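The confound is easy to make concrete. The sketch below contrasts three prompting conditions under an explicitly controlled decode budget; it is a minimal illustration rather than this paper's protocol, with "gpt2" as a stand-in checkpoint, toy prompts, and `max_new_tokens` as a crude proxy for serial compute.

```python
# Minimal sketch of the confound: a CoT prompt moves BOTH the visible trace
# and the serial-compute budget unless the budget is matched explicitly.
# Assumptions: "gpt2" is a stand-in checkpoint, the prompts are toys, and
# max_new_tokens serves as a crude proxy for serial compute.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

QUESTION = "Q: A train leaves at 3pm and travels for 2 hours. When does it arrive? A:"

def generate(prompt: str, max_new_tokens: int) -> str:
    """Greedy decode; max_new_tokens is the explicit serial-compute knob."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=max_new_tokens,
                             do_sample=False, pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

# Condition A (direct): short budget, no visible trace.
direct = generate(QUESTION, max_new_tokens=8)

# Condition B (CoT): the usual prompt moves both factors at once; it licenses
# a visible trace AND enlarges the decode budget.
cot = generate("Let's think step by step. " + QUESTION, max_new_tokens=64)

# Condition C (budget-matched control): the same enlarged budget as B, but
# the prompt asks for filler instead of a meaningful trace. Comparing B to C,
# rather than to A alone, is what separates trace content from compute.
filler = generate("Write '...' several times before answering. " + QUESTION,
                  max_new_tokens=64)
```

The point of condition C is that if matched budget alone recovered most of B's gain, the gain would be evidence about compute, not about the trace.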

The first task, then, is to separate the objects that recent work often conflates. Section 2 does so by distinguishing surface traces, latent-state dynamics, and generic serial compute, and by turning three loose views into three concrete hypotheses: H2 treats multi-step reasoning as primarily mediated by explicit surface CoT; H0 treats most apparent reasoning gains as better explained by generic serial compute than by any privileged representational object; and H1 treats multi-step reasoning as primarily mediated by latent-state trajectories, with surface CoT serving only as a partial interface. Because H2, H0, and H1 assign explanatory priority to surface traces, serial compute, and latent trajectories, respectively, they make different predictions about where the strongest causal leverage should lie. Stated this way, the debate is no longer about whether CoT helps. It is about what such help is evidence of.

Under that standard, the current record does not support all three views equally. The strongest case for H2 would require surface traces to provide the most stable causal leverage, yet ordinary CoT is often useful without being reliably faithful, and its role varies sharply across tasks [6, 7]. The strongest case for H0 would require matched serial compute to explain most reasoning gains, yet extra budget alone does not explain why specific internal states, features, or trajectories can predict or alter reasoning behavior [4, 5]. By contrast, recent work on latent-state monitoring and latent reasoning suggests that task-relevant commitment can arise in hidden-state dynamics that are only partly verbalized, or not verbalized at all [8–10]. Section 3 develops this comparison in detail. We therefore argue that latent-state dynamics should be treated as the default working object of study for LLM reasoning, rather than assuming faithful surface chain-of-thought by default. The differential predictions these hypotheses make admit a simple operational form, sketched below.
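As a hedged illustration of that operational form, the sketch below encodes each hypothesis as a condition on accuracies from a factorized evaluation. All condition names, the tolerance `eps`, and the decision rule itself are simplifying assumptions for exposition, not the formalization given in Section 2.

```python
# Hypothetical decision rule mapping a factorized evaluation onto H0/H1/H2.
# All condition names and the tolerance `eps` are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Conditions:
    baseline: float       # accuracy with direct answering, small budget
    cot: float            # accuracy with ordinary surface CoT
    scrambled_cot: float  # trace perturbed or paraphrased, budget preserved
    filler_budget: float  # budget matched to CoT, no meaningful trace
    latent_patch: float   # task-relevant hidden states transplanted, no trace

def best_supported(c: Conditions, eps: float = 0.02) -> str:
    gain = c.cot - c.baseline
    # H0 predicts matched serial compute alone recovers most of the CoT gain.
    if c.filler_budget - c.baseline >= gain - eps:
        return "H0"
    # H2 predicts the gain lives in the trace: perturbing it should erase it.
    if c.scrambled_cot - c.baseline <= eps:
        return "H2"
    # H1 predicts trace perturbation leaves most of the gain intact, while
    # direct latent interventions move behavior on their own.
    if (c.scrambled_cot - c.baseline >= gain - eps
            and c.latent_patch - c.baseline > eps):
        return "H1"
    return "mixed / underdetermined"
```

With invented numbers chosen only to show the rule firing, `best_supported(Conditions(baseline=0.40, cot=0.70, scrambled_cot=0.69, filler_budget=0.45, latent_patch=0.62))` returns "H1": the trace can be perturbed cheaply while latent interventions still move behavior.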