INQUIRING LINE

How does latent reasoning recursion compare to chain-of-thought reasoning?

This explores the difference between reasoning that loops in a model's hidden state (recursive latent reasoning) and reasoning written out as explicit text steps (chain-of-thought) — what each actually does and where the real computation lives.


The corpus suggests that latent recursion and chain-of-thought are not just two styles of the same thing. They may operate at different layers entirely, with chain-of-thought serving as a partial readout of a process that mostly happens elsewhere.

The sharpest framing is that LLM reasoning is best understood as a trajectory through hidden states, not as the text it produces (Where does LLM reasoning actually happen during generation?). On this view, the visible chain-of-thought is an interface, a surface narration, while the actual inference runs underneath it. That reframes the comparison: latent reasoning isn't an alternative to CoT so much as the thing CoT is gesturing at. A striking demonstration is that directly steering SAE-identified reasoning features can trigger reasoning behavior that matches or beats explicit CoT prompting, and it does so without writing any steps at all (Can we trigger reasoning without explicit chain-of-thought prompts?). On this evidence, the reasoning was a latent capability the whole time; the prose was optional.
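To make "steering an internal feature" concrete, here is a minimal sketch of the general idea: add a scaled feature direction to a hidden-state vector. The vectors, the `steer` helper, and the `alpha` strength are all hypothetical illustrations, not the procedure used in the cited note, and a real model would apply this inside a forward pass rather than to a four-element list.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, alpha):
    """Shift a hidden state along a (normalized) feature direction by alpha."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, d)]

# Hypothetical values: a toy hidden state and a toy "reasoning feature" direction.
hidden = [0.2, -1.0, 0.5, 0.0]
feature_direction = [1.0, 0.0, 1.0, 0.0]

steered = steer(hidden, feature_direction, alpha=4.0)

# The feature's activation (projection onto the direction) rises by alpha,
# while components orthogonal to the direction are left unchanged.
before = dot(hidden, unit(feature_direction))
after = dot(steered, unit(feature_direction))
```

No prompt text changes in this sketch; only the internal vector moves, which is the sense in which the intervention bypasses written steps.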

Where recursion adds something genuinely new is in handling uncertainty. Recursive latent reasoners that update their internal state deterministically can only carry one line of thought forward. Making those latent transitions stochastic lets a model hold a distribution over possible solutions instead of committing early, which is useful when a problem is ambiguous or has several valid strategies (Can stochastic latent reasoning help models explore multiple solutions?). That same machinery lets reasoning scale in width, by sampling parallel internal trajectories, rather than only in depth, sidestepping the serial latency cost of longer and longer chains (Can reasoning systems scale wider instead of only deeper?). A single chain-of-thought, being one linear text stream, is structurally stuck scaling in depth.
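The contrast between deterministic and stochastic latent updates can be sketched with a toy recurrence. This is an illustrative assumption, not the GRAM architecture: the scalar state, the two "solutions" at -1 and +1, and the names `step`, `run`, `sample_width`, and `decode` are all hypothetical.

```python
import random

def step(state, noise, rng):
    """One toy latent update: optionally perturb, then move halfway to the nearer solution."""
    perturbed = state + (rng.gauss(0.0, noise) if noise > 0 else 0.0)
    target = 1.0 if perturbed >= 0 else -1.0
    return perturbed + 0.5 * (target - perturbed)

def run(initial, depth, noise=0.0, rng=None):
    """Recurse the latent state `depth` times (scaling in depth)."""
    rng = rng or random.Random(0)
    state = initial
    for _ in range(depth):
        state = step(state, noise, rng)
    return state

def sample_width(initial, depth, width, noise, seed=0):
    """Draw `width` independent trajectories (scaling in width)."""
    rng = random.Random(seed)
    return [run(initial, depth, noise, rng) for _ in range(width)]

def decode(state):
    """Read out which of the two toy solutions the state settled near."""
    return "A" if state >= 0 else "B"

# An ambiguous starting point, exactly between the two solutions.
deterministic = {decode(run(0.0, depth=8)) for _ in range(5)}
stochastic = {decode(s) for s in sample_width(0.0, depth=8, width=64, noise=0.5)}
```

With no noise, every run from the ambiguous start lands on the same answer. With noise, the set of sampled trajectories covers both answers, and the 64 trajectories are independent of one another, so in principle they could be evaluated in parallel rather than one after another.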

Meanwhile, the corpus is unusually skeptical about what CoT really is. Several notes converge on the verdict that chain-of-thought reproduces the *form* of reasoning through learned patterns rather than performing genuine logical inference: performance tracks format more than content, structurally invalid prompts work nearly as well as valid ones, and accuracy degrades predictably under distribution shift (Does chain-of-thought reasoning reveal genuine inference or pattern matching?; What makes chain-of-thought reasoning actually work?; What makes chain-of-thought reasoning actually work?). The same brittleness shows up when semantic content is stripped out: models reason through associations, not symbol manipulation (Do large language models reason symbolically or semantically?). So CoT's apparent transparency may be partly theater.

The practical upshot is that much of a chain-of-thought is doing no computational work. Concise chains match verbose ones using about 7.6% of the tokens (Can minimal reasoning chains match full explanations?), dynamic pruning can cut three-quarters of steps with no accuracy loss (Can reasoning steps be dynamically pruned without losing accuracy?), optimal length follows an inverted-U that shrinks as models get more capable (Why does chain of thought accuracy eventually decline with length?), and for simple questions step-by-step prompting actively hurts (Why do some questions perform better without step-by-step reasoning?). Read together, these point in the same direction as the latent-reasoning work: the verbose chain is mostly documentation wrapped around a much smaller hidden core. The interesting question the corpus leaves you with isn't "which is better". It is whether explicit reasoning text is a window into the computation or a story told after the fact.
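The pruning idea above can be sketched as a simple filter: keep only the reasoning steps that later tokens attend to. Everything here is a hypothetical illustration, including the `Step` type, the attention scores, the 0.2 threshold, and whitespace splitting as a stand-in for tokenization; it is not the PI framework's actual procedure.

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str         # e.g. "setup", "compute", "verify", "backtrack", "answer"
    text: str
    attention: float  # hypothetical downstream attention mass in [0, 1]

def prune(steps, threshold=0.2):
    """Keep only steps whose downstream attention meets the threshold."""
    return [s for s in steps if s.attention >= threshold]

def token_count(steps):
    """Crude token count: whitespace-separated words."""
    return sum(len(s.text.split()) for s in steps)

# Made-up chain with made-up scores, for illustration only.
chain = [
    Step("setup", "Let x be the number of apples", 0.45),
    Step("compute", "x = 12 / 3 = 4", 0.60),
    Step("verify", "Check: 4 * 3 = 12, which matches", 0.05),
    Step("backtrack", "Wait, maybe reconsider the division", 0.03),
    Step("answer", "So the answer is 4", 0.70),
]

kept = prune(chain)
kept_kinds = [s.kind for s in kept]
token_ratio = token_count(kept) / token_count(chain)
```

In this toy chain the low-attention verification and backtracking steps drop out and the steps that carry the answer remain. Whether accuracy actually survives such pruning is an empirical question the sketch does not address.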


Sources (12 notes)

Where does LLM reasoning actually happen during generation?

Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.

Can we trigger reasoning without explicit chain-of-thought prompts?

SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Can reasoning steps be dynamically pruned without losing accuracy?

The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Next inquiring lines