Can latent reasoning achieve the same substitution without tokens?

This explores whether reasoning can move into the model's hidden states — skipping the visible word-by-word 'thinking out loud' — and still do the same work that chain-of-thought tokens do.

The question is whether latent reasoning (computation carried out in hidden states or embedding space) can substitute for the visible chain-of-thought tokens models normally generate, and whether anything is lost when the words disappear. The corpus suggests the substitution is largely viable: several lines of evidence indicate that verbalization is more a training habit than a computational necessity.

The most direct support comes from work showing that models can scale test-time compute by iterating on hidden states rather than emitting tokens (see "Can models reason without generating visible thinking tokens?"). Depth-recurrent architectures, Coconut, and Heima all reason in latent space and reach answers without spelling out intermediate steps, which frames verbalization as an artifact, not a requirement. A striking complement: logit lens analysis shows that transformers trained to hide their chain-of-thought compute the correct answer in their earliest layers (layers 1-3), then suppress it in the final layers to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). In that setting the reasoning was never in the tokens; it was already done internally, and the visible output was theater. That reframes the whole question: if the real computation is latent even during 'normal' CoT, then dropping the tokens doesn't remove the reasoning, only the performance of it.
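The idea of spending more compute by iterating on a hidden state, rather than by emitting tokens, can be sketched in a few lines. This is a toy illustration only, not the implementation of any depth-recurrent model, Coconut, or Heima: the `step` update, its fixed weight `w`, and the names `latent_reason` and `decode` are all hypothetical, chosen so that extra iterations visibly refine the state while nothing is verbalized in between.

```python
import math

def step(h, x, w=0.5):
    """One hypothetical recurrent update: mix the previous hidden state with the input."""
    return [math.tanh(w * hi + xi) for hi, xi in zip(h, x)]

def latent_reason(x, num_iters):
    """Refine a hidden state num_iters times; no token is emitted inside the loop."""
    h = [0.0] * len(x)
    for _ in range(num_iters):
        h = step(h, x)
    return h

def decode(h):
    """Hypothetical readout applied once, at the end: index of the largest hidden unit."""
    return max(range(len(h)), key=lambda i: h[i])

x = [0.1, 0.9, -0.3]           # made-up input features
h = latent_reason(x, num_iters=8)
answer = decode(h)             # the only "visible" output
```

The design point is that `num_iters` is the compute knob: raising it changes how long the model 'thinks' without changing how much it says.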

This fits a broader pattern in the corpus: the tokens carry far less of the load than they appear to. Chain of Draft matches full CoT accuracy using just 7.6% of the tokens; the other 92.4% served style and documentation rather than computation (see "Can minimal reasoning chains match full explanations?"). Models trained on systematically irrelevant traces perform comparably to those trained on correct ones, and sometimes generalize better out of distribution, suggesting traces work as computational scaffolding, not meaningful steps (see "Do reasoning traces need to be semantically correct?"). When researchers prune reasoning chains by importance, symbolic-computation tokens are preferentially kept while grammar and meta-discourse are pruned first (see "Which tokens in reasoning chains actually matter most?"). This echoes the finding that only ~20% of tokens are high-entropy 'forking' points, and that this minority carries the learning signal (see "Do high-entropy tokens drive reasoning model improvements?"). If most tokens are disposable, the case for dropping them entirely strengthens.
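To make the notion of high-entropy 'forking' tokens concrete, here is a minimal sketch that ranks positions by the Shannon entropy of their next-token distribution and keeps the top 20%. It is an assumption-laden illustration, not the procedure from the cited work: the function names, the hard `fraction` cutoff, and the example distributions are all made up for this sketch.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_positions(distributions, fraction=0.2):
    """Return the indices of the highest-entropy positions, keeping `fraction` of them."""
    k = max(1, round(fraction * len(distributions)))
    ranked = sorted(range(len(distributions)),
                    key=lambda i: entropy(distributions[i]),
                    reverse=True)
    return sorted(ranked[:k])

# Made-up distributions over a 4-token vocabulary at 5 positions.
# Position 1 is uniform (the model is undecided); the rest are sharply peaked.
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.80, 0.10, 0.05, 0.05],
]
print(forking_positions(dists))  # [1]
```

Low-entropy positions are the ones where the continuation is nearly forced, which is one way to read the claim that most tokens are grammar and meta-discourse rather than decisions.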

There are also non-token alternatives that go further than just hiding the words. Large Concept Models reason over whole sentence embeddings in a language-agnostic space before decoding (see "Can reasoning happen at the sentence level instead of tokens?"). Diffusion LLMs refine reasoning in place alongside the answer rather than generating it left to right; because answer confidence converges early, early-exit mechanisms can cut compute by 50% while maintaining accuracy (see "Can reasoning and answers be generated separately in language models?"). These aren't just compression; they're different substrates for the same work.
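The compute saving reported for diffusion LLMs rests on the observation, noted in the sources, that answer confidence converges early while the reasoning keeps refining. A stopping rule of that general shape might look like the sketch below. This is not ICE's actual mechanism: the `early_exit_step` name, the `tol` and `patience` parameters, and the confidence trace are hypothetical.

```python
def early_exit_step(confidences, tol=0.01, patience=2):
    """Return the 1-based refinement step at which to stop.

    Stops at the first step where answer confidence has changed by less than
    `tol` for `patience` consecutive steps; otherwise runs the full schedule.
    """
    stable = 0
    for t in range(1, len(confidences)):
        if abs(confidences[t] - confidences[t - 1]) < tol:
            stable += 1
            if stable >= patience:
                return t + 1
        else:
            stable = 0
    return len(confidences)

# Made-up answer-confidence trace over an 8-step refinement schedule.
trace = [0.40, 0.70, 0.90, 0.905, 0.906, 0.906, 0.907, 0.907]
stop = early_exit_step(trace)
saved = 1 - stop / len(trace)  # fraction of refinement steps skipped
```

With this illustrative trace the rule stops at step 5 of 8; the actual saving in any real system depends on how early its confidence stabilizes.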

The catch the corpus raises: what gets substituted may not have been genuine reasoning to begin with. If chain-of-thought is constrained imitation of reasoning's *form* rather than real abstract inference (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"), and if models lean on semantic associations rather than symbolic logic (see "Do large language models reason symbolically or semantically?"), then moving into latent space inherits those same ceilings: it makes reasoning cheaper and faster without making it more genuine. So the honest answer is that latent reasoning can likely match what tokens do, precisely because the tokens were doing less than they appeared to.

Sources (10 notes)

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Can reasoning and answers be generated separately in language models?

ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
