Why does latent-level prediction beat token-level prediction for reasoning?
This explores why letting a model predict its own internal representations (latents) — rather than the next visible word (token) — seems to help reasoning, and what the corpus says is actually doing the work.
The corpus offers one clean theoretical reason and a set of empirical hints that all point the same way. The cleanest answer is statistical: a formal sample-complexity analysis shows that predicting your own latents recovers compositional, hierarchical structure with a number of examples that stays roughly constant as the hierarchy deepens, while token-level learning needs exponentially more. The reason is that latents at the same level of abstraction are far more correlated with each other than raw tokens are (see "Why is predicting latents more sample-efficient than tokens?"). Tokens are a noisy, surface-level encoding of the thing you actually want to learn; latents sit closer to the structure, so the learning signal is denser.
The architectural work makes this concrete. Meta's Large Concept Model reasons over whole sentence embeddings in a language-agnostic space and plans at the paragraph level before decoding to words, producing more coherent output than flat token-by-token generation (see "Can reasoning happen at the sentence level instead of tokens?"). A cluster of other systems (Coconut, Heima, depth-recurrent models) scale test-time compute by iterating on hidden states instead of emitting visible thinking tokens at all, which suggests that verbalization is a training artifact rather than a requirement for reasoning (see "Can models reason without generating visible thinking tokens?"). In other words, the words may be a lossy export of a computation that was always happening underneath.
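The contrast between iterating on hidden states and emitting a token at every step can be sketched in a few lines. This is a toy illustration only, not the Coconut, Heima, or depth-recurrent architecture: the matrices `W`, `U`, `E`, the `tanh` update, and every function name are assumptions made up for the example. What it shows is where information is discretized: the latent loop decodes once at the end, while the token loop collapses the state to a single token embedding after each step.

```python
import math
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def step(W, h):
    """One toy 'thinking' step: a recurrent update on the hidden state."""
    return [math.tanh(v) for v in matvec(W, h)]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def think_in_latents(W, U, h0, n_steps):
    """Iterate on the hidden state; decode to a token only once, at the end."""
    h = h0
    for _ in range(n_steps):
        h = step(W, h)
    return argmax(matvec(U, h))

def think_in_tokens(W, U, E, h0, n_steps):
    """Decode a token after every step and feed back only its embedding."""
    h = h0
    tokens = []
    for _ in range(n_steps):
        h = step(W, h)
        tok = argmax(matvec(U, h))
        tokens.append(tok)
        h = E[tok]  # everything not captured by the chosen token is dropped
    return tokens

# Hypothetical toy sizes and random weights, seeded for repeatability.
random.seed(0)
HIDDEN, VOCAB = 4, 5
W = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
U = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
E = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
h0 = [0.5, -0.2, 0.1, 0.9]

answer = think_in_latents(W, U, h0, n_steps=3)   # one token id
trace = think_in_tokens(W, U, E, h0, n_steps=3)  # one token id per step
```

Raising `n_steps` in `think_in_latents` spends more compute without producing any extra visible tokens, which is the sense in which these systems scale test-time compute in the hidden state.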
There's striking interpretability evidence that the underneath is where the real work lives. Logit-lens analysis of models trained with hidden chain-of-thought shows the correct answer is computed in layers 1–3, then actively suppressed in the final layers so the model can emit format-compliant filler, and the reasoning is still recoverable from the lower-ranked predictions (see "Do transformers hide reasoning before producing filler tokens?"). Relatedly, models trained on systematically irrelevant reasoning traces stay just as accurate, and sometimes generalize better out of distribution, implying that visible traces often function as computational scaffolding rather than meaningful steps (see "Do reasoning traces need to be semantically correct?"). If the token stream can be wrong and the answer survives, the token stream isn't where the reasoning is.
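For readers unfamiliar with the logit lens, the idea is to project each layer's hidden state through the output (unembedding) matrix and see which token the model "would say" at that depth. The sketch below is a simplified version under stated assumptions: the hidden states and the identity unembedding are hand-made toy values, not data from any real model, and the final layer norm that a real logit lens usually applies first is skipped. Token ids `ANSWER` and `FILLER` are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden_by_layer, unembed):
    """Project every layer's hidden state through the unembedding matrix."""
    per_layer = []
    for h in hidden_by_layer:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        per_layer.append(softmax(logits))
    return per_layer

def rank_of(probs, token_id):
    """Rank of token_id in the distribution (0 means top prediction)."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    return order.index(token_id)

# Toy setup: 3-token vocabulary, identity unembedding (an assumption).
FILLER, OTHER, ANSWER = 0, 1, 2
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hidden_by_layer = [
    [0.1, 0.0, 2.0],  # early layer: answer direction dominates
    [2.0, 0.0, 1.5],  # later layer: filler direction overtakes it
    [3.0, 0.0, 1.0],  # final layer: filler is emitted
]

probs = logit_lens(hidden_by_layer, unembed)
answer_ranks = [rank_of(p, ANSWER) for p in probs]
top_tokens = [rank_of(p, FILLER) == 0 for p in probs]
```

In this toy, the answer is the top prediction at the first layer and drops to second place once the filler takes over, which mirrors the "recoverable from lower-ranked predictions" pattern described above.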
But the corpus also keeps you honest: the picture is "latent is more efficient," not "tokens are useless." Quiet-STaR trains reasoning purely at the token level, generating a rationale at every position on arbitrary text, and gets general reasoning as a side effect of better language modeling (see "Can models learn reasoning from predicting any text?"). And within token-level training, the signal turns out to be concentrated: only about 20% of tokens are high-entropy "forking points," and training on just those matches or exceeds full updates (see "Do high-entropy tokens drive reasoning model improvements?"), while models internally rank tokens by functional importance and preserve the symbolic-computation ones first (see "Which tokens in reasoning chains actually matter most?"). Read together, these reframe the whole question: token-level learning works but wastes most of its budget on tokens that carry no reasoning signal, whereas latent prediction targets the correlated structure directly. The efficiency gap isn't magic; latents simply skip the noise that tokens force you to wade through.
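The "forking points" idea can be made concrete with a small sketch: compute the entropy of the model's next-token distribution at each position, keep only the highest-entropy fraction, and average the training signal over those positions alone. This is an illustration under assumptions, not the procedure from the cited work: the function names, the generic per-token loss, and the toy distributions are all invented here, and the source describes the effect in the RLVR setting rather than for an arbitrary loss.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_mask(dists, keep_frac=0.2):
    """Mark the keep_frac of positions with the highest entropy."""
    ents = [entropy(d) for d in dists]
    k = max(1, round(keep_frac * len(ents)))
    top = set(sorted(range(len(ents)), key=lambda i: -ents[i])[:k])
    return [i in top for i in range(len(ents))]

def masked_loss(token_losses, mask):
    """Average the per-token loss over the selected positions only."""
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    return sum(kept) / len(kept)

# Toy next-token distributions for five positions (made-up numbers).
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident continuation
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.70, 0.10, 0.10, 0.10],
]
token_losses = [0.1, 2.0, 0.3, 0.05, 0.8]  # hypothetical per-token losses

mask = forking_mask(dists, keep_frac=0.2)
loss = masked_loss(token_losses, mask)
```

With five positions and `keep_frac=0.2`, only the uniform distribution at index 1 is selected, so the update is driven entirely by that one uncertain position.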
One more piece worth having: a chunk of this advantage may be that base models already *contain* the reasoning, and the real bottleneck is elicitation, not acquisition. Five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all surface reasoning latent in base activations (see "Do base models already contain hidden reasoning ability?"). If reasoning lives in the latent space to begin with, predicting latents isn't a clever trick; it's just talking to the model in the language it already thinks in.
Sources (9 notes)
A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.