Why is latent-level prediction more sample-efficient than token-level prediction?
This explores why learning to predict a model's own internal representations (latents) — the data2vec/JEPA family — needs far fewer training examples than learning to predict the next token, and what that gap tells us about how structure is acquired.
The corpus has a sharp, formal answer to this question, plus several papers that triangulate the intuition from other angles. The core result (see "Why is predicting latents more sample-efficient than tokens?") is a sample-complexity proof: when data has compositional hierarchy, same-level latent representations are far more correlated with each other than raw tokens are. Because the learning signal lives in those correlations, latent self-supervision recovers the hierarchy with a number of samples that stays roughly constant in the depth of the hierarchy, while token-level learning pays an exponential cost as depth grows. The short version: tokens are a noisy, high-entropy surface; latents are a smoothed, abstracted layer where the underlying structure is already half-exposed, so each example teaches more.
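To make the two objectives concrete, here is a minimal sketch. It assumes a softmax cross-entropy for the token-level loss and a mean-squared-error regression onto an EMA teacher's latent for the latent-level loss, in the general style of data2vec/JEPA. The function names, the decay value, and the use of plain Python lists are illustrative assumptions, not details taken from the analysis; the actual methods use their own networks and loss variants.

```python
import math


def token_level_loss(logits, target_id):
    """Cross-entropy for one next-token prediction over a vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def latent_level_loss(predicted, target):
    """Mean squared error between a predicted latent and a teacher latent.

    MSE is a stand-in here; published methods may use other regression losses.
    """
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def ema_update(teacher, student, decay=0.99):
    """Move teacher weights a small step toward the student weights."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]
```

The token loss scores a single discrete target, while the latent loss compares whole vectors, which is where correlations between same-level representations can enter the learning signal.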
This matters because several other notes approach the same idea from different directions. The Large Concept Model (see "Can reasoning happen at the sentence level instead of tokens?") moves reasoning up to sentence-level embeddings in a language-agnostic space, abstracting away from tokens entirely, and gets more coherent output with hierarchical planning. Latent-Thought Language Models (see "Can latent thought vectors scale language models beyond parameters?") report the effect the proof predicts: superior sample and parameter efficiency, achieved by learning fast over local latent variables and slow over the global decoder. Both are, in effect, cashing in the same correlation-rich-latent advantage that the proof formalizes.
There is a complementary clue in the work on which tokens actually carry signal. RLVR research finds that only ~20% of tokens are high-entropy "forking points," and that training on just those matches full updates (see "Do high-entropy tokens drive reasoning model improvements?"). A separate pruning study shows that models rank tokens by functional importance, preserving symbolic-computation tokens first (see "Which tokens in reasoning chains actually matter most?"). Read alongside the latent result, these findings suggest the token stream is mostly redundant filler around a few load-bearing decisions. That is why token-level prediction is wasteful: most of the examples teach grammar and predictable continuation, not structure.
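The idea of training only on high-entropy positions can be sketched as follows. This is a minimal illustration, assuming per-token predictive distributions are already available as lists of probabilities; the function names and the `keep_fraction` parameter are hypothetical, and the 0.2 default only mirrors the ~20% figure quoted above rather than the study's actual selection procedure.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_mask(token_distributions, keep_fraction=0.2):
    """Mark the highest-entropy positions; everything else is masked out."""
    entropies = [entropy(p) for p in token_distributions]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]


def masked_mean_loss(per_token_losses, mask):
    """Average the loss over the kept positions only."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept)
```

For example, with five positions of which only one has a near-uniform distribution, the mask keeps that single position and the averaged loss ignores the other four.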
The Byte Latent Transformer (see "Can byte-level models match tokenized performance with better efficiency?") makes the same bet at the input end: segment bytes by entropy and spend more compute where the surface is unpredictable, less where it is not. MobileLLM's finding that depth beats width (see "Does depth matter more than width for tiny language models?") is the architectural echo: composing abstract concepts through layers, which builds a latent hierarchy, outperforms spreading raw capacity across width.
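Entropy-based segmentation of the kind described for the Byte Latent Transformer can be sketched roughly as below. This is a simplified illustration under stated assumptions: the per-position next-byte entropies are passed in as a plain list (in practice they would come from a small byte-level model), and the single global `threshold` rule and the function name `entropy_patches` are hypothetical choices, not the paper's exact procedure.

```python
def entropy_patches(data, next_byte_entropy, threshold):
    """Split bytes into patches, starting a new patch at high-entropy positions.

    data: a bytes object.
    next_byte_entropy: one entropy estimate per byte position (assumed given).
    threshold: entropy above which a new patch begins at that position.
    """
    if len(data) != len(next_byte_entropy):
        raise ValueError("need one entropy value per byte")
    patches = []
    start = 0
    for i in range(1, len(data)):
        if next_byte_entropy[i] > threshold:
            patches.append(data[start:i])  # close the current patch
            start = i                      # the hard-to-predict byte opens a new one
    if data:
        patches.append(data[start:])
    return patches
```

Predictable stretches end up in long patches and unpredictable regions in short ones, so a model that spends a fixed amount of compute per patch spends more per byte where the surface is uncertain.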
The corpus also supplies a caveat: latent-space prediction isn't magic. When researchers asked whether LLMs actually run iterative procedures in latent space, they found that the models pattern-match memorized templates instead (see "Do large language models actually perform iterative optimization?"). So "reasoning in latent space" buys sample efficiency for recovering structure, but does not by itself confer genuine step-by-step computation. That is a useful boundary to keep in mind when the efficiency argument starts sounding like a free lunch.
Sources (8 notes)
A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponentially many samples, because same-level latents are far more correlated than raw tokens.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
The Byte Latent Transformer (BLT) dynamically segments bytes into patches based on next-byte entropy, allocating more compute to high-entropy regions and less to predictable ones. At 8B parameters, BLT matches tokenized baselines while reducing inference cost and improving robustness to typos and cross-lingual transfer.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.