Featured

Learn from your own latents and not from tokens: A sample-complexity theory

Daniel J. Korchinski, Alessandro Favero, Matthieu Wyart · arXiv:2605.27734v1

Recent work has challenged the token-centric paradigm that dominates both LLMs and diffusion models, asking whether networks learn more efficiently by predicting higher-level semantic representations than by reconstructing surface-level sequences. This paper gives that intuition a formal basis using a compositional grammar as data: it proves an exponential gap in sample complexity between token-level learning and latent prediction, a finding that bears on broader questions about how hierarchical structure guides learning efficiency in language. Yet the theoretical result sits in tension with empirical practice: if the benefits of latent prediction hold even without explicit multi-scale stacking, why do hierarchical variants like H-JEPA still show gains, and what does that gap between theory and implementation tell us about how neural networks actually discover compositional structure?

Abstract

Generative models, from diffusion models to large language models, achieve remarkable performance but at a cost in training data orders of magnitude larger than what biological learners require. An alternative paradigm has emerged in which networks are trained to predict their own latent representations of related views or masked regions, as in data2vec and JEPA, an idea related to predictive-coding accounts of the cortex. Despite strong empirical results, the theoretical understanding of these methods remains limited. Central questions include: by how much does latent prediction actually improve data efficiency? Is there a benefit to stacking such methods into multi-scale hierarchies? We answer both using as data a tractable probabilistic context-free grammar that captures the compositional structure of natural language and images. Such a grammar generates strings of visible tokens by recursively applying production rules along a tree of hidden symbols of depth $L$. For such data, supervised or token-level SSL require a number of samples exponential in $L$ to recover the latent tree; we prove that latent prediction achieves this with a number of samples constant in $L$, up to logarithmic factors. We confirm this bound with (i) a hierarchical clustering algorithm, (ii) an end-to-end neural network whose predictor-clusterer modules predict their own latents at each level via gradient descent, and (iii) the first sample-complexity analysis of data2vec, which we show implicitly performs hierarchical latent prediction. This suggests that explicit stacking such as H-JEPA is largely redundant.
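To make the data model in the abstract concrete, here is a minimal Python sketch of a depth-$L$ grammar that expands a tree of hidden symbols into a string of visible tokens. It is an illustration only, not the authors' construction: the function names (`make_grammar`, `sample_string`), the shared alphabet across levels, the fixed branching factor, the number of rules per symbol, and the uniform choice among rules are all assumptions introduced here. The sketch also does not enforce that different hidden symbols produce distinct child tuples.

```python
import random


def make_grammar(num_symbols, rules_per_symbol, branching, depth, rng):
    """Build one random rule table per level (all sizes are hypothetical).

    grammar[l][symbol] is a list of child tuples that `symbol` may expand into
    when going from level l to level l + 1.
    """
    grammar = []
    for _ in range(depth):
        level_rules = {}
        for symbol in range(num_symbols):
            level_rules[symbol] = [
                tuple(rng.randrange(num_symbols) for _ in range(branching))
                for _ in range(rules_per_symbol)
            ]
        grammar.append(level_rules)
    return grammar


def sample_string(grammar, root, rng):
    """Expand `root` level by level.

    Returns the visible tokens (the leaves) and the list of hidden levels
    above them, from the root down to the parents of the leaves.
    """
    levels = [[root]]
    for level_rules in grammar:
        next_level = []
        for symbol in levels[-1]:
            # Pick one production rule uniformly at random (an assumption).
            next_level.extend(rng.choice(level_rules[symbol]))
        levels.append(next_level)
    return levels[-1], levels[:-1]


rng = random.Random(0)
depth = 3  # plays the role of L in the abstract
grammar = make_grammar(num_symbols=4, rules_per_symbol=2, branching=2,
                       depth=depth, rng=rng)
tokens, hidden = sample_string(grammar, root=rng.randrange(4), rng=rng)
print("visible tokens:", tokens)
print("hidden levels: ", hidden)
```

With a branching factor of 2 and depth 3, each sample is a string of 8 visible tokens sitting under hidden levels of sizes 1, 2, and 4. A learner sees only the tokens; the sample-complexity question in the abstract is how many such strings are needed to recover the hidden levels.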

Adjacent research

Synthesis notes nearest this paper, framed as questions.

Can reasoning happen at the sentence level instead of tokens?

Can explicit stack tracking improve how transformers learn recursive syntax?

Can formal language pretraining make language models more efficient?

Lines of inquiry this paper opens

Not questions with answers — ways of approaching this research. Each opens a synthesized line of inquiry across the collection.

LLM Cognitive Limitations

Scaling, Sparsity & Data Trade-offs

Reasoning Model Failure Modes

Prompt Optimization And Context

Reasoning Model Quality & Training

Training Dynamics And Generalization

Do transformers learn generalizable algorithms or instance-based patterns?

Attention And Memory Mechanisms

What structural differences between diffusion and autoregressive models enable bidirectional prompting?
