Can a single regularizer prevent JEPA representation collapse?
JEPAs traditionally need complex loss stacks and auxiliary tricks to avoid collapse. Can a single Gaussian-distribution constraint on latent embeddings do the same stabilization work, and would that simplify training?
Joint-Embedding Predictive Architectures (JEPAs) learn world models in compact latent spaces, but existing methods are fragile — they rely on complex multi-term losses, exponential moving averages, pretrained encoders, or auxiliary supervision to avoid representation collapse (the degenerate solution where the encoder maps everything to a constant). The engineering needed to keep them stable is itself the barrier.
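To make the collapse failure concrete, here is a minimal NumPy sketch showing that a prediction-only objective is trivially minimized by a constant encoder. The toy data, the encoders, and the identity predictor are all hypothetical stand-ins for illustration; none of them come from LeWM or any specific JEPA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "observations": 256 consecutive frame pairs (x_t, x_next), 8 features each.
x_t = rng.normal(size=(256, 8))
x_next = x_t + 0.1 * rng.normal(size=(256, 8))


def prediction_loss(encode, predict):
    """Mean squared error between predicted and actual next embeddings."""
    z_t, z_next = encode(x_t), encode(x_next)
    return float(np.mean((predict(z_t) - z_next) ** 2))


def collapsed_encoder(x):
    """Collapsed encoder: ignores its input and always emits the same vector."""
    return np.zeros((x.shape[0], 4))


# Informative (hypothetical) encoder: a fixed random linear projection.
W = rng.normal(size=(8, 4))


def linear_encoder(x):
    return x @ W


def identity_predictor(z):
    """Trivial predictor: assumes the next embedding equals the current one."""
    return z


collapsed = prediction_loss(collapsed_encoder, identity_predictor)
informative = prediction_loss(linear_encoder, identity_predictor)
print(f"collapsed encoder loss:   {collapsed:.4f}")
print(f"informative encoder loss: {informative:.4f}")
```

The collapsed encoder reaches exactly zero prediction loss while carrying no information about the input, which is why a prediction loss on its own needs some additional mechanism to rule that solution out.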
LeWorldModel (LeWM) is the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. That single regularizer does the anti-collapse work that the usual stack of tricks did, cutting tunable loss hyperparameters from six to one. The payoff is practical: the model has 15M parameters, trains on one GPU in hours, plans up to 48× faster than foundation-model-based world models, and is competitive across 2D and 3D control tasks. The latent space also encodes meaningful physical structure: probing recovers physical quantities, and the model reliably flags physically implausible events.
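The shape of such a two-term objective can be sketched as follows. This is an illustration under stated assumptions, not LeWM's implementation: the note does not say how the Gaussian constraint is measured, so the sketch swaps in a simple moment-matching penalty (zero mean, identity covariance), and the names `gaussian_moment_penalty`, `two_term_loss`, and `lam` are hypothetical.

```python
import numpy as np


def gaussian_moment_penalty(z):
    """Zero when the batch has zero mean and identity covariance.

    Assumed stand-in for a Gaussian-latent regularizer: it matches only the
    first and second moments of a standard Gaussian, nothing more.
    """
    mean = z.mean(axis=0)
    centered = z - mean
    cov = centered.T @ centered / (z.shape[0] - 1)
    eye = np.eye(z.shape[1])
    return float(np.sum(mean ** 2) + np.sum((cov - eye) ** 2))


def two_term_loss(z_pred, z_next, z_all, lam=1.0):
    """Prediction loss plus lam times the regularizer; lam is the single weight."""
    pred = float(np.mean((z_pred - z_next) ** 2))
    reg = gaussian_moment_penalty(z_all)
    return pred + lam * reg, pred, reg


rng = np.random.default_rng(0)
batch, dim = 512, 4

# Collapsed embeddings: perfect prediction, but the regularizer penalizes them.
z_collapsed = np.zeros((batch, dim))
total_c, pred_c, reg_c = two_term_loss(z_collapsed, z_collapsed, z_collapsed)

# Spread-out embeddings with small prediction error.
z_spread = rng.normal(size=(batch, dim))
z_spread_next = z_spread + 0.1 * rng.normal(size=(batch, dim))
total_s, pred_s, reg_s = two_term_loss(z_spread, z_spread_next, z_spread)

print(f"collapsed: pred={pred_c:.3f} reg={reg_c:.3f} total={total_c:.3f}")
print(f"spread:    pred={pred_s:.3f} reg={reg_s:.3f} total={total_s:.3f}")
```

In this toy setting the collapsed embeddings still get zero prediction loss, but the penalty equals the latent dimension because the covariance is zero instead of identity, so the combined objective prefers the spread-out embeddings. The weight `lam` plays the role of the one remaining loss hyperparameter.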
The general lesson is that simplicity in the self-supervised objective can replace brittle engineering: collapse is prevented by an explicit distributional constraint rather than by carefully balanced auxiliary terms. This is the practical face of the related note "Why is predicting latents more sample-efficient than tokens?": the theory says latent prediction is the efficient target, and LeWM shows the missing piece was a principled way to keep those latents non-degenerate.
Related concepts in this collection 3
- Why is predicting latents more sample-efficient than tokens?
  Explores whether learning from a network's own abstract representations requires far fewer training samples than learning from raw tokens, and what mechanism drives this efficiency gap.
  the sample-complexity theory LeWM instantiates
- Can unlabeled UI video teach models what users intend?
  Can temporal masking on screen recordings learn task-aware representations without paired text labels? This matters because labeled UI video is scarce and expensive, so self-supervised learning could unlock scaling.
  another applied JEPA; both exploit predictive latent self-supervision
- Can recurrent hierarchies achieve reasoning that transformers cannot?
  Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.
  adjacent use of compact latent dynamics as the substrate for planning/reasoning
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
- Everything Everywhere All At Once: LLMs Can In-context Learn Multiple Tasks In Superposition
- Thinkless: LLM Learns When to Think
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
- Learn from your own latents and not from tokens: A sample-complexity theory
Original note title
a single Gaussian-latent regularizer prevents JEPA representation collapse replacing the fragile stack of EMAs stop-gradients and auxiliary losses