Reasoning to Learn from Latent Thoughts

Paper · arXiv 2503.18866 · Published March 24, 2025
Tags: Data · Reasoning Methods · CoT · ToT · Cognitive Models · Latent

Human-written text is the culmination of an underlying thought process—when we write, there is often an internal dialogue that clarifies or even determines the written word. However, modern language models (LMs) (Radford et al., 2019; Brown et al., 2020; OpenAI, 2023; Dubey et al., 2024) are pretrained directly on the final results of this process in a highly compressed form (such as research papers). This may explain why LMs struggle with data efficiency and require a large portion of the entire human-written web to learn (Kaplan et al., 2020; Hoffmann et al., 2022). Since the rate of growth in pretraining compute is far greater than that of the web itself (Villalobos et al., 2022; Muennighoff et al., 2024), we may soon enter a data-constrained regime, motivating data efficiency approaches to extract more capabilities from limited web data.

In contrast to LMs, humans learn very efficiently from the same compressed text, which suggests that data-efficient pretraining can be improved significantly. In this work, we focus on how humans learn as one potential cause of this gap. For example, when we read a research paper, we analyze specific claims, integrate them with prior knowledge, and attempt to “decompress” the author’s original thought process. In other words, we use reasoning in service of learning: we infer the internal dialogue that undergirds the observed text. We refer to this procedure of augmenting the observed data with inferred, decompressed thoughts to enable more efficient learning as reasoning to learn.

Inspired by this, we introduce an LM pretraining approach that implements this reasoning-to-learn paradigm to improve data efficiency (Fig. 1). Specifically, we take a latent-variable perspective on language modeling, in which the observed data X depends on underlying latent thoughts Z. We train our LMs on the observed data X augmented with the latents Z, thereby modeling the joint distribution p(Z, X). The main challenge is synthesizing (and learning to synthesize) Z with a latent thought generator q(Z | X) (Fig. 2a).
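To make the latent-variable view concrete, here is a minimal sketch of the objective it implies; the notation and derivation below are ours, not quoted from the paper:

```latex
% Observed text X, latent thoughts Z, LM parameters \theta.
% For any latent thought generator q(Z | X), Jensen's inequality gives a
% variational lower bound on the marginal log-likelihood of the observed data:
\log p_\theta(X)
  = \log \sum_{Z} p_\theta(Z, X)
  \;\geq\; \mathbb{E}_{Z \sim q(Z \mid X)}
      \bigl[ \log p_\theta(Z, X) - \log q(Z \mid X) \bigr].
% The entropy term of q does not depend on \theta, so training the LM on
% pairs (Z, X) sampled from q -- i.e., maximizing E_q[log p_\theta(Z, X)] --
% tightens this bound on the likelihood of the observed text.
```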

One key insight of our work is that for a natural-language latent thought Z, the LM itself provides a strong prior for producing latent thoughts, via its reasoning and theory-of-mind abilities (Wei et al., 2022b; Kojima et al., 2022). This observation turns latent thought inference into a synthetic data generation problem and has significant practical benefits: it allows us to leverage the strong capabilities of existing LMs, to share weights between the LM and the latent thought generator, and to implement training as a small modification to the standard pretraining pipeline (Fig. 2b).
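The sketch below illustrates this synthetic-data view under stated assumptions: the helper names, prompt wording, and the `<thought>` delimiter are our own illustrative choices, and `lm.generate` stands in for whatever text-generation interface the model exposes; none of this is the paper's exact pipeline.

```python
# Sketch of latent-thought data augmentation (helper names, prompt format,
# and the lm.generate interface are assumptions for illustration).

def synthesize_latent_thought(lm, text_chunk: str) -> str:
    """Use the LM's own reasoning ability as q(Z | X): prompt it to
    reconstruct the implicit reasoning behind an observed chunk of text."""
    prompt = (
        "Here is a passage:\n" + text_chunk +
        "\n\nWrite out the background knowledge and step-by-step reasoning "
        "an expert might have used while producing this passage."
    )
    return lm.generate(prompt)  # assumed text-in / text-out interface

def build_augmented_example(lm, text_chunk: str) -> str:
    """Form a training sequence that models the joint p(Z, X): latent
    thought first, then the original text, so the standard next-token
    loss covers both."""
    z = synthesize_latent_thought(lm, text_chunk)
    return "<thought>\n" + z + "\n</thought>\n" + text_chunk
```

Because the same weights can serve as both the generator q(Z | X) and the model being pretrained, this amounts to a change in the data pipeline rather than a new architecture or training objective.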

We show that training a model with latent thoughts enables it to produce higher-quality latent thoughts, allowing a model to bootstrap its “reasoning to learn” ability with only a small amount of initial supervision. We demonstrate this through a simple Expectation- Maximization based approach which we refer to as Bootstrapping Latent Thoughts (BoLT) that enables an iterative improvement of the latent thought generator (Fig. 5). Importantly, we show that BoLT can take advantage of additional inference compute to further improve data efficiency. In particular, the E-step in BoLT makes use of a Monte-Carlo estimator that serves as a non-parametric “policy improvement operator”, where the approximate posterior q(Z | X) approaches the true posterior as the number of samples increases. We find in our experiments that BoLT is able to take advantage of additional samples (at least four) to improve its data efficiency and bootstrap its performance for at least three iterations, opening the possibility of new ways of scaling pretraining data efficiency.
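The following is a minimal sketch of how such a Monte Carlo E-step could be rendered in code; it reuses the hypothetical synthesize_latent_thought helper from the earlier sketch, and the lm.log_prob and lm.train interfaces, the self-normalized weighting, and the resampling step are our assumptions rather than the paper's exact estimator.

```python
import math
import random

def bolt_e_step(lm, text_chunk: str, num_samples: int = 4) -> str:
    """Monte Carlo E-step sketch: draw several candidate latent thoughts from
    the current model and keep one with probability proportional to how well
    it explains the observed text (weights ~ p(X | Z)). With more samples,
    the selected thoughts approximate the true posterior more closely."""
    candidates = [synthesize_latent_thought(lm, text_chunk)
                  for _ in range(num_samples)]
    # lm.log_prob(text, context) is an assumed interface returning log p(X | Z).
    log_w = [lm.log_prob(text_chunk, context=z) for z in candidates]
    m = max(log_w)
    weights = [math.exp(lw - m) for lw in log_w]  # self-normalized weights
    return random.choices(candidates, weights=weights, k=1)[0]

def bolt_iteration(lm, corpus):
    """One EM round: the E-step selects latent thoughts for each chunk, the
    M-step continues pretraining on the (Z, X) pairs, which in turn improves
    the latent thought generator used in the next round."""
    augmented = ["<thought>\n" + bolt_e_step(lm, x) + "\n</thought>\n" + x
                 for x in corpus]
    lm.train(augmented)  # assumed training call
    return lm
```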