Scalable Language Models with Posterior Inference of Latent Thought Vectors

Paper · arXiv 2502.01567 · Published February 3, 2025

We propose a novel family of language models, Latent-Thought Language Models (LTMs), which incorporate explicit latent thought vectors that follow an explicit prior model in latent space. These latent thought vectors guide the autoregressive generation of ground tokens through a Transformer decoder. Training employs a dual-rate optimization process within the classical variational Bayes framework: fast learning of local variational parameters for the posterior distribution of latent vectors, and slow learning of global decoder parameters. Empirical studies reveal that LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space. Higher sample efficiency can be achieved by increasing training compute per token, with further gains possible by trading model size for more inference steps. Designed based on these scaling properties, LTMs demonstrate superior sample and parameter efficiency compared to conventional autoregressive models and discrete diffusion models. They significantly outperform these counterparts in validation perplexity and zero-shot language modeling. Additionally, LTMs exhibit emergent few-shot in-context reasoning capabilities that scale with model and latent size, and achieve competitive performance in conditional and unconditional text generation.
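To make the training objective concrete, the sketch below writes out a standard evidence lower bound (ELBO) of the kind used in the classical variational Bayes framework the paper builds on. For illustration it assumes a single latent thought vector z per sequence with prior p(z), an autoregressive decoder with global parameters β, and per-sequence (local) variational parameters φ; the paper's exact prior and posterior parameterization may differ.

```latex
\log p_\beta(x) \;\ge\; \mathcal{L}(\phi, \beta)
  \;=\; \mathbb{E}_{q_\phi(z)}\!\left[\sum_{t=1}^{T} \log p_\beta\big(x_t \mid x_{<t},\, z\big)\right]
  \;-\; \mathrm{KL}\!\left(q_\phi(z) \,\|\, p(z)\right)
```

Under this reading, "fast learning" maximizes the bound over the local parameters φ for each sequence, while "slow learning" ascends the same objective with respect to the global decoder parameters β.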

However, recent observations of diminishing returns from scaling in gigantic models have led researchers to explore alternative scaling dimensions (Snell et al., 2024). This motivates us to explore a novel family of language models that opens up such additional dimensions. We propose Latent-Thought Language Models (LTMs), which incorporate explicit latent thought vectors that follow an explicit prior model in the latent space. These latent vectors control the generation of every token in the sequence by an autoregressive Transformer decoder (Vaswani et al., 2017), effectively forming an abstract representation of the entire sequence. LTMs are trained within the classical variational Bayes framework (Jordan et al., 1999; Blei et al., 2017; Murphy, 2012) using a dual-rate optimization process: fast learning of the local variational parameters that define the posterior distribution over the latent thought vectors, coupled with slow learning of the global decoder parameters. This approach enables rapid adaptation to specific inputs while gradually accumulating general linguistic knowledge.
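The following is a minimal, self-contained sketch of what such dual-rate optimization could look like in PyTorch. The toy decoder, latent dimensions, learning rates, and number of fast steps are illustrative assumptions rather than the paper's actual architecture or hyperparameters; the point is only the structure of the loop: an inner phase that quickly fits the per-sequence posterior parameters, followed by a single slow update of the shared decoder.

```python
import torch
import torch.nn as nn

# Hypothetical toy decoder standing in for the Transformer decoder p_beta(x | z).
# It predicts the next token from the current token embedding plus z (no attention;
# purely illustrative, not the paper's architecture).
class ToyDecoder(nn.Module):
    def __init__(self, vocab_size=100, d_model=64, z_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.z_proj = nn.Linear(z_dim, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, z):
        # Condition every token position on the latent thought vector z.
        h = self.embed(tokens) + self.z_proj(z).unsqueeze(1)
        return self.out(h)

def elbo(decoder, tokens, mu, log_var):
    # Reparameterized sample from the Gaussian posterior q(z) = N(mu, diag(exp(log_var))).
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
    logits = decoder(tokens[:, :-1], z)
    # Next-token reconstruction term, a stand-in for E_q[log p_beta(x | z)].
    rec = -nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1), reduction="sum"
    )
    # KL(q(z) || p(z)) against a standard-normal prior (a common explicit-prior choice).
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum()
    return rec - kl

vocab_size, z_dim = 100, 32
decoder = ToyDecoder(vocab_size=vocab_size, z_dim=z_dim)
slow_opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)  # slow: global decoder params

tokens = torch.randint(0, vocab_size, (8, 16))              # toy batch of token sequences

# Local variational parameters for this batch: one Gaussian posterior per sequence.
mu = torch.zeros(8, z_dim, requires_grad=True)
log_var = torch.zeros(8, z_dim, requires_grad=True)
fast_opt = torch.optim.Adam([mu, log_var], lr=1e-2)         # fast: local posterior params

# Fast inner loop: several posterior-inference steps on the local parameters only.
for _ in range(16):
    fast_opt.zero_grad()
    (-elbo(decoder, tokens, mu, log_var)).backward()
    fast_opt.step()

# Slow outer step: one gradient update of the global decoder parameters.
slow_opt.zero_grad()
(-elbo(decoder, tokens, mu, log_var)).backward()
slow_opt.step()
```

In this sketch, spending more compute on the fast inner loop is the knob that corresponds to the "more inference steps" dimension mentioned in the abstract, since the local posterior is refined without growing the decoder.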

LTMs’ architecture and learning scheme are inspired by established cognitive models. Within the framework of the declarative-procedural model (Ullman, 2004), the latent thought vectors and local variational parameters parallel declarative (episodic) memory, while the global decoder parameters correspond to procedural memory. The dual-rate learning scheme reflects the interplay between fast episodic learning and slow schematic learning in human cognition (Kumaran et al., 2016). Furthermore, in the context of the language-of-thought hypothesis (Fodor, 1975), the latent thought vectors can be interpreted as “words” of a language of thought.