
Can latent thought vectors scale language models beyond parameters?

Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.

Note · 2026-02-23 · sourced from Cognitive Models Latent

Latent-Thought Language Models (LTMs) propose a scaling strategy different from adding parameters or lengthening context: explicit latent thought vectors that follow a prior model in latent space and guide autoregressive token generation. This opens additional scaling dimensions: higher sample efficiency from spending more training compute per token, with further gains from trading model size for more inference steps.
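To make that setup concrete, here is one plausible way to write the generative model and training objective the note describes; the factorization and the symbols (α for the prior, β for the decoder, φ for the per-sequence variational parameters) are illustrative notation, not taken from the paper.

```latex
% Sketch of the assumed generative model: a sequence-level latent thought
% vector z is drawn from a prior model, and every token is generated
% conditioned on it (notation is illustrative, not the paper's).
p_{\alpha,\beta}(x, z) \;=\; p_{\alpha}(z)\,\prod_{t=1}^{T} p_{\beta}(x_t \mid x_{<t}, z)

% Variational Bayes training maximizes the evidence lower bound:
\log p(x) \;\ge\;
\mathbb{E}_{q_{\phi}(z \mid x)}\!\Big[\textstyle\sum_{t} \log p_{\beta}(x_t \mid x_{<t}, z)\Big]
\;-\; \mathrm{KL}\big(q_{\phi}(z \mid x)\,\|\,p_{\alpha}(z)\big)
```

In this notation, fast learning means optimizing the local parameters φ for each sequence, while slow learning means gradient updates to the global decoder parameters β (and the prior α).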

Architecture. Latent thought vectors are an abstract representation of the entire sequence, conditioning the decoder's generation of each token. Training uses variational Bayes with a dual-rate process: fast learning of local variational parameters for the posterior distribution of latent vectors (adapting quickly to specific inputs) coupled with slow learning of global decoder parameters (gradually accumulating general knowledge).
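A minimal sketch of this dual-rate loop, assuming a toy GRU decoder, Gaussian posteriors, and a standard-normal prior; all of these are illustrative choices, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SEQ_LEN, LATENT_DIM, EMB = 100, 16, 32, 64

class ToyDecoder(nn.Module):
    """Autoregressive decoder whose every step is conditioned on a
    sequence-level latent thought vector z (slow-learned global weights)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.z_proj = nn.Linear(LATENT_DIM, EMB)
        self.rnn = nn.GRU(EMB, EMB, batch_first=True)
        self.out = nn.Linear(EMB, VOCAB)

    def forward(self, tokens, z):
        # inject the sequence-level latent into every step's input
        h = self.embed(tokens) + self.z_proj(z).unsqueeze(1)
        h, _ = self.rnn(h)
        return self.out(h)

def neg_elbo(decoder, tokens, mu, logvar):
    """Negative ELBO: reconstruction under q(z|x) plus KL to a N(0, I) prior
    (the paper uses a learned prior model; N(0, I) is a simplification)."""
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterised sample
    logits = decoder(tokens[:, :-1], z)
    rec = F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    return rec + kl

decoder = ToyDecoder()
slow_opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)   # slow: global knowledge

for step in range(100):                                      # toy training loop
    tokens = torch.randint(0, VOCAB, (8, SEQ_LEN))           # stand-in for real data
    # Fast phase: local variational parameters, re-initialised for this batch
    # and adapted with a handful of inference steps.
    mu = torch.zeros(8, LATENT_DIM, requires_grad=True)
    logvar = torch.zeros(8, LATENT_DIM, requires_grad=True)
    fast_opt = torch.optim.Adam([mu, logvar], lr=1e-2)       # fast: local adaptation
    for _ in range(16):
        fast_opt.zero_grad()
        neg_elbo(decoder, tokens, mu, logvar).backward()
        fast_opt.step()
    # Slow phase: a single decoder update using the adapted posterior.
    slow_opt.zero_grad()
    neg_elbo(decoder, tokens, mu, logvar).backward()
    slow_opt.step()
```

The inner loop is also where the extra scaling knobs live: more fast steps mean more training (or inference) compute per token, and a wider latent vector is a scaling axis independent of decoder size.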

Cognitive inspiration. The dual-rate scheme parallels established cognitive models of fast and slow learning: rapid, context-specific adaptation (the fast local posterior updates) operating alongside slow, gradual accumulation of general knowledge (the global decoder parameters). The same fast-slow distinction reappears in the Titans comparison below.

Scaling properties. LTMs demonstrate superior sample and parameter efficiency compared to conventional autoregressive models and discrete diffusion models, with substantially better validation perplexity and zero-shot language modeling performance. Emergent few-shot in-context reasoning scales with both model size and latent size, giving two independent scaling dimensions.

The connection to existing latent reasoning approaches is close, but the mechanisms are distinct. Can models reason without generating visible thinking tokens? describes depth-recurrent architectures that iterate in latent space at inference time. LTMs use latent vectors differently, as sequence-level abstractions that guide token generation rather than per-token iterative computation. Dual-rate learning also gives LTMs a training-time mechanism that depth recurrence lacks.

The Titans parallel is also notable: Can neural memory modules scale language models beyond attention limits? separates fast attention (short-term) from slow memory (long-term). LTMs separate fast local adaptation from slow global learning. Both architectures implement the fast-slow cognitive distinction but at different levels — Titans for memory, LTMs for generation.


Source: Cognitive Models Latent

Original note title: Latent-thought language models introduce additional scaling dimensions beyond parameters by incorporating explicit latent thought vectors with dual-rate learning.