INQUIRING LINE

Can models consolidate context into weights during idle offline phases?

This explores whether an LLM can take what it has read in its context window and 'bake' it into its actual parameters during downtime — the way sleep is thought to consolidate memory in brains — rather than just holding everything in active attention.


The corpus says yes, and it reframes what the real bottleneck is. One line of work argues that the limit on long context is not memory capacity but *compute*: the cost lies in transforming evicted context into fast weights during offline 'sleep' phases. Performance keeps improving as more consolidation passes are run, a test-time scaling pattern that shows most on harder reasoning tasks (Is long-context bottleneck really about memory or compute?). So consolidation isn't a fixed step; it's a dial you can turn.
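To make the 'dial' concrete, here is a minimal sketch under loud assumptions: the fast weights are a single linear map, the evicted context is a handful of made-up key/value pairs, and consolidation is a plain delta-rule update. None of the names (`sleep`, `evicted`, `reconstruction_error`) come from the cited work, and this is not that work's method; it only illustrates recall error falling as more offline passes are spent.

```python
# Toy sketch: consolidate evicted (key, value) pairs into a fast-weight matrix W
# by repeated offline passes. All names and data here are hypothetical.

def predict(W, key):
    """Read from fast weights: value_hat = W @ key."""
    return [sum(w * k for w, k in zip(row, key)) for row in W]

def reconstruction_error(W, pairs):
    """Mean squared error of recalling each evicted value from its key."""
    total = 0.0
    for key, value in pairs:
        pred = predict(W, key)
        total += sum((p - v) ** 2 for p, v in zip(pred, value))
    return total / len(pairs)

def sleep(W, pairs, passes, lr=0.1):
    """Run `passes` offline sweeps of delta-rule updates over the evicted pairs."""
    for _ in range(passes):
        for key, value in pairs:
            pred = predict(W, key)
            for i, (p, v) in enumerate(zip(pred, value)):
                for j, k in enumerate(key):
                    W[i][j] -= lr * (p - v) * k
    return W

def fresh_weights(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Context that no longer fits in the window (made-up data).
evicted = [
    ([1.0, 0.0, 1.0], [1.0, 0.0]),
    ([0.0, 1.0, 1.0], [0.0, 1.0]),
    ([1.0, 1.0, 0.0], [1.0, 1.0]),
]

# Turn the dial: same data, more consolidation passes.
errors = {}
for passes in (0, 1, 5, 50):
    W = sleep(fresh_weights(2, 3), evicted, passes)
    errors[passes] = reconstruction_error(W, evicted)
    print(passes, round(errors[passes], 6))
```

In this toy setting the only thing that changes between runs is the number of passes, so any drop in error is bought purely with compute, not with extra memory.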

What makes this interesting is *why* you'd want context in the weights rather than the window. Models routinely ignore what's in front of them: when training-time associations are strong, parametric knowledge overrides in-context information, and prompting alone can't fix it; you need to intervene in the representations themselves (Why do language models ignore information in their context?). Consolidating context into weights is one way to make in-context information *win*, by moving it from the fragile attention buffer into the same substrate as the priors it competes with.

There's a whole adjacent family that compresses context without a literal offline phase. The Titans architecture splits short-term attention from a long-term neural memory module that adaptively memorizes 'surprising' tokens, scaling past two million tokens without the quadratic cost: consolidation as a continuous, surprise-gated write rather than a nightly batch (Can neural memory modules scale language models beyond attention limits?). Others sidestep the problem by managing what stays resident at all. Recursive subtask trees with rule-based KV cache pruning keep reasoning accurate even when 90% of the cache is manipulated (Can recursive subtask trees overcome context window limits?), and Markov-style reasoning deliberately forgets history so each step depends only on the current state (Can reasoning systems forget history without losing coherence?). These are the opposite bet from consolidation: instead of saving context, throw it away cleanly.
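The surprise-gated write can be sketched in a few lines. This is a deliberately simplified illustration, not the Titans update rule: it assumes a linear memory, measures surprise as squared prediction error, and uses a hard threshold. The names `stream_write`, `surprise`, and `threshold` are hypothetical.

```python
# Toy sketch of a surprise-gated memory write. Hypothetical names throughout;
# this shows only the gating idea, not the actual Titans update rule.

def recall(M, key):
    """Read from memory: value_hat = M @ key."""
    return [sum(m * k for m, k in zip(row, key)) for row in M]

def surprise(M, key, value):
    """Squared prediction error: how badly memory currently predicts this token."""
    return sum((p - v) ** 2 for p, v in zip(recall(M, key), value))

def stream_write(M, stream, threshold=0.05, lr=0.5):
    """Process tokens in order; write only those the memory finds surprising."""
    written = []
    for t, (key, value) in enumerate(stream):
        if surprise(M, key, value) > threshold:
            pred = recall(M, key)
            for i, (p, v) in enumerate(zip(pred, value)):
                for j, k in enumerate(key):
                    M[i][j] -= lr * (p - v) * k
            written.append(t)
    return written

M = [[0.0, 0.0], [0.0, 0.0]]
a = ([1.0, 0.0], [1.0, 0.0])  # a token that repeats
b = ([0.0, 1.0], [0.0, 1.0])  # a novel token arriving late
stream = [a, a, a, a, a, b, a, a]

written = stream_write(M, stream)
print(written)  # positions in the stream that triggered a write
```

The repeated token stops triggering writes once the memory predicts it well enough, while the late novel token triggers one immediately. Writes happen continuously as the stream arrives rather than in a separate batch phase.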

The sharpest contrast is whether you touch weights at all. AgentFly shows agents improving continually through *memory operations alone*: case, subtask, and tool memory drive credit assignment and policy improvement, reaching 87.88% on GAIA validation with the LLM's parameters frozen (Can agents learn continuously from experience without updating weights?). So the corpus stakes out two genuinely different answers to 'where should learned context live': in the weights (offline consolidation) or in an external memory store that never disturbs the weights at all. The less obvious takeaway: the long-context problem may not be a memory-size problem at all, but a question of how much compute you're willing to spend turning yesterday's reading into today's instincts.
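The memory-only alternative can be illustrated with a toy loop in which the decision function never changes and only a case store grows. This is a sketch under stated assumptions, not AgentFly's implementation: `CaseMemory`, `frozen_policy`, the word-overlap similarity, and the task strings are all hypothetical stand-ins.

```python
# Toy sketch: behaviour improves through memory writes while the "model" stays fixed.
# All names and data are hypothetical and not taken from AgentFly.

ACTIONS = ["search", "calculator", "browser"]

def similarity(a, b):
    """Jaccard overlap between the word sets of two task descriptions."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

class CaseMemory:
    def __init__(self):
        self.cases = []  # list of (task, action, reward)

    def write(self, task, action, reward):
        self.cases.append((task, action, reward))

    def similar(self, task, min_sim=0.5):
        """Return (action, reward) for every stored case close to this task."""
        return [(a, r) for t, a, r in self.cases if similarity(task, t) >= min_sim]

def frozen_policy(task, memory):
    """Stand-in for a frozen LLM: fixed logic, steered only by retrieved cases."""
    cases = memory.similar(task)
    for action, reward in cases:
        if reward > 0:
            return action  # reuse what worked on a similar task
    failed = {action for action, reward in cases if reward == 0}
    for action in ACTIONS:
        if action not in failed:
            return action  # avoid what already failed
    return ACTIONS[0]

def run(tasks, solution):
    memory = CaseMemory()
    rewards = []
    for task in tasks:
        action = frozen_policy(task, memory)
        reward = 1 if action == solution[task] else 0
        memory.write(task, action, reward)  # the only thing that ever changes
        rewards.append(reward)
    return rewards, memory

solution = {
    "sum the table column": "calculator",
    "sum the invoice column": "calculator",
}
tasks = ["sum the table column", "sum the table column", "sum the invoice column"]
rewards, memory = run(tasks, solution)
print(rewards)
```

The policy fails once, avoids the failed action on the retry, and then transfers the successful case to a similar but unseen task, all without a single parameter update.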


Sources (6 notes)

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
