How should memory consolidation timing differ across multiple timescales?

This explores when an AI system should compress, transfer, or rewrite what it remembers — and why that timing shouldn't be the same for fast-moving recent context as it is for long-term knowledge.

The corpus's strongest claim is that there is no single good moment to consolidate memory, because different memory layers live on different clocks. The clearest map comes from agent design, where memory splits by granularity: dialogue-level memory (the whole conversation, a running scratchpad) and turn-level memory (the current task trajectory, recent examples) decay and update at different rates, so each one calls for its own refresh policy rather than a single global one (see "How should agent memory split across time scales?"). A separate architecture makes the same cut at the model level: attention handles the fast, short-term window while a dedicated neural memory module handles slow, long-term storage, deciding what to keep by how surprising a token is (see "Can neural memory modules scale language models beyond attention limits?"). Two timescales, two mechanisms: that pairing keeps recurring.
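To make "its own refresh policy" concrete, here is a minimal sketch in which each memory layer carries its own refresh interval. The class name, the function name, and the intervals (every turn for turn-level memory, every five turns for dialogue-level memory) are illustrative assumptions, not values taken from the source notes.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryLayer:
    """One memory layer with its own refresh clock (all names are illustrative)."""
    name: str
    refresh_every: int                      # refresh after this many turns (assumed value)
    items: list = field(default_factory=list)
    last_refresh: int = 0

    def due(self, turn: int) -> bool:
        # A layer is due once its own interval has elapsed since its last refresh.
        return turn - self.last_refresh >= self.refresh_every


def refresh_due_layers(layers, turn):
    """Refresh only the layers whose own interval has elapsed; return their names."""
    refreshed = []
    for layer in layers:
        if layer.due(turn):
            layer.last_refresh = turn
            refreshed.append(layer.name)
    return refreshed


# Assumed intervals: turn-level refreshes every turn, dialogue-level every 5 turns.
layers = [
    MemoryLayer("turn_level", refresh_every=1),
    MemoryLayer("dialogue_level", refresh_every=5),
]

# Record which layers were refreshed on each of turns 1..10.
history = {turn: refresh_due_layers(layers, turn) for turn in range(1, 11)}
```

With these assumed intervals the dialogue-level layer is refreshed only on turns 5 and 10, while the turn-level layer is refreshed on every turn. There is no global clock anywhere in the sketch.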

The most useful thing you might not expect: consolidating *too eagerly* actively destroys memory. When an agent continuously rewrites its accumulated experience into tidy summaries, utility follows an inverted U: it improves, peaks, then degrades below the value of just keeping raw episodes. In the source note, GPT-5.4 failed 54% of the problems it had previously solved after consolidation. The failure modes are specific: misgrouping unrelated experiences, stripping away the conditions that made a lesson applicable, and overfitting to a narrow recent stream (see "Does agent memory degrade when continuously consolidated?"). That's the timing lesson stated negatively: fast, constant consolidation isn't a virtue. It's where memory rots.
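One way to guard against this failure is to keep the raw episodes and accept a consolidated summary only if it does not regress on problems that were already solved. The sketch below is an assumption about how such a check could look, not the method from the source note. `guarded_consolidate`, `summarize`, `solves`, and the toy data are all hypothetical names.

```python
def guarded_consolidate(raw_episodes, summarize, solves, solved_tasks, max_regression=0.0):
    """Return the summary if it passes a regression check, else the raw episodes.

    summarize: hypothetical callable that turns episodes into a compact memory.
    solves:    hypothetical callable (memory, task) -> bool.
    """
    candidate = summarize(raw_episodes)
    if not solved_tasks:
        return candidate
    failures = sum(1 for task in solved_tasks if not solves(candidate, task))
    regression = failures / len(solved_tasks)
    # Raw episodes are never discarded, so rejecting the summary loses nothing.
    return candidate if regression <= max_regression else raw_episodes


# Toy episodes: (condition under which the lesson applies, lesson).
episodes = [("python2", "use print statement"), ("python3", "use print function")]


def strip_conditions(eps):
    # Lossy summary: keeps lessons but drops the conditions that made them applicable.
    return [("any", lesson) for _, lesson in eps]


def keep_conditions(eps):
    # Conservative summary: deduplicates but keeps each lesson's condition.
    return sorted(set(eps))


def solves(memory, task):
    # A task counts as solved only if memory holds a lesson tagged with its condition.
    return any(condition == task for condition, _ in memory)


tasks = ["python2", "python3"]
lossy_result = guarded_consolidate(episodes, strip_conditions, solves, tasks)
safe_result = guarded_consolidate(episodes, keep_conditions, solves, tasks)
```

Here the lossy summarizer imitates applicability stripping, so its output is rejected and the raw episodes are returned. The condition-preserving summarizer passes the check and its output is kept.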

So when *should* the slow pass happen? One striking proposal borrows from biology: consolidation runs offline, during a 'sleep' phase, where recurrent passes with no new input transfer recent context into persistent fast weights, mirroring hippocampal replay (see "Can recurrence consolidate memory without predicting tokens?"). The point is that consolidation is decoupled from the moment-to-moment work of prediction, so it can be scheduled and given its own compute budget instead of competing with live inference. Agents that fold their own history into structured episodic/working/tool schemas show the same instinct: pausing to reconsider rather than rewriting on every step is what lets them compress without the degradation that wrecks naive consolidators (see "Can agents compress their own memory without losing critical details?").
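The scheduling idea, though not the fast-weight mechanism itself, can be sketched as a buffer that only accumulates during live work and is drained during a separate offline pass with a fixed budget. `OfflineConsolidator`, `observe`, `sleep`, and the budget of three items are hypothetical names and values chosen for illustration. A plain list append stands in for the learned consolidation rule.

```python
from collections import deque


class OfflineConsolidator:
    """Buffers recent context during live work and consolidates only when idle."""

    def __init__(self, budget_per_sleep: int):
        self.pending = deque()      # recent context awaiting consolidation
        self.long_term = []         # stand-in for persistent storage
        self.budget_per_sleep = budget_per_sleep

    def observe(self, item):
        # Live path: only append, never consolidate, so inference is not slowed.
        self.pending.append(item)

    def sleep(self):
        # Offline path: no new input arrives here; process at most `budget` items.
        processed = 0
        while self.pending and processed < self.budget_per_sleep:
            self.long_term.append(self.pending.popleft())
            processed += 1
        return processed


consolidator = OfflineConsolidator(budget_per_sleep=3)
for step in range(5):
    consolidator.observe(f"ctx-{step}")

first_pass = consolidator.sleep()    # drains up to the budget
second_pass = consolidator.sleep()   # drains whatever is left
```

Because `observe` does no consolidation work, the cost of the slow pass is paid entirely inside `sleep`, which can be scheduled whenever the system is idle.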

Underneath all of this sits a routing principle: decide *what* goes on the slow clock versus the fast clock, not just when to run each. Fast-Slow Training routes task-specific lessons into fast, optimized textual context (prompts) while keeping slow parameter updates minimal, and shows that catastrophic forgetting is a misallocation problem, not an unavoidable tax (see "Can splitting adaptation into two channels reduce forgetting?"). The older Wide & Deep intuition rhymes with it: give memorization (rare, specific cross-product features) and generalization (smooth embeddings) their own components and train them jointly, so each can specialize (see "Can one model memorize and generalize better than two?").
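As a toy illustration of routing, the sketch below follows the Fast-Slow Training source note, which describes task-specific lessons going into optimized prompts while parameter updates stay minimal. Everything else is an assumption made for illustration: the class and field names, the single scalar "weight", the learning rate, and the rule that non-task-specific signal takes a small gradient step.

```python
from dataclasses import dataclass, field


@dataclass
class TwoChannelLearner:
    """Toy learner with a fast textual channel and a slow parametric channel."""
    prompt_notes: list = field(default_factory=list)   # fast channel: editable text
    weight: float = 0.0                                # slow channel: one toy parameter
    slow_lr: float = 0.01                              # small step keeps weight changes minimal

    def learn(self, lesson: str, task_specific: bool, gradient: float = 0.0) -> str:
        if task_specific:
            # Task-specific lessons go to the prompt, leaving the weight untouched.
            self.prompt_notes.append(lesson)
            return "fast"
        # Assumed rule: everything else takes a small gradient step on the slow channel.
        self.weight -= self.slow_lr * gradient
        return "slow"


learner = TwoChannelLearner()
routes = [
    learner.learn("this API paginates with a cursor", task_specific=True),
    learner.learn("general formatting skill", task_specific=False, gradient=-1.0),
]
```

The task-specific lesson never touches the parameter, which is the sense in which forgetting becomes a question of allocation rather than an unavoidable cost of learning.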

Put together, the corpus's answer is less 'consolidate every N steps' and more a design discipline: separate timescales explicitly, run the slow consolidation offline and infrequently, route each kind of knowledge to the clock that fits it — and resist the temptation to over-consolidate, because the fast, greedy version of memory cleanup is the one that quietly erases what you wanted to keep.

Sources (7 notes)

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components organized at two levels of granularity: dialogue-level (conversation history, scratchpad) and turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Does agent memory degrade when continuously consolidated?

LLM-consolidated textual memory degrades as experience accumulates, eventually performing worse than episodic-only retention. GPT-5.4 failed 54% of previously-solved problems after consolidation, with three mechanisms identified: misgrouping, applicability stripping, and overfitting on narrow streams.

Can recurrence consolidate memory without predicting tokens?

Language models can use recurrent passes without input tokens to transfer recent context into persistent fast weights via learned local rules, mirroring hippocampal replay during biological sleep. This separates consolidation from prediction, enabling different scheduling and compute allocation.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can one model memorize and generalize better than two?

Wide & Deep models train memorization (cross-product features) and generalization (embeddings) together, allowing each component to specialize: the wide part becomes small because deep handles common cases, and deep doesn't overfit rare items because wide captures them. Ensembling requires both halves full-size.
