How should agents compress episodic interactions into working memory without accumulation?
This explores how agents can fold a growing stream of past interactions into a compact working memory — and why naive 'just keep consolidating everything' actually makes them worse, not better.
The goal is a compact working memory that neither bloats nor degrades as experience piles up. The corpus has a surprisingly pointed answer: the danger isn't running out of space, it's compressing *carelessly*. The clearest warning sign is the inverted-U curve: agents that continuously re-consolidate their textual memory improve for a while, then get *worse than having no memory at all* [Does agent memory degrade when continuously consolidated?]. One study found GPT-5.4 failing 54% of problems it had previously solved, traced to three failure modes: misgrouping unrelated events, stripping away the conditions that made a lesson applicable, and overfitting to a narrow recent stream. The same fragile pattern shows up when a single model handles generation, compression, and response all at once [Can a single model replace retrieval for long-term conversation memory?]. So 'without accumulation' is the right instinct, but the cure (aggressive merging) can be worse than the disease.
The most promising designs avoid this by *structuring* memory rather than flattening it. DeepAgent's autonomous memory folding doesn't dump history into one blob; it sorts it into distinct episodic, working, and tool schemas, which is what lets compression happen without the degradation that plagues naive consolidation [Can agents compress their own memory without losing critical details?]. RAISE pushes the same idea further, showing that agent memory naturally splits into four components across two granularities, dialogue-level (conversation history, scratchpad) and turn-level (examples, task trajectory), and that each component wants its *own* update and eviction policy [How should agent memory split across time scales?]. The lesson across both: compress within a type, never across types.
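The "compress within a type, never across types" rule can be sketched as a store with one queue per memory type, each with its own capacity and its own compression function. This is a minimal illustration, not DeepAgent's or RAISE's implementation: the class names, the capacities, and the string-joining "summarizer" (a stand-in for an LLM call) are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List


@dataclass
class TypedStore:
    """One memory type with its own capacity and its own compression policy."""
    capacity: int
    compress: Callable[[List[str]], str]
    items: Deque[str] = field(default_factory=deque)

    def add(self, entry: str) -> None:
        self.items.append(entry)
        if len(self.items) > self.capacity:
            # Fold the oldest entries (at least two) into one entry of the SAME type.
            n_fold = max(2, len(self.items) // 2)
            oldest = [self.items.popleft() for _ in range(n_fold)]
            self.items.appendleft(self.compress(oldest))


def join_summary(label: str) -> Callable[[List[str]], str]:
    # Hypothetical stand-in for an LLM summarizer: concatenates and tags by type.
    return lambda entries: f"[{label} summary] " + " | ".join(entries)


class StructuredMemory:
    """Routes each entry to its own typed store; stores never merge with each other."""

    def __init__(self) -> None:
        self.stores: Dict[str, TypedStore] = {
            "episodic": TypedStore(capacity=4, compress=join_summary("episodic")),
            # Working memory uses a different policy: keep only the newest folded entry.
            "working": TypedStore(capacity=2, compress=lambda entries: entries[-1]),
            "tool": TypedStore(capacity=3, compress=join_summary("tool")),
        }

    def add(self, kind: str, entry: str) -> None:
        self.stores[kind].add(entry)  # an unknown kind raises KeyError on purpose

    def snapshot(self) -> Dict[str, List[str]]:
        return {kind: list(store.items) for kind, store in self.stores.items()}


memory = StructuredMemory()
for i in range(5):
    memory.add("episodic", f"e{i}")
memory.add("tool", "t0")
snapshot = memory.snapshot()
```

Because each store only ever sees its own entries, a tool observation can never be folded into an episodic summary, which is one way to rule out the cross-type misgrouping described earlier.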
The sharpest insight is that compression should be *asymmetric*. SkillRL treats successful episodes as concrete demonstrations worth keeping verbatim, but distills failures into abstract lessons, and it beats uniform consolidation while using far less context [Should successful and failed episodes be processed differently?]. This echoes a subtler point from Reflexion: some material should *resist* compression entirely. Verbal self-reflections stored in episodic memory stay useful precisely because they're kept uncompressed and tied to an unambiguous success/failure signal; squeeze them and you lose the diagnosis [Can agents learn from failure without updating their weights?]. So 'without accumulation' doesn't mean 'compress everything'; it means knowing what to abstract, what to keep raw, and what to drop.
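A rough sketch of the asymmetric rule, assuming a binary success flag per episode: successes are stored verbatim as demonstrations, failures pass through a distillation step. The `Episode` and `MemoryEntry` types and the `distill_failure` placeholder are hypothetical; in a real system the distillation would presumably be a model call, and nothing here reproduces SkillRL's or Reflexion's actual code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    task: str
    trajectory: List[str]
    succeeded: bool  # unambiguous success/failure signal from the environment


@dataclass
class MemoryEntry:
    kind: str     # "demonstration" (kept raw) or "lesson" (abstracted)
    task: str     # kept on every entry so a lesson retains its applicability condition
    content: str


def distill_failure(episode: Episode) -> str:
    # Placeholder for an LLM-written lesson; here it just names the last step taken.
    last = episode.trajectory[-1] if episode.trajectory else "no actions recorded"
    return f"When attempting '{episode.task}', avoid ending on: {last}"


def consolidate(
    episodes: List[Episode],
    distill: Callable[[Episode], str] = distill_failure,
) -> List[MemoryEntry]:
    entries: List[MemoryEntry] = []
    for ep in episodes:
        if ep.succeeded:
            # Successes are stored verbatim as concrete demonstrations.
            entries.append(MemoryEntry("demonstration", ep.task, "\n".join(ep.trajectory)))
        else:
            # Failures are compressed into an abstract lesson.
            entries.append(MemoryEntry("lesson", ep.task, distill(ep)))
    return entries


entries = consolidate([
    Episode("book flight", ["search", "select", "pay"], succeeded=True),
    Episode("book hotel", ["search", "pay without dates"], succeeded=False),
])
```

Keeping the task on each lesson is a deliberate choice in this sketch: it is a cheap guard against stripping away the conditions under which the lesson applies.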
There's also a school of thought that says don't make the agent compress its own memory at all. An external, RL-trained context manager can prune for a frozen agent better than the agent can for itself, and crucially, it adapts the compression rate to the agent's reliability: strong agents get high-fidelity preservation, while weak agents need aggressive pruning to stay coherent [Can external managers compress context better than frozen agents?]. AgentFly takes the orthogonal route of treating memory operations *as* the learning mechanism, with separate case, subtask, and tool modules doing credit assignment without ever touching the model's weights [Can agents learn continuously from experience without updating weights?].
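To make the reliability-adaptive idea concrete, here is a small sketch in which the fraction of context kept grows with the agent's reliability. The manager described above is learned with RL; this example swaps in a hand-written linear rule purely to show the direction of the adaptation. The function names, the `floor` and `ceiling` parameters, and the per-message relevance scores are all assumptions.

```python
from typing import List, Tuple


def keep_fraction(reliability: float, floor: float = 0.2, ceiling: float = 0.9) -> float:
    """Map agent reliability in [0, 1] to the fraction of context to keep.

    Hypothetical linear rule: weak agents keep little, strong agents keep most.
    """
    r = min(max(reliability, 0.0), 1.0)  # clamp out-of-range inputs
    return floor + (ceiling - floor) * r


def prune(context: List[Tuple[str, float]], reliability: float) -> List[str]:
    """Keep the most relevant messages, preserving chronological order.

    `context` holds (text, relevance) pairs, oldest first. The relevance scores
    are assumed to come from some external scorer not shown here.
    """
    if not context:
        return []
    k = max(1, round(keep_fraction(reliability) * len(context)))
    by_relevance = sorted(range(len(context)), key=lambda i: context[i][1], reverse=True)
    kept = sorted(by_relevance[:k])  # restore original order
    return [context[i][0] for i in kept]


context = [(f"m{i}", i / 10) for i in range(10)]
for_strong_agent = prune(context, reliability=1.0)
for_weak_agent = prune(context, reliability=0.0)
```

A learned manager would replace `keep_fraction` (and likely the relevance scorer) with a trained policy; the fixed rule here only illustrates the interface.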
One less obvious finding: a strand of this research argues the real bottleneck was never storage capacity but *compute*, specifically the cost of transforming evicted context into internal state, with performance improving the more 'consolidation passes' you spend, like a sleep phase that runs longer on harder problems [Is long-context bottleneck really about memory or compute?]. And at the architecture level, Titans builds this in directly: it splits short-term attention from a long-term neural memory that preferentially stores *surprising* tokens, a principled stance on what's worth keeping, rather than compressing uniformly and hoping [Can neural memory modules scale language models beyond attention limits?]. The throughline of the whole corpus: good memory compression is selective, structured, and asymmetric. Uniform consolidation is the trap, not the solution.
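The "store what is surprising" stance can be illustrated with a toy gate. This is not the Titans mechanism, which lives inside a neural memory module; as a stand-in, the sketch scores surprise as the negative log-probability of a token under a running count model and stores only tokens above a threshold. The class name, the smoothing scheme, and the 2-bit threshold are all assumptions.

```python
import math
from collections import Counter
from typing import List


class SurpriseGatedMemory:
    """Stores a token only if it is surprising under a running frequency model."""

    def __init__(self, threshold_bits: float = 2.0) -> None:
        self.counts: Counter = Counter()
        self.total = 0
        self.threshold_bits = threshold_bits
        self.stored: List[str] = []

    def surprise(self, token: str) -> float:
        # Add-one smoothed -log2 p(token); one slot is reserved for unseen tokens.
        vocab = len(self.counts) + 1
        p = (self.counts[token] + 1) / (self.total + vocab)
        return -math.log2(p)

    def observe(self, token: str) -> bool:
        """Score the token, store it if surprising, then update the model."""
        keep = self.surprise(token) >= self.threshold_bits
        if keep:
            self.stored.append(token)
        self.counts[token] += 1
        self.total += 1
        return keep


gate = SurpriseGatedMemory()
for token in ["the"] * 20 + ["anomaly"]:
    gate.observe(token)
```

In this toy run the repeated token never crosses the threshold while the rare one does, so memory grows with novelty rather than with stream length.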
Sources (10 notes)
LLM-consolidated textual memory degrades as experience accumulates, eventually performing worse than episodic-only retention. GPT-5.4 failed 54% of previously-solved problems after consolidation, with three mechanisms identified: misgrouping, applicability stripping, and overfitting on narrow streams.
COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows continuous reprocessing follows an inverted-U curve, degrading below the no-memory baseline due to misgrouping, context loss, and overfitting.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
RAISE shows that agent memory consists of four components organized by two design axes: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
An external RL-trained manager can adaptively prune context for frozen agents, with the key insight that stronger agents benefit from high-fidelity preservation while weaker agents need aggressive compression to stay reliable.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.