Does agent memory degrade when continuously consolidated?

Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.

Note · 2026-05-18 · sourced from Memory

The promise of agent memory was straightforward: experience accumulates, gets distilled into reusable lessons, agents become more capable over time. "Useful Memories Become Faulty When Continuously Updated by LLMs" (2605.12978) provides controlled evidence that this promise breaks. Under continuous consolidation, memory utility first rises, then degrades, and ultimately falls below the no-memory baseline. The agent ends up worse than if it had remembered nothing.

The cleanest demonstration uses ARC-AGI Stream: GPT-5.4 fails 54% of problems it had previously solved without memory, after those problems' solutions have been consolidated into the memory bank. The trajectories that produced the success are still there in raw form. The consolidation step itself is destroying the signal.

The paper localizes the failure to consolidation specifically through a clever control: keep the same trajectory pool and vary only the update schedule. Static-All (consolidate the entire pool in one pass) and Stream (consolidate batch-by-batch as trajectories arrive) produce qualitatively different end-state memories from identical inputs. The order and grouping of updates change what the memory becomes, even though the underlying experience is fixed. Meanwhile, an episodic-only control that simply appends raw trajectories to context performs competitively with the consolidators. The experience is fine. The consolidation is the bug.
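
The schedule comparison above can be sketched in a few lines of Python. This is only an illustration of the experimental control, not the paper's code: `toy_consolidate` is a hypothetical stand-in for an LLM consolidator that rewrites every entry lossily on each pass (here, by dropping the last word), and the function names and the example pool are assumptions made for this sketch.

```python
from typing import Callable, List, Sequence

Trajectory = str
Memory = List[str]
# Hypothetical consolidator signature: (current memory, new batch) -> updated memory.
Consolidator = Callable[[Memory, Sequence[Trajectory]], Memory]


def static_all(pool: Sequence[Trajectory], consolidate: Consolidator) -> Memory:
    """Consolidate the entire trajectory pool in a single pass."""
    return consolidate([], pool)


def stream(pool: Sequence[Trajectory], consolidate: Consolidator, batch_size: int) -> Memory:
    """Consolidate batch by batch, feeding the previous memory back in each time."""
    memory: Memory = []
    for start in range(0, len(pool), batch_size):
        memory = consolidate(memory, pool[start:start + batch_size])
    return memory


def episodic_only(pool: Sequence[Trajectory]) -> Memory:
    """Control: append raw trajectories without any abstraction step."""
    return list(pool)


def toy_consolidate(memory: Memory, batch: Sequence[Trajectory]) -> Memory:
    """Toy lossy rewrite (assumption): every entry loses its last word per pass."""
    merged = list(memory) + list(batch)
    return [" ".join(entry.split()[:-1]) or entry for entry in merged]


pool = [
    "rotate grid when symmetric",
    "fill holes if enclosed",
    "copy color by row",
    "mirror shape along axis",
]
print(static_all(pool, toy_consolidate))   # each entry rewritten once
print(stream(pool, toy_consolidate, 2))    # early entries rewritten twice
print(episodic_only(pool))                 # raw experience, untouched
```

With this toy rewrite, the same four trajectories yield different memories under the two schedules, because entries that arrive early in the stream are rewritten more often. A real LLM consolidator fails in less mechanical ways, but the control has the same shape: fixed inputs, varied schedule.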

Three mechanisms drive the failure. First, misgrouping: agents pool episodes that do not share underlying structure before abstracting, producing principles that apply to nothing in particular. Second, applicability stripping: even when grouping is correct, the abstraction step drops the conditions under which a lesson holds, so overgeneralized entries interfere with neighboring tasks where they should not apply. Third, overfitting on narrow streams: when the input stream is repetitive, abstraction overfits to seen instances and generalizes poorly even within the same task.
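
One way to picture applicability stripping is as a data-structure problem: a lesson stored as bare text has nowhere to keep the conditions under which it holds. The sketch below is a hypothetical memory entry, not something the paper prescribes. The `Lesson` fields, the feature tags, and the rule of withholding unscoped lessons are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Lesson:
    text: str
    # Hypothetical task-feature tags under which the lesson is believed to hold.
    applies_when: FrozenSet[str]
    # Ids of the raw trajectories the lesson was abstracted from.
    source_episodes: Tuple[int, ...]


def applicable(lessons: List[Lesson], task_features: FrozenSet[str]) -> List[Lesson]:
    """Return only lessons whose stated conditions are all present in the task.

    Assumption of this sketch: a lesson with no conditions is treated as
    overgeneralized and withheld, rather than applied everywhere.
    """
    return [
        lesson for lesson in lessons
        if lesson.applies_when and lesson.applies_when <= task_features
    ]


lessons = [
    Lesson("rotate the grid", frozenset({"symmetric"}), (1, 2)),
    Lesson("always fill holes", frozenset(), (3,)),  # conditions were stripped
]
for lesson in applicable(lessons, frozenset({"symmetric", "small"})):
    print(lesson.text)
```

Keeping `source_episodes` alongside each lesson also makes misgrouping auditable, since the episodes that were pooled to produce an abstraction can be inspected later.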

The practical takeaway flips the default. Raw episodes should be treated as first-class evidence, not disposable material to be compressed away. Consolidation should be gated explicitly — selective, delayed, and grounded in trajectories that remain recoverable. The current default, where consolidation fires after every interaction, treats abstraction as cheap; the evidence shows it is costly and easily wrong. Continuously updated textual memory should be treated not as a reliable engine of self-improvement but as a fragile mechanism that can make more experience produce worse memory.
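
A minimal sketch of what gated consolidation could look like, assuming a hypothetical `summarize` callable (in practice an LLM call) and a hypothetical `group_key` that labels episodes sharing underlying structure. The threshold `min_support` and the class layout are illustrative choices, not the paper's design. The two properties the sketch tries to capture are that raw episodes are never deleted and that every lesson is rebuilt from raw episodes rather than from an earlier summary.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Episode:
    episode_id: int
    group_key: str      # hypothetical label for shared task structure
    trajectory: str
    succeeded: bool


@dataclass
class GatedMemory:
    summarize: Callable[[List[Episode]], str]  # stand-in for an LLM call
    min_support: int = 3                       # delay: wait for enough evidence
    episodes: List[Episode] = field(default_factory=list)
    lessons: Dict[str, dict] = field(default_factory=dict)

    def record(self, episode: Episode) -> None:
        """Store the raw episode; nothing is ever compressed away."""
        self.episodes.append(episode)

    def maybe_consolidate(self) -> List[str]:
        """Consolidate only groups with enough successful episodes and new evidence."""
        by_group: Dict[str, List[Episode]] = defaultdict(list)
        for ep in self.episodes:
            if ep.succeeded:
                by_group[ep.group_key].append(ep)
        updated = []
        for key, eps in by_group.items():
            ids = [ep.episode_id for ep in eps]
            if len(eps) < self.min_support:
                continue  # gate: not enough support yet
            if key in self.lessons and self.lessons[key]["sources"] == ids:
                continue  # gate: no new evidence since the last update
            # Rebuild from raw episodes, never from the previous summary.
            self.lessons[key] = {"text": self.summarize(eps), "sources": ids}
            updated.append(key)
        return updated

    def evidence_for(self, key: str) -> List[Episode]:
        """Recover the raw trajectories behind a lesson."""
        ids = set(self.lessons[key]["sources"])
        return [ep for ep in self.episodes if ep.episode_id in ids]


memory = GatedMemory(summarize=lambda eps: f"lesson from {len(eps)} episodes")
for i in range(3):
    memory.record(Episode(i, "grid-rotation", f"trajectory {i}", succeeded=True))
    print(memory.maybe_consolidate())
```

In this sketch nothing is consolidated until the third supporting episode arrives, and calling `maybe_consolidate` again without new evidence leaves the lesson untouched.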

The deeper implication is uncomfortable for the field. Many agent-memory systems rely on the assumption that summarized experience is at worst lossy and at best generalizing. This paper shows it is often actively harmful. Building reliable agentic memory requires LLMs that can consolidate without overwriting the evidence they depend on, and on this evidence current LLMs cannot.


Paper: Useful Memories Become Faulty When Continuously Updated by LLMs

Original note title

continuously consolidated agent memory follows an inverted-U utility curve — degrading below the no-memory baseline because consolidation is fragile