Useful Memories Become Faulty When Continuously Updated by LLMs
Learning from past experience benefits from two complementary forms of memory: episodic traces (raw trajectories of what happened) and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it continuously updates with new interactions, promising self-improving agents without parameter updates. Yet we find that such consolidated memories produced by today's LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. More surprisingly, even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. We trace the regression to the consolidation step rather than the underlying experience: the same trajectories yield qualitatively different memories under different update schedules, and an episodic-only control that simply retains those trajectories remains competitive with the consolidators we test. In a controlled ARC-AGI Stream environment that exposes Retain, Delete, and Consolidate actions, agents left to choose among these actions (the auto regime) preserve raw episodes by default and double the accuracy of their forced-consolidation counterparts; disabling consolidation entirely (episodic management only) matches the auto regime. Practically, robust agent memory should treat raw episodes as first-class evidence and gate consolidation explicitly rather than firing it after every interaction. Looking forward, reliable agentic memory will require LLMs that can consolidate without overwriting the evidence they depend on.
Consolidation is fragile in a second sense: the same set of trajectories can yield qualitatively different memories depending on the order and grouping of updates. Consolidating the whole trajectory pool in one pass (Static-All) versus streaming it batch-by-batch (Stream) produces different end states; updates on one task overwrite memory of another; and a stream of repeated near-duplicates causes the memory to overfit to seen instances and generalize poorly within the same task. Meanwhile, an episodic-only control that consumes the same trajectories without abstracting them—appending raw rollouts to context as in-context demonstrations—is already competitive with the lesson-style consolidators we test. Because the trajectory pool is held fixed across these comparisons, the variance across schedules and the episodic-only control together point to the consolidation step itself, not the underlying experience, as the source of the failure.
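The three regimes compared above can be made concrete with a minimal sketch. It assumes a hypothetical `toy_consolidate` function standing in for the LLM rewrite step: it keeps one lesson per task and truncates memory to a fixed capacity on every call. The function names, the `task:index` trajectory format, and the capacity limit are illustrative assumptions, not the consolidators or the setup evaluated in this work. The point is only that, with the trajectory pool held fixed, the update schedule alone can change the end state.

```python
from typing import Callable, List, Sequence

Trajectory = str            # assumed format: "<task>:<index>"
Memory = List[str]
Consolidator = Callable[[Memory, Sequence[Trajectory]], Memory]


def static_all(pool: Sequence[Trajectory], consolidate: Consolidator) -> Memory:
    """Consolidate the whole trajectory pool in a single pass."""
    return consolidate([], pool)


def stream(pool: Sequence[Trajectory], consolidate: Consolidator,
           batch_size: int) -> Memory:
    """Consolidate batch by batch, feeding the previous memory back in."""
    memory: Memory = []
    for start in range(0, len(pool), batch_size):
        memory = consolidate(memory, pool[start:start + batch_size])
    return memory


def episodic_only(pool: Sequence[Trajectory]) -> Memory:
    """Control: keep raw trajectories, no abstraction at all."""
    return list(pool)


def toy_consolidate(memory: Memory, batch: Sequence[Trajectory],
                    capacity: int = 2) -> Memory:
    """Hypothetical stand-in for an LLM rewrite (NOT a real consolidator).

    Adds one lesson per unseen task, then keeps only the newest
    `capacity` lessons, so older lessons can be overwritten.
    """
    lessons = list(memory)
    for trajectory in batch:
        task = trajectory.split(":")[0]
        lesson = f"lesson({task})"
        if lesson not in lessons:
            lessons.append(lesson)
    return lessons[-capacity:]


pool = ["A:1", "B:1", "C:1", "A:2"]
one_pass = static_all(pool, toy_consolidate)
streamed = stream(pool, toy_consolidate, batch_size=1)
raw = episodic_only(pool)
```

Here `one_pass` ends with lessons for tasks B and C, while `streamed` ends with lessons for C and A, because the streamed memory dropped its lesson for A before the second A trajectory arrived. The episodic-only control keeps all four trajectories unchanged.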
We further identify three mechanisms behind faulty memory. First, agents misgroup experiences before abstracting them, pooling episodes that do not share underlying structure. Second, even when grouping is correct, abstraction can strip the applicability conditions of a lesson, so that overgeneralized entries interfere with neighboring tasks. Third, when the input stream is narrow, abstraction overfits to seen instances. Together, these failure modes weaken the boundary between what should be generalized, what should remain task-specific, and what should be preserved as raw episodic evidence.
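The second mechanism, loss of applicability conditions, can be illustrated with a small sketch. It assumes a hypothetical `Lesson` record that stores the task features a lesson requires, and a hypothetical `retrieve` function that returns only lessons whose conditions are satisfied. The feature names and lesson text are invented for illustration and do not come from the experiments.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Lesson:
    text: str
    # Features a task must have for the lesson to apply.
    # An empty set means the lesson claims to apply everywhere.
    conditions: FrozenSet[str] = frozenset()


def retrieve(lessons: List[Lesson], task_features: FrozenSet[str]) -> List[Lesson]:
    """Return lessons whose conditions are all present in the task."""
    return [lesson for lesson in lessons if lesson.conditions <= task_features]


# Illustrative lesson with its applicability conditions intact.
scoped = Lesson("Fill enclosed regions with the border color.",
                frozenset({"grid", "enclosed_regions"}))
# The same advice after abstraction has stripped the conditions.
stripped = Lesson(scoped.text)

# A neighboring task that shares the grid format but not the structure.
neighbor_task = frozenset({"grid", "symmetry"})

scoped_hits = retrieve([scoped], neighbor_task)
stripped_hits = retrieve([stripped], neighbor_task)
```

Under these assumptions the scoped lesson is not retrieved for the neighboring task, while the stripped version is, which is the kind of interference described above.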
Persistent memory is meant to let LLM agents move beyond static competence: experience is stored, compressed into reusable lessons, and carried forward. We identify an issue that may undermine this promise: across agent benchmarks and ARC-AGI Stream, continuously updated textual memory can become less useful as experience accumulates, and in the cleanest case an agent becomes worse on the very problems its memory was built from. These findings suggest that raw episodes should be treated as first-class evidence, not disposable material to be compressed away. Abstraction should be selective, delayed, and grounded in recoverable trajectories. Until agents can control when and how to consolidate experience, continuously updated textual memory should be treated not as a reliable engine of self-improvement, but as a fragile mechanism in which more experience can produce worse memory.
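One way to read "selective, delayed, and grounded in recoverable trajectories" is sketched below. This is a minimal illustration under stated assumptions, not the ARC-AGI Stream environment or its action interface: the `GatedMemory` class, the `min_support` threshold, and the `summarize` callback (a placeholder for an LLM call) are all hypothetical. Raw episodes are retained by default, consolidation fires only once a task has enough supporting episodes, and every lesson keeps pointers to the episodes it was derived from.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Episode:
    episode_id: int
    task: str
    trajectory: str


@dataclass
class Lesson:
    text: str
    source_ids: Tuple[int, ...]  # pointers back to the raw episodes


@dataclass
class GatedMemory:
    min_support: int = 3  # assumed gate: episodes needed before abstracting
    episodes: Dict[int, Episode] = field(default_factory=dict)
    lessons: List[Lesson] = field(default_factory=list)

    def retain(self, episode: Episode) -> None:
        """Default action: store the raw episode unchanged."""
        self.episodes[episode.episode_id] = episode

    def delete(self, episode_id: int) -> bool:
        """Delete an episode unless a lesson still cites it as evidence."""
        if any(episode_id in lesson.source_ids for lesson in self.lessons):
            return False
        return self.episodes.pop(episode_id, None) is not None

    def maybe_consolidate(
        self, task: str, summarize: Callable[[List[Episode]], str]
    ) -> Optional[Lesson]:
        """Consolidate only when the gate passes; otherwise do nothing."""
        support = [e for e in self.episodes.values() if e.task == task]
        if len(support) < self.min_support:
            return None
        lesson = Lesson(summarize(support),
                        tuple(e.episode_id for e in support))
        self.lessons.append(lesson)
        return lesson


memory = GatedMemory(min_support=2)
memory.retain(Episode(1, "t1", "trajectory a"))
too_early = memory.maybe_consolidate("t1", lambda eps: "unused")
memory.retain(Episode(2, "t1", "trajectory b"))
lesson = memory.maybe_consolidate("t1", lambda eps: f"from {len(eps)} episodes")
```

In this sketch the first consolidation attempt returns `None` because only one episode supports the task, and the resulting lesson can always be traced back to episodes 1 and 2, which `delete` refuses to remove while the lesson exists.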
At a conceptual level, agent memory and retrieval-augmented generation (RAG) overlap substantially: both construct, organize, and leverage auxiliary information stores to extend the capabilities of LLMs and agents beyond their native parametric knowledge. Despite this technical convergence, the two paradigms have historically been distinguished by the contexts in which they are applied. Classical RAG techniques primarily augment an LLM with access to static knowledge sources, whether flat document stores, structured knowledge bases, or large corpora externally indexed to support retrieval on demand. In contrast, agent memory systems are instantiated within an agent's ongoing interaction with an environment, continuously incorporating new information generated by the agent's own actions and environmental feedback into a persistent memory base. A more practical (though not perfectly separable) distinction lies in the task domain. RAG is predominantly applied to augment LLMs with large, externally sourced context for individual inference tasks. By contrast, agent memory systems are typically evaluated in settings requiring sustained multi-turn interaction, temporal dependency, or environment-driven adaptation.