Does agent memory degrade when continuously consolidated?
Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
The promise of agent memory was straightforward: experience accumulates, gets distilled into reusable lessons, and agents become more capable over time. "Useful Memories Become Faulty When Continuously Updated by LLMs" (2605.12978) provides controlled evidence that this promise breaks. Under continuous consolidation, memory utility first rises, then degrades, and ultimately falls below the no-memory baseline. The agent ends up worse than if it had remembered nothing.
The cleanest demonstration uses ARC-AGI Stream: GPT-5.4 fails 54% of problems it had previously solved without memory, after those problems' solutions have been consolidated into the memory bank. The trajectories that produced the success are still there in raw form. The consolidation step itself is destroying the signal.
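One plausible reading of that figure is a regression rate: of the problems solved with no memory at all, the fraction that fail once memory is attached. The sketch below assumes that reading; the function name and the task IDs are hypothetical and are not taken from the paper.

```python
from typing import Iterable


def regression_rate(solved_without_memory: Iterable[str],
                    solved_with_memory: Iterable[str]) -> float:
    """Fraction of no-memory successes that are lost once memory is used."""
    previously_solved = set(solved_without_memory)
    if not previously_solved:
        return 0.0  # nothing was solved before, so nothing can regress
    lost = previously_solved - set(solved_with_memory)
    return len(lost) / len(previously_solved)


# Made-up task IDs, purely illustrative (not the paper's data).
baseline_solved = ["t1", "t2", "t3", "t4"]
with_memory_solved = ["t1", "t5"]
rate = regression_rate(baseline_solved, with_memory_solved)
```

Note that newly solved problems (here `t5`) do not offset the rate; it only counts what was lost.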
The paper localizes the failure to consolidation specifically through a clever control: keep the same trajectory pool, vary only the update schedule. Static-All (consolidate the entire pool in one pass) and Stream (consolidate batch-by-batch as trajectories arrive) produce qualitatively different end-state memories from identical inputs. Order and grouping of updates change what the memory becomes — but the underlying experience is fixed. Meanwhile, an episodic-only control that simply appends raw trajectories to context performs competitively with the consolidators. The experience is fine. The consolidation is the bug.
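The schedule control can be illustrated with a toy stand-in for the consolidator. This is a minimal sketch, not the paper's implementation: `toy_consolidate` is a hypothetical lossy merge that keeps only the most frequent lesson labels, and the labels themselves are invented. It only shows how an identical pool can yield different end-state memories once each update discards information.

```python
from collections import Counter
from typing import List


def toy_consolidate(memory: List[str], batch: List[str], budget: int = 1) -> List[str]:
    """Lossy merge (stand-in for an LLM): keep the `budget` most frequent labels.

    Counts are forgotten after each call, which is what makes the result
    depend on the update schedule. Ties keep first-seen order.
    """
    counts = Counter(memory + batch)
    ranked = sorted(counts, key=lambda lesson: -counts[lesson])  # stable sort
    return ranked[:budget]


def static_all(pool: List[str], budget: int = 1) -> List[str]:
    """Consolidate the entire pool in one pass."""
    return toy_consolidate([], pool, budget)


def stream(pool: List[str], batch_size: int = 2, budget: int = 1) -> List[str]:
    """Consolidate batch-by-batch as trajectories arrive."""
    memory: List[str] = []
    for start in range(0, len(pool), batch_size):
        memory = toy_consolidate(memory, pool[start:start + batch_size], budget)
    return memory


def episodic_only(pool: List[str]) -> List[str]:
    """No consolidation: raw trajectories are simply appended."""
    return list(pool)


# Hypothetical lesson labels extracted from five trajectories.
pool = ["mirror", "mirror", "flood-fill", "flood-fill", "mirror"]
static_memory = static_all(pool)   # sees all counts at once
stream_memory = stream(pool)       # sees counts one batch at a time
```

In this toy run the one-pass schedule keeps the majority lesson while the streaming schedule ends on the minority one, even though both consumed the same five trajectories.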
Three mechanisms drive the failure. First, misgrouping: agents pool episodes that do not share underlying structure before abstracting, producing principles that apply to nothing in particular. Second, applicability stripping: even when grouping is correct, the abstraction step drops the conditions under which a lesson holds, so overgeneralized entries interfere with neighboring tasks where they should not apply. Third, overfitting on narrow streams: when the input stream is repetitive, abstraction overfits to seen instances and generalizes poorly even within the same task.
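The second mechanism, applicability stripping, is easy to picture with a small data structure. The sketch below assumes a hypothetical `Lesson` record that carries an explicit applicability predicate; none of these names come from the paper. Dropping the predicate turns a correct, scoped lesson into one that fires on a neighboring task where it should not apply.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Task = Dict[str, str]


@dataclass
class Lesson:
    advice: str
    applies_when: Callable[[Task], bool]  # the condition under which the lesson holds


def strip_applicability(lesson: Lesson) -> Lesson:
    """Simulate an abstraction step that keeps the advice but drops its condition."""
    return Lesson(lesson.advice, lambda task: True)


def retrieve(memory: List[Lesson], task: Task) -> List[str]:
    """Return the advice of every lesson that claims to apply to `task`."""
    return [lesson.advice for lesson in memory if lesson.applies_when(task)]


conditioned = Lesson(
    advice="mirror the grid left-to-right",
    applies_when=lambda task: task.get("symmetry") == "horizontal",
)
stripped = strip_applicability(conditioned)

symmetric_task = {"symmetry": "horizontal"}
neighbor_task = {"symmetry": "none"}  # similar-looking task where the lesson is wrong

scoped_hits = retrieve([conditioned], neighbor_task)   # lesson stays silent
stripped_hits = retrieve([stripped], neighbor_task)    # lesson interferes
```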
The practical takeaway flips the default. Raw episodes should be treated as first-class evidence, not disposable material to be compressed away. Consolidation should be gated explicitly — selective, delayed, and grounded in trajectories that remain recoverable. The current default, where consolidation fires after every interaction, treats abstraction as cheap; the evidence shows it is costly and easily wrong. Continuously updated textual memory should be treated not as a reliable engine of self-improvement but as a fragile mechanism that can make more experience produce worse memory.
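A gated design along these lines might look like the following. It is a sketch under stated assumptions, not a method from the paper: the `GatedMemory` class, the `task_family` label, the `min_support` threshold, and the injected `summarize` callable are all hypothetical. Raw episodes are never deleted, consolidation waits until a group has enough support, and every lesson keeps pointers back to the episodes it was built from.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Episode:
    episode_id: int
    task_family: str
    trajectory: str


@dataclass
class Lesson:
    text: str
    source_ids: List[int]  # provenance: which raw episodes back this lesson


@dataclass
class GatedMemory:
    summarize: Callable[[List[Episode]], str]  # e.g. an LLM call, injected by the caller
    min_support: int = 3                       # delay: episodes required before abstracting
    episodes: List[Episode] = field(default_factory=list)
    lessons: Dict[str, Lesson] = field(default_factory=dict)

    def add(self, task_family: str, trajectory: str) -> Episode:
        """Store the raw episode; nothing is compressed away."""
        episode = Episode(len(self.episodes), task_family, trajectory)
        self.episodes.append(episode)
        return episode

    def maybe_consolidate(self, task_family: str) -> bool:
        """Consolidate one family only, and only once it has enough support."""
        group = [e for e in self.episodes if e.task_family == task_family]
        if len(group) < self.min_support:
            return False
        self.lessons[task_family] = Lesson(
            self.summarize(group), [e.episode_id for e in group]
        )
        return True

    def evidence_for(self, task_family: str) -> List[Episode]:
        """Recover the raw trajectories behind a lesson."""
        lesson = self.lessons.get(task_family)
        return [] if lesson is None else [self.episodes[i] for i in lesson.source_ids]


memory = GatedMemory(summarize=lambda eps: f"lesson from {len(eps)} episodes")
memory.add("mirror", "trajectory 1")
memory.add("mirror", "trajectory 2")
too_early = memory.maybe_consolidate("mirror")   # gate stays closed
memory.add("mirror", "trajectory 3")
consolidated = memory.maybe_consolidate("mirror")
```

Grouping by a given `task_family` label sidesteps the hard part: deciding which episodes actually share structure is exactly where misgrouping arises, so in practice that label would itself need to be validated.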
The deeper implication is uncomfortable for the field. Many agent-memory systems rely on the assumption that summarized experience is at worst lossy and at best generalizing. This paper shows that summarized experience is often actively harmful. Building reliable agentic memory requires LLMs that can consolidate without overwriting the evidence they depend on, and on this paper's evidence current LLMs cannot.
Paper: Useful Memories Become Faulty When Continuously Updated by LLMs
Related concepts in this collection
- Why do LLM agents ignore condensed experience summaries?
LLM agents faithfully learn from raw experience but systematically disregard condensed summaries of the same experience. This study investigates whether the problem lies in how summaries are made, how models process them, or whether models simply don't need them.
strong convergence: the "Faithful Self-Evolvers" paper finds agents *ignore* condensed memory; this paper finds the condensation step *creates faulty* memory — two papers triangulating on the same fragility from different angles
- Can agents learn better from their failures than successes?
Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
direct tension: ReasoningBank claims strategy-level distillation works; this paper says consolidation regresses below baseline; resolution may lie in whether applicability conditions are preserved through abstraction
- Can frozen language models continually improve through memory structure alone?
If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?
CLIN succeeds with causal abstractions; this paper suggests success depends on *what* gets abstracted — causal structure may survive consolidation where heuristic summaries do not
- Can agents learn from failure without updating their weights?
Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.
Reflexion's success may depend on its operating on raw episodes rather than consolidated ones
- Can three axes replace the short-term long-term memory split?
Does breaking agent memory into forms, functions, and dynamics provide a clearer framework than the traditional short-term/long-term distinction? This matters because current agent-memory literature lacks a unified vocabulary, making comparison between systems nearly impossible.
this paper identifies the evolution operator (consolidation step) as the failure point in the dynamics axis
- Can agents compress their own memory without losing critical details?
Explores whether agents can autonomously consolidate interaction history into structured memory schemas that reduce token overhead while preserving information needed for long-horizon reasoning and strategic reflection.
productive tension: DeepAgent's autonomous memory folding aims to give agents long-horizon capability through compression-and-strategic-reflection, but this note's inverted-U finding documents that LLM-as-consolidator regresses below the no-memory baseline. The conditions distinguishing safe folding from harmful consolidation are not yet characterized — DeepAgent's structured schema (episodic/working/tool tiers) plus autonomy of timing may avoid the misgrouping/applicability-stripping mechanisms that drive degradation, but this remains an open empirical question. Three-way tension when paired with [[distilling reasoning strategies from both successes and failures outperforms raw trajectories — and creates synergy with test-time scaling]].
- Is agent memory capacity or quality the real bottleneck?
While more storage seems like the obvious solution to memory problems, what if the real constraint is actually curation—deciding what to keep, discard, and retrieve without degrading performance?
exemplifies: the inverted-U degradation is a concrete instance of the quality/drift failure this note generalizes
Original note title
continuously consolidated agent memory follows an inverted-U utility curve — degrading below the no-memory baseline because consolidation is fragile