Why do different agent memory architectures make incompatible granularity claims?
This explores why agent memory papers keep disagreeing about the 'right' level of memory abstraction (workflow vs. step, dialogue vs. turn, episodic vs. consolidated) — and argues the disagreement is mostly an artifact of each study optimizing for a different kind of task.
Agent memory papers keep landing on different 'best' granularities because each one measures a different domain, so their claims only look incompatible. The clearest statement of this is that memory granularity is domain-conditional [Does agent memory work better at one level of abstraction?]: workflow-level memory wins where tasks are routine and vary mostly by arguments, causal-rule memory wins where the environment is the source of variance, and fine-grained state-action memory wins for spatially rich web tasks. A paper that benchmarked on web navigation and a paper that benchmarked on routine automation will reach opposite conclusions about abstraction, not because one is wrong, but because the optimal level tracks where the task's variance lives.
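The domain-conditional claim can be read as a lookup from "where the variance lives" to a memory granularity. The following is a minimal sketch of that reading only; the names `VarianceSource`, `Granularity`, and `pick_granularity` are hypothetical and do not come from any of the cited systems.

```python
from enum import Enum


class VarianceSource(Enum):
    """Where a task's variance comes from (labels are illustrative)."""
    ARGUMENTS = "arguments"      # routine tasks that differ mostly by arguments
    ENVIRONMENT = "environment"  # the environment's causal structure varies
    UI_STATE = "ui_state"        # fine-grained, spatially rich web state


class Granularity(Enum):
    WORKFLOW = "workflow"
    CAUSAL_RULE = "causal_rule"
    STATE_ACTION = "state_action"


# Mapping taken from the paragraph above; a real system would presumably
# estimate the variance source rather than receive it as a label.
GRANULARITY_BY_VARIANCE = {
    VarianceSource.ARGUMENTS: Granularity.WORKFLOW,
    VarianceSource.ENVIRONMENT: Granularity.CAUSAL_RULE,
    VarianceSource.UI_STATE: Granularity.STATE_ACTION,
}


def pick_granularity(source: VarianceSource) -> Granularity:
    """Return the granularity the domain-conditional claim would favor."""
    return GRANULARITY_BY_VARIANCE[source]
```

Under this reading, two benchmarks that call `pick_granularity` with different variance sources get different answers without contradicting each other.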
A second source of apparent conflict is that 'granularity' isn't one axis. RAISE splits working memory into four components organized along two separate axes: dialogue-level (conversation history, scratchpad) versus turn-level (examples, trajectory) [How should agent memory split across time scales?]. So one architecture's 'coarse' memory and another's 'fine' memory may not even be describing the same dimension. Memory management itself also bifurcates, into an explicit hot path where the agent decides via tool calls and an implicit background path triggered programmatically [How should agents decide what memories to keep?]. The result is architectures that draw their granularity lines in completely different places by design.
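To make the dialogue-level versus turn-level split concrete, here is a small sketch of a four-component working memory. It is not RAISE's implementation: the class name `WorkingMemory`, the method `start_turn`, and the policy of rebuilding the turn-level components on every turn are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    # Dialogue-level components: persist across the turns of one conversation.
    conversation_history: list[str] = field(default_factory=list)
    scratchpad: dict[str, str] = field(default_factory=dict)
    # Turn-level components: rebuilt for each turn (an assumption of this sketch).
    examples: list[str] = field(default_factory=list)
    trajectory: list[str] = field(default_factory=list)

    def start_turn(self, user_message: str, retrieved_examples: list[str]) -> None:
        """Append to dialogue-level state and reset turn-level state."""
        self.conversation_history.append(user_message)
        self.examples = list(retrieved_examples)
        self.trajectory = []
```

The two groups of fields have different update policies, which is why a claim about "coarse" or "fine" memory needs to say which group it refers to.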
The more interesting reason, though, is that more aggressive consolidation isn't strictly better, so there's a real tension being argued over, not just a vocabulary mismatch. Continuously consolidating memory follows an inverted-U curve: aggressive abstraction eventually performs worse than just keeping raw episodes. One model failed 54% of previously solved problems after consolidation, with the failures traced to misgrouping, applicability-stripping, and overfitting [Does agent memory degrade when continuously consolidated?]. That's why some architectures defensively keep memory episodic and low-abstraction while others push for structured schemas [Can agents compress their own memory without losing critical details?]. They're sitting on opposite sides of the same curve, and each can produce evidence for its position.
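A figure like "failed 54% of previously solved problems" is naturally read as a regression rate over the set of problems solved before consolidation. The sketch below shows that arithmetic under this assumption; the function name `regression_rate` and the problem identifiers are hypothetical, and the cited study may define its metric differently.

```python
def regression_rate(solved_before: set[str], solved_after: set[str]) -> float:
    """Fraction of previously solved problems that are no longer solved.

    Assumes each problem has a stable identifier across both evaluations.
    """
    if not solved_before:
        raise ValueError("no previously solved problems to regress on")
    lost = solved_before - solved_after
    return len(lost) / len(solved_before)
```

For example, `regression_rate({"p1", "p2", "p3", "p4"}, {"p1", "p5"})` returns `0.75`: three of the four earlier solves were lost, and the newly solved `p5` does not offset them.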
What dissolves the whole debate is the argument that granularity is the wrong thing to fix at all. FluxMem reframes memory effectiveness as a connectivity problem: usefulness comes from links between co-activated units forming a reachable subgraph, not from the level at which things are stored [Is agent memory a storage problem or a connectivity problem?]. It also shows that letting topology form, refine, and prune through execution feedback outperforms fixed retrieval schemes across three benchmarks by aligning abstraction dynamically [Should agent memory adapt dynamically based on execution feedback?]. Seen this way, a static granularity claim is just a frozen snapshot of an abstraction level that should be moving. This connects to the broader finding that the real bottleneck was never storage capacity or even the level of detail, but curation: staleness, drift, and over-generalization are what actually degrade performance [Is agent memory capacity or quality the real bottleneck?].
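The connectivity view can be illustrated with a toy graph in which links between co-activated memory units strengthen after successful executions, weaken after failures, and are pruned when they become too weak. This is a sketch of the general idea, not FluxMem's algorithm: the class `MemoryGraph`, the additive update, and the `prune_below` threshold are all assumptions.

```python
from collections import defaultdict, deque
from itertools import combinations


class MemoryGraph:
    def __init__(self, prune_below: float = 0.2):
        # Undirected link weights keyed by the pair of unit ids.
        self.weights: dict[frozenset, float] = {}
        self.prune_below = prune_below  # hypothetical pruning threshold

    def feedback(self, co_activated: list[str], success: bool, step: float = 0.5) -> None:
        """Strengthen links among co-activated units on success, weaken on failure."""
        delta = step if success else -step
        for a, b in combinations(sorted(set(co_activated)), 2):
            key = frozenset((a, b))
            weight = self.weights.get(key, 0.0) + delta
            if weight < self.prune_below:
                self.weights.pop(key, None)  # prune weak or never-formed links
            else:
                self.weights[key] = weight

    def reachable(self, start: str) -> set[str]:
        """Units reachable from `start` over surviving links (breadth-first search)."""
        neighbours = defaultdict(set)
        for key in self.weights:
            a, b = tuple(key)
            neighbours[a].add(b)
            neighbours[b].add(a)
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen
```

In this toy version, what a unit contains never changes; only the set returned by `reachable` does, which is the sense in which usefulness depends on topology and not on storage level.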
So the incompatible claims aren't a literature in disarray. They're what you get when researchers fix a single granularity, test it on one domain's variance structure, and report the optimum — when the actual lesson across the corpus is that the optimum is conditional, multi-axis, curve-shaped, and ideally not fixed at all.
Sources (8 notes)
Workflow-level memory wins in routine-rich domains, causal-rule memory in environment-rich domains, and state-action memory in spatially-rich web tasks. The optimal abstraction depends on whether task variance comes from arguments, causal structure, or fine-grained UI state.
RAISE shows that agent memory consists of four components organized by two design axes: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.
Memory management decomposes into explicit hot-path (agent decides via tool calling) and implicit background (programmatically triggered) paths. Each approach trades context-sensitivity for reliability differently across generation, storage, retrieval, and deletion.
LLM-consolidated textual memory degrades as experience accumulates, eventually performing worse than episodic-only retention. GPT-5.4 failed 54% of previously-solved problems after consolidation, with three mechanisms identified: misgrouping, applicability stripping, and overfitting on narrow streams.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
FluxMem shows that memory usefulness is determined by links between co-activated units forming an accessible subgraph, not by what is stored. Storage is necessary but inert; topology determines whether useful memories are reachable at decision time.
FluxMem demonstrates that adaptive memory topology—where links form, refine, and consolidate based on closed-loop execution feedback—consistently reaches state-of-the-art across three distinct benchmarks. Dynamic connectivity outperforms fixed retrieval by aligning abstraction and eliminating interference.
The core challenge in agent memory is not accumulating more data but managing what exists—preventing staleness, drift, contamination, and over-generalization. Adding capacity without curation actively makes performance worse.