How does memory folding enable agents to reconsider strategies mid-task?
This explores how an agent compressing its own interaction history into structured memory (DeepAgent's "memory folding") creates the room — and the vantage point — to step back and rethink its approach partway through a task.
DeepAgent's memory folding gives the direct answer (see "Can agents compress their own memory without losing critical details?"). Rather than dragging an ever-growing transcript forward, the agent consolidates what happened into three separate schemas: episodic (what occurred), working (what's relevant now), and tool memory (what actions are available). Two properties of folding enable mid-task reconsideration. First, folding frees up the token budget that a raw history would otherwise consume, so the agent has headroom to plan instead of just react. Second, and more subtly, the act of summarizing forces a structured view of the trajectory so far, and that structured view is exactly what the agent needs to ask "is this strategy working?" rather than blindly continuing.
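To make the two effects concrete, here is a minimal Python sketch of folding a raw transcript into the three schemas named above. It is not DeepAgent's implementation: the names `FoldedMemory`, `fold`, `should_reconsider`, and `word_count`, the shape of a history step, and the 0.5 failure-rate threshold are all assumptions made for illustration, and simple string templates stand in for the summaries a real agent would ask its LLM to write.

```python
from dataclasses import dataclass, field


@dataclass
class FoldedMemory:
    """Hypothetical container for the three schemas (not DeepAgent's API)."""
    episodic: list[str] = field(default_factory=list)  # what occurred
    working: dict = field(default_factory=dict)        # what's relevant now
    tools: dict[str, str] = field(default_factory=dict)  # tool -> last status


def fold(history: list[dict], goal: str) -> FoldedMemory:
    """Fold a raw transcript into structured memory.

    Each step is assumed to look like
    {"tool": str, "input": str, "output": str, "ok": bool}.
    The bulky "output" field is dropped; only a one-line record survives.
    """
    mem = FoldedMemory()
    for step in history:
        status = "ok" if step["ok"] else "failed"
        mem.episodic.append(f"{step['tool']}: {status}")
        mem.tools[step["tool"]] = status
    mem.working["goal"] = goal
    mem.working["failed_steps"] = sum(1 for s in history if not s["ok"])
    mem.working["total_steps"] = len(history)
    return mem


def should_reconsider(mem: FoldedMemory, max_failure_rate: float = 0.5) -> bool:
    """Ask "is this strategy working?" from the structured view alone."""
    total = mem.working["total_steps"]
    if total == 0:
        return False
    return mem.working["failed_steps"] / total >= max_failure_rate


def word_count(text: str) -> int:
    """Crude stand-in for a tokenizer: whitespace-separated words."""
    return len(text.split())
```

In this sketch the folded record is far shorter than the raw tool outputs it replaces, and the strategy check reads only the working layer, which mirrors the two effects described above: headroom, and a vantage point.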
Why structure matters becomes clearer when you look at how memory gets organized elsewhere. RAISE breaks agent memory into four components across two time scales, conversation-level versus turn-level (see "How should agent memory split across time scales?"), and the lesson is that different kinds of memory want different update rules. Memory folding applies the same intuition to reconsideration: the working layer can be revised aggressively while episodic facts stay stable, so an agent can change course without losing the record of what it already tried.
The corpus also flags a real risk that folding is designed to dodge: naive compression degrades the memory it is supposed to preserve. SkillRL shows that you shouldn't fold successes and failures the same way: successful episodes are kept as concrete demonstrations, while failures are abstracted into lessons (see "Should successful and failed episodes be processed differently?"). That asymmetry is what makes a fold useful for strategy revision. A failed branch shouldn't just vanish into a token-saving summary; it should become a reusable "don't do this" signal. Reflexion makes the complementary point: it deliberately keeps verbal self-reflections uncompressed, because squeezing them too hard destroys the very diagnosis the agent needs to improve (see "Can agents learn from failure without updating their weights?"). So folding is a balancing act: compress the bulk, but preserve the load-bearing reflections.
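The asymmetry can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SkillRL's or Reflexion's code: `Episode` and `fold_episode` are hypothetical names, and the "lesson" is a placeholder template built from the last action, where a real system would have the model write the abstraction.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    actions: list[str]
    success: bool
    reflection: str = ""  # verbal self-diagnosis, if the agent wrote one


def fold_episode(ep: Episode) -> dict:
    """Fold one episode asymmetrically.

    Successes keep their concrete action sequence as a demonstration.
    Failures are reduced to a short "don't do this" lesson.
    In both cases the reflection is copied through verbatim, never summarized.
    """
    if ep.success:
        return {
            "kind": "demonstration",
            "content": list(ep.actions),
            "reflection": ep.reflection,
        }
    last_action = ep.actions[-1] if ep.actions else "(no actions recorded)"
    return {
        "kind": "lesson",
        # Placeholder abstraction; an LLM-written lesson is assumed in practice.
        "content": f"Avoid repeating: {last_action}",
        "reflection": ep.reflection,
    }
```

The design choice worth noticing is that only the action trace is compressed. The reflection field passes through untouched in both branches, which is the "compress the bulk, preserve the diagnosis" balance in miniature.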
There's a deeper reason mid-task reconsideration is worth engineering for. When agents train purely on numerical reward, they tend to collapse onto a single narrow strategy: entropy collapse squeezes out exploration in both reasoning and search agents (see "Does reinforcement learning squeeze exploration diversity in search agents?"). The fix that keeps showing up is richer feedback: natural-language critiques break through plateaus that numbers alone can't, because they carry information about *why* something failed (see "Can natural language feedback overcome numerical reward plateaus?"). A folded episodic memory is essentially a place to store that kind of verbal "why", which makes reconsideration possible at inference time without touching the model's weights, the same way AgentFly improves entirely through memory operations (see "Can agents learn continuously from experience without updating weights?").
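One simple way to picture entropy collapse is to measure the Shannon entropy of the strategies an agent picks across runs. The sketch below is a generic calculation, not a metric taken from any of the cited notes; `strategy_entropy` and the strategy labels are hypothetical.

```python
import math
from collections import Counter


def strategy_entropy(choices: list[str]) -> float:
    """Shannon entropy, in bits, of the empirical strategy distribution.

    Zero means every run used the same strategy (fully collapsed).
    Higher values mean the agent is still spreading over alternatives.
    """
    total = len(choices)
    if total == 0:
        return 0.0
    counts = Counter(choices)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Four strategies used equally often versus one strategy used every time.
diverse = ["search", "browse", "ask", "compute"]
collapsed = ["search", "search", "search", "search"]
print(strategy_entropy(diverse), strategy_entropy(collapsed))
```

With four equally used strategies the entropy is 2 bits; with a single repeated strategy it is 0. A drop toward zero over training is the kind of narrowing the paragraph above describes.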
One less obvious point: research on RL training dynamics suggests *when* strategy reconsideration even becomes the bottleneck. Training tends to move through two phases: first nailing execution correctness, then shifting to strategic planning, with planning-token entropy rising in that second phase (see "Does RL training follow a predictable two-phase learning sequence?"). Memory folding is most valuable precisely in that second regime. Once an agent can reliably *do* the steps, the leverage moves to *which* steps, and a compact, structured memory of the run so far is what lets it reopen that question mid-task instead of committing to its first plan.
Sources (8 notes)
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies; the autonomy and structure together avoid the degradation that plagues poorly designed consolidation.
RAISE shows that agent memory consists of four components organized by two design axes: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.