Can agents learn continuously through memory without updating weights?
Explores whether LLM agents can adapt to new tasks and recover from failures by retrieving and updating past experiences stored in memory, rather than through expensive parameter fine-tuning.
AgentFly addresses a central challenge: LLM agents either follow rigid hardcoded workflows (inflexible) or require parameter fine-tuning (expensive, impractical for continual adaptation). The alternative: learn continuously through memory, not weight updates.
The formalization is a Memory-augmented Markov Decision Process (M-MDP). The agent stores past trajectories as episodic traces — including both successes and failures — and retrieves similar past experiences to guide current decision-making. This aligns with case-based reasoning (CBR), a psychologically grounded learning strategy: humans often solve problems by recalling analogous past situations.
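One way to make the formalization concrete (the notation below is a hedged sketch, not necessarily the paper's exact symbols): the policy marginalizes over retrieved cases, and between episodes only the memory changes, never the LLM.

```latex
% Sketch of an M-MDP in generic notation (an assumption, not the paper's exact symbols):
% a standard MDP (S, A, P, R) augmented with a case bank M_t.
\pi(a_t \mid s_t, M_t)
  = \sum_{c \in M_t}
    \underbrace{\mu(c \mid s_t, M_t)}_{\text{case retrieval}}
    \,
    \underbrace{p_{\mathrm{LLM}}(a_t \mid s_t, c)}_{\text{frozen LLM reasoning}}
\qquad
\underbrace{M_{t+1} = M_t \cup \{(\mathrm{task}_t, \mathrm{plan}_t, \mathrm{outcome}_t)\}}_{\text{episodic write after each task}}
```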
Three memory modules serve distinct functions (a data-structure sketch follows the list):
Case Memory — vectorized storage of prior task trajectories (task, plan, success/failure label). Supports retrieval via similarity-based search or an online-updating Q-function. This is the strategic memory: which approaches worked for which kinds of problems.
Subtask Memory — text-based storage of active subtasks and their execution results. Orchestrates the planner-executor interaction within a single task. This is the working memory: what's being done right now.
Tool Memory — text-based logs of tool interactions scoped per subtask. Records what tools were used, what they returned. This is the procedural memory: how specific operations were executed.
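As a concrete sketch of how these three stores might be represented (field names and types here are illustrative assumptions, not AgentFly's actual schema):

```python
# Illustrative sketch of the three memory modules; names and fields are
# assumptions for exposition, not AgentFly's actual API.
from dataclasses import dataclass, field

@dataclass
class Case:
    """One Case Memory entry: strategic memory, vectorized for retrieval."""
    task: str        # task description (embedded for similarity search)
    plan: str        # the plan that was executed
    success: bool    # outcome label, rewritten after execution

@dataclass
class SubtaskRecord:
    """One Subtask Memory entry: working memory for the current task."""
    subtask: str
    result: str      # executor's reported outcome

@dataclass
class ToolCall:
    """One Tool Memory entry: procedural memory, scoped per subtask."""
    subtask_id: int
    tool: str
    args: dict
    output: str

@dataclass
class AgentMemory:
    cases: list[Case] = field(default_factory=list)               # strategic
    subtasks: list[SubtaskRecord] = field(default_factory=list)   # working
    tool_log: list[ToolCall] = field(default_factory=list)        # procedural
```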
The learning mechanism: credit assignment happens via memory rewriting (updating case labels and Q-values based on outcome), and policy improvement happens via memory reading (retrieving relevant cases that shift the planning distribution). No gradient updates to the LLM — the LLM is a fixed reasoning engine, and adaptation happens entirely through what's retrieved into its context.
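A minimal sketch of the two operations, assuming a tabular value per stored case and a similarity function passed in; all names here are illustrative, not AgentFly's API:

```python
# Sketch of learning-through-memory: "read" improves the policy by changing
# what lands in the LLM's context; "rewrite" assigns credit by updating case
# labels and values. The LLM's weights are never touched.

def read_memory(case_bank, q, sim, task, k=4):
    """Policy improvement via memory reading: rank stored cases by task
    similarity plus their learned value; return indices of the top-k cases
    to place in the planner's prompt."""
    scores = [sim(task, case["task"]) + q[i] for i, case in enumerate(case_bank)]
    return sorted(range(len(case_bank)), key=scores.__getitem__, reverse=True)[:k]

def rewrite_memory(case_bank, q, new_case, used, reward, lr=0.1):
    """Credit assignment via memory rewriting: store the finished trajectory
    with its outcome label, and nudge the value of each retrieved case toward
    the observed reward (a TD(0)-style scalar update, not a gradient step)."""
    case_bank.append(new_case)   # episodic write: successes AND failures
    q.append(0.0)                # a fresh case starts with a neutral value
    for i in used:
        q[i] += lr * (reward - q[i])

# Usage (illustrative):
#   used = read_memory(case_bank, q, sim, task="analyze the spreadsheet")
#   ... plan/execute with those cases in context, observe reward in {0, 1} ...
#   rewrite_memory(case_bank, q,
#                  {"task": "analyze the spreadsheet", "plan": "...", "success": True},
#                  used, reward=1.0)
```

The design point: the learning signal flows into the case bank and its values, so "training" amounts to list appends and scalar updates that take effect on the very next task.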
The result: top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set, in the deep research setting.
Where "Can agents learn from failure without updating their weights?" establishes the intuition, AgentFly provides the formal RL framework for it: the M-MDP formalization shows how credit assignment and policy improvement can operate entirely through memory operations. The Q-function over cases provides a principled retrieval policy that improves with experience, rather than relying on static similarity-based retrieval.
Source: Memory
Related concepts in this collection
- Can agents learn from failure without updating their weights?
  Explores whether language models can improve through trial-and-error by storing reflections in memory rather than through gradient-based parameter updates. Tests if environmental feedback alone can drive learning.
  AgentFly adds the M-MDP formalization: credit assignment via memory rewriting, policy improvement via memory reading.
- Can agents learn continuously without forgetting old skills?
  Can lifelong learning systems retain previously acquired skills while acquiring new ones? This explores whether externalizing learned behaviors as retrievable code programs, rather than parameter updates, solves catastrophic forgetting.
  VOYAGER composes skills; AgentFly composes cases. Both achieve continual learning without parameter updates.
- Can 78 demonstrations teach agency better than 10000?
  Does agentic capability depend on data volume or curation quality? LIMI achieves 73.5% on AgencyBench with 78 samples versus 24-45% for models trained on 10K+, suggesting strategic demonstration design may matter far more than scale.
  AgentFly's case bank grows from experience; the efficiency principle suggests a small number of high-quality cases may suffice.
- How do agentic AI systems decompose into adaptation paradigms?
  What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.
  AgentFly is agent-optimized, with execution-signaled feedback via memory rewriting.
Original note title: memory-based online reinforcement learning enables continual agent adaptation without fine-tuning through episodic case-based reasoning