Agentic and Multi-Agent Systems

Can agents learn continuously through memory without updating weights?

Explores whether LLM agents can adapt to new tasks and failures by retrieving and updating past experiences stored in memory, rather than requiring expensive parameter fine-tuning.

Note · 2026-02-23 · sourced from Memory

AgentFly addresses a central challenge: LLM agents either follow rigid hardcoded workflows (inflexible) or require parameter fine-tuning (expensive, impractical for continual adaptation). The alternative: learn continuously through memory, not weight updates.

The formalization is a Memory-augmented Markov Decision Process (M-MDP). The agent stores past trajectories as episodic traces — including both successes and failures — and retrieves similar past experiences to guide current decision-making. This aligns with case-based reasoning (CBR), a psychologically grounded learning strategy: humans often solve problems by recalling analogous past situations.
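One way to write this down (a sketch consistent with the description here; the notation is assumed, not quoted from the paper): extend the standard MDP tuple with a memory set, condition the policy on it, and replace the gradient step with a memory write.

```latex
% Illustrative M-MDP sketch; symbols are my assumptions, not the paper's notation.
% Standard MDP (S, A, P, R) extended with a case memory M.
\mathcal{M}\text{-MDP} = (\mathcal{S}, \mathcal{A}, P, R, \mathcal{M})
\qquad
a_t \sim \pi\bigl(\,\cdot \mid s_t, \mathcal{M}_t\bigr)
\qquad
\mathcal{M}_{t+1} = U\bigl(\mathcal{M}_t, (s_t, a_t, r_t)\bigr)
```

Here the update operator U is the memory write: adaptation lives in how M changes, not in the policy's parameters.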

Three memory modules serve distinct functions (a minimal data-model sketch follows the list):

  1. Case Memory — vectorized storage of prior task trajectories (task, plan, success/failure label). Supports retrieval via similarity-based search or an online-updating Q-function. This is the strategic memory: which approaches worked for which kinds of problems.

  2. Subtask Memory — text-based storage of active subtasks and their execution results. Orchestrates the planner-executor interaction within a single task. This is the working memory: what's being done right now.

  3. Tool Memory — text-based logs of tool interactions scoped per subtask. Records what tools were used, what they returned. This is the procedural memory: how specific operations were executed.
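A minimal data-model sketch of these three modules; all names and fields are illustrative assumptions, not AgentFly's actual interfaces:

```python
# Hypothetical data model for the three memory modules described above.
from dataclasses import dataclass

@dataclass
class Case:
    """Case Memory entry: one prior task trajectory (strategic memory)."""
    task: str
    plan: str
    success: bool            # outcome label, rewritten after execution
    embedding: list[float]   # vector used for similarity-based retrieval
    q_value: float = 0.0     # online-updated value used for Q-guided retrieval

@dataclass
class Subtask:
    """Subtask Memory entry: working state of the current task."""
    description: str
    result: str | None = None  # filled in once the executor finishes

@dataclass
class ToolCall:
    """Tool Memory entry: one tool interaction, scoped to a subtask."""
    subtask_id: int
    tool_name: str
    arguments: dict
    output: str
```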

The learning mechanism: credit assignment happens via memory rewriting (updating case labels and Q-values based on outcome), and policy improvement happens via memory reading (retrieving relevant cases that shift the planning distribution). No gradient updates to the LLM — the LLM is a fixed reasoning engine, and adaptation happens entirely through what's retrieved into its context.
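A sketch of that loop under the same assumptions as above; retrieve, execute, and embed are hypothetical helpers standing in for the retriever, the fixed LLM, and an embedding model:

```python
# Sketch of the read/write learning loop: no gradient steps, only memory ops.
ALPHA = 0.1  # step size for the online Q update (an assumed hyperparameter)

def solve(task: str, memory: list[Case]) -> bool:
    # Policy improvement via memory reading: retrieved cases go into the
    # LLM's context and shift its planning distribution.
    cases = retrieve(task, memory, k=4)
    plan, success = execute(task, context=cases)  # the fixed LLM does the reasoning

    # Credit assignment via memory rewriting: store the new labeled case and
    # nudge the Q-values of the retrieved cases toward the observed outcome.
    memory.append(Case(task, plan, success, embed(task)))
    reward = 1.0 if success else 0.0
    for c in cases:
        c.q_value += ALPHA * (reward - c.q_value)
    return success
```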

The result: top-1 on the GAIA validation set (87.88% Pass@3) and 79.40% on the test set, in the deep-research setting.

Building on the earlier note "Can agents learn from failure without updating their weights?", AgentFly provides the formal RL framework for that intuition: the M-MDP formalization shows how credit assignment and policy improvement can operate entirely through memory operations. Its Q-function over cases yields a principled retrieval policy that improves with experience, rather than relying on static similarity-based retrieval.
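A sketch of what Q-guided retrieval could look like, continuing the assumptions above; the mixing weight beta is mine, not the paper's:

```python
# Score each stored case by similarity to the current task plus its learned
# Q-value, instead of similarity alone; embed() is the hypothetical helper above.
def retrieve(task: str, memory: list[Case], k: int = 4,
             beta: float = 0.5) -> list[Case]:
    query = embed(task)
    def score(c: Case) -> float:
        # Dot product as a stand-in similarity (assumes normalized embeddings).
        sim = sum(a * b for a, b in zip(query, c.embedding))
        return sim + beta * c.q_value  # experience reshapes retrieval over time
    return sorted(memory, key=score, reverse=True)[:k]
```

With beta = 0 this degenerates to static similarity search; the Q-term is what lets retrieval itself improve as outcomes accumulate.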


Source: Memory

Original note title

memory-based online reinforcement learning enables continual agent adaptation without fine-tuning through episodic case-based reasoning