Why does credit assignment through memory rewriting avoid expensive LLM parameter updates?
This explores how some agents learn from their successes and failures by editing an external memory store instead of retraining the model's weights — and why that sidesteps the cost of parameter updates.
The cleanest answer in the corpus comes from work that reframes agent learning as operations on memory rather than gradient descent on the model. In AgentFly, learning is formalized as a 'Memory-augmented MDP': the system keeps case, subtask, and tool memories, and credit assignment (figuring out which past actions deserve reward) happens by rewriting those memories rather than by backpropagating through billions of parameters (see: Can agents learn continuously from experience without updating weights?). Because the LLM is treated as a fixed reasoning engine and all the adaptation lives in retrievable memory, the agent improves continuously without ever paying for a weight update, reaching 87.88% on GAIA validation with the base model untouched.
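The idea of credit assignment as a memory rewrite can be sketched in a few lines. This is a minimal illustration, not AgentFly's implementation: the names `Case` and `CaseMemory`, the word-overlap retrieval, and the running-average value update are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Case:
    task: str
    plan: str
    value: float = 0.0  # running average of rewards credited to this case
    uses: int = 0


@dataclass
class CaseMemory:
    cases: List[Case] = field(default_factory=list)

    def write(self, task: str, plan: str) -> Case:
        """Store a new case; this is the cheap 'write' that replaces a weight update."""
        case = Case(task, plan)
        self.cases.append(case)
        return case

    def assign_credit(self, case: Case, reward: float) -> None:
        """Credit assignment as a memory rewrite: update the case's value in place."""
        case.uses += 1
        case.value += (reward - case.value) / case.uses

    def retrieve(self, task: str) -> Optional[Case]:
        """Return the case with the most shared words, ties broken by value."""
        words = set(task.lower().split())
        scored = [
            (len(words & set(c.task.lower().split())), c.value, i)
            for i, c in enumerate(self.cases)
        ]
        scored = [s for s in scored if s[0] > 0]
        if not scored:
            return None
        return self.cases[max(scored)[2]]


memory = CaseMemory()
good = memory.write("find paper citation count", "search scholar then parse")
bad = memory.write("find paper citation count", "guess from memory")
memory.assign_credit(good, 1.0)  # the episode that used this plan succeeded
memory.assign_credit(bad, 0.0)   # the episode that used this plan failed
best = memory.retrieve("find citation count for a paper")
print(best.plan)
```

The model itself never changes here. What changes between episodes is which stored plan gets retrieved into the prompt.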
Why is that so much cheaper? A parallel note reframes the long-context bottleneck not as a memory-capacity problem but as a *compute* problem: turning experience into a model's internal 'fast weights' requires expensive consolidation passes, and performance scales with how many of those passes you run (see: Is long-context bottleneck really about memory or compute?). Memory rewriting simply refuses to pay that consolidation tax. Instead of compressing experience back into the network, it leaves the experience in an external store the model reads at inference time, trading an expensive write into weights for a cheap write into a database.
The credit-assignment piece is worth separating from the storage piece. Another note shows that computing a good credit signal doesn't depend on where the lesson ends up. MS-GRPO, which does post-train the model, assigns the cumulative episode reward to each step and uses group-relative normalization across rollouts to surface which action sequences actually worked (see: Can full episode rewards per step enable better credit assignment?). That's the same conceptual move memory rewriting makes: the signal about what to keep or discard is computed *over traces of behavior*, and where you store the resulting lesson (a weight update driven by a normalized advantage vs. a memory entry) is a separate design choice. Memory-based RL keeps the lesson outside the weights.
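The signal described above can be computed without reference to where it will be stored. The following is a rough sketch under stated assumptions: the function name `step_advantages` is hypothetical, and normalizing by the population standard deviation with a small `eps` is one common choice for group-relative schemes, not a detail confirmed by the note.

```python
from statistics import mean, pstdev
from typing import List


def step_advantages(
    episode_rewards: List[float],
    steps_per_episode: List[int],
    eps: float = 1e-8,
) -> List[List[float]]:
    """Give every step the normalized reward of the episode it belongs to.

    episode_rewards: one scalar reward per rollout in the group.
    steps_per_episode: number of steps taken in each rollout.
    """
    mu = mean(episode_rewards)
    sigma = pstdev(episode_rewards)  # assumption: population std over the group
    advantages = []
    for reward, n_steps in zip(episode_rewards, steps_per_episode):
        adv = (reward - mu) / (sigma + eps)
        advantages.append([adv] * n_steps)  # same credit for each step
    return advantages


# Four rollouts of the same task: two succeeded, two failed.
group = step_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 1])
for rollout in group:
    print(rollout)
```

A weight-updating method would feed these advantages into a policy-gradient step. A memory-based agent could instead use the same numbers to decide which traces to keep as cases and which to discard.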
There's a real ceiling here, though, and the corpus is honest about it. Self-improvement of any kind is bounded by the generation-verification gap: a model can't reliably fix itself without something external to validate the fix (see: What stops large language models from improving themselves?). Memory rewriting is appealing precisely because the memory store *is* that external scaffold: it accumulates verified outcomes the model couldn't have derived through introspection alone. This echoes the broader 'LLM as a component inside an explicit program' pattern, where control flow, state, and now memory live outside the model and the LLM is invoked only for step-specific reasoning (see: Can algorithms control LLM reasoning better than LLMs alone?).
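A small sketch of that pattern, with the loop, the verifier, and the memory all living outside the model. Everything here is hypothetical: `solve_with_verifier`, `propose`, and `verify` are illustrative names, and `propose` is a stand-in for an LLM call rather than any real API.

```python
from typing import Callable, List, Optional, Tuple

Memory = List[Tuple[str, str]]  # (task, verified answer) pairs


def solve_with_verifier(
    task: str,
    propose: Callable[[str, Memory], str],  # stand-in for an LLM call
    verify: Callable[[str, str], bool],     # external check, not the model
    memory: Memory,
    max_attempts: int = 3,
) -> Optional[str]:
    """The program owns control flow; only verified outcomes reach memory."""
    for _ in range(max_attempts):
        candidate = propose(task, memory)
        if verify(task, candidate):
            memory.append((task, candidate))
            return candidate
    return None


# Stub model: guesses wrong first, then right.
guesses = iter(["6", "5"])
memory: Memory = []
result = solve_with_verifier(
    "2+3",
    propose=lambda task, mem: next(guesses),
    verify=lambda task, answer: answer == "5",
    memory=memory,
)
print(result, memory)
```

The rejected guess never enters the store, so the memory only grows with outcomes that something other than the model has confirmed.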
The thing you might not have expected to learn: avoiding parameter updates isn't only a cost optimization; it can also be a *robustness* one. A model fine-tuned into new weights risks catastrophic interference, and frontier models already corrupt roughly a quarter of document content over long delegated workflows as small errors compound silently (see: Do frontier LLMs silently corrupt documents in long workflows?). Keeping learned experience in an inspectable, editable memory means the lesson is auditable and reversible in a way a weight change never is: you can read what the agent 'learned,' and delete it if it's wrong.
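To make "auditable and reversible" concrete, here is a minimal sketch of a memory store that logs every change so it can be inspected and undone. The class name `AuditableMemory` and its methods are assumptions for illustration, not part of any system discussed in the notes.

```python
from typing import Dict, List, Optional, Tuple


class AuditableMemory:
    """Key-value lesson store with a change log that supports undo."""

    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}
        # Each log record: (operation, key, value before the operation)
        self.log: List[Tuple[str, str, Optional[str]]] = []

    def put(self, key: str, lesson: str) -> None:
        self.log.append(("put", key, self.entries.get(key)))
        self.entries[key] = lesson

    def delete(self, key: str) -> None:
        self.log.append(("delete", key, self.entries.pop(key, None)))

    def undo(self) -> None:
        """Revert the most recent change by restoring the previous value."""
        _, key, previous = self.log.pop()
        if previous is None:
            self.entries.pop(key, None)
        else:
            self.entries[key] = previous


store = AuditableMemory()
store.put("csv-export", "always quote fields")
store.put("csv-export", "never quote fields")  # a bad lesson overwrites a good one
print(store.entries)  # the lesson is readable as plain text
store.undo()          # and the bad rewrite can be rolled back
print(store.entries)
```

There is no comparable operation for a fine-tuned model: a lesson absorbed into weights cannot be listed, read, or removed on its own.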
Sources (6 notes)
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
MS-GRPO assigns cumulative episode reward to each step, and group-relative normalization across rollouts surfaces which action sequences succeed. A 3B model post-trained this way outperforms 72B baselines by 50%, showing the training method matters more than scale for multi-step tasks.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.