Reinforcement Learning for LLMs · LLM Reasoning and Architecture · Agentic and Multi-Agent Systems

Can agents learn from failure without updating their weights?

Explores whether language models can improve through trial and error by storing reflections in memory rather than through gradient-based parameter updates. Tests whether environmental feedback alone can drive learning.

Note · 2026-02-22 · sourced from Reasoning by Reflection

Reflexion demonstrates a specific version of the external-feedback principle at system scale: when an agent has access to unambiguous binary feedback from the environment (success = 1, failure = 0), it can write verbal reflections summarizing what went wrong and how to avoid it. These reflections persist in episodic memory across episodes. The agent improves not through gradient descent but through memory accumulation.
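
A minimal sketch of this loop, assuming a hypothetical `llm(prompt) -> str` completion function and a gym-style `env` whose `step` returns a binary success flag; neither is the paper's actual interface:

```python
def run_episode(env, llm, reflections, max_actions=30):
    """One trial: the acting prompt is conditioned on reflections from past failures."""
    observation = env.reset()
    trajectory = []
    for _ in range(max_actions):
        prompt = (
            f"Task: {env.task}\n"
            f"Lessons from past failed attempts: {reflections}\n"
            f"History so far: {trajectory}\n"
            f"Current observation: {observation}\nNext action:"
        )
        action = llm(prompt)
        observation, success, done = env.step(action)
        trajectory.append((action, observation))
        if done:
            return trajectory, success
    return trajectory, False  # action budget exhausted is treated as failure


def reflexion(env, llm, episodes=10):
    """Outer loop: improvement comes from memory accumulation, never gradients."""
    reflections = []  # episodic memory, persists across episodes
    for _ in range(episodes):
        trajectory, success = run_episode(env, llm, reflections)
        if success:  # binary signal from the environment: success or failure
            return reflections, True
        reflections.append(llm(
            f"The attempt failed. Trajectory: {trajectory}\n"
            "Diagnose what went wrong and state a concrete plan to avoid it."
        ))  # "learning" = appending a verbal diagnosis to memory
    return reflections, False
```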

The binary reward design is deliberate and consequential. A richer reward model would allow the agent to rationalize partial performance — finding reasons why a partial failure was acceptable. The binary signal eliminates this: the environment says success or failure, with no room for self-serving gradations. The model must genuinely diagnose what went wrong to write a useful reflection.

Two hallucination types receive precise operational definitions: repeating the same action after the environment returned an identical observation (a stuck loop), and a trajectory exceeding 30 actions without reaching a successful state (inefficient planning). Both are detectable signatures that trigger termination and reflection instead of letting the episode continue indefinitely.
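
A sketch of those two signatures as termination checks; the two-step repetition window is an assumption (the note only specifies "consecutive identical actions"):

```python
def detect_hallucination(trajectory, max_actions=30):
    """Operational checks; `trajectory` is a list of (action, observation) pairs."""
    # Stuck loop: same action issued twice while the environment answered identically.
    if len(trajectory) >= 2 and trajectory[-1] == trajectory[-2]:
        return "stuck_loop"
    # Inefficient planning: more than 30 actions without reaching a success state.
    if len(trajectory) > max_actions:
        return "inefficient_planning"
    return None  # no signature detected: keep acting
```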

The method requires two components: a heuristic for when to terminate and trigger reflection, and a binary reward signal from the environment. This is a low-data-requirement architecture: no fine-tuning, no labeled training set, just a success/fail signal and the model's ability to generate natural language diagnoses.
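
Composing the two components, reusing `detect_hallucination` from the sketch above (this glue is my reading of the design, not the paper's code):

```python
def should_terminate_and_reflect(trajectory, done, success):
    """Terminate and trigger reflection on a binary failure from the
    environment, or on a detected hallucination signature mid-episode."""
    if done:
        return not success  # the environment's binary verdict
    return detect_hallucination(trajectory) is not None  # heuristic cutoff
```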

The key distinction from internal self-revision: Reflexion's reflection is grounded in actual environmental outcomes, not the model's assessment of its own outputs. This is why it works where internal self-assessment does not. The environment provides an independent ground truth the model cannot rationalize away.

AgentFly M-MDP formalization (arXiv:2508.16153): AgentFly extends episodic memory-based learning into a formal RL framework — the Memory-augmented Markov Decision Process (M-MDP). The agent stores past trajectories (successes and failures) in three specialized memory modules: case memory (vectorized prior trajectories with Q-values for retrieval), subtask memory (active tasks and results), and tool memory (per-subtask tool interaction logs). Credit assignment occurs via memory rewriting (updating case labels and Q-values based on outcomes), and policy improvement occurs via memory reading (retrieving relevant cases shifts the planning distribution). The Q-function over cases provides a principled retrieval policy that improves with experience — moving beyond Reflexion's simpler similarity-based episodic retrieval toward learned case selection. AgentFly achieves top-1 on GAIA validation (87.88% Pass@3) in the deep research setting, demonstrating that memory-based RL can match or exceed fine-tuning-based approaches. See Can agents learn continuously through memory without updating weights?.
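
A sketch of the case-memory half of this design, assuming a generic sentence encoder that returns unit vectors; the field names, similarity threshold, and soft Q-update rule are illustrative assumptions, not AgentFly's exact formulation:

```python
import numpy as np

class CaseMemory:
    """Sketch of M-MDP case memory: vectorized trajectories carrying Q-values."""

    def __init__(self, embed, lr=0.1, sim_threshold=0.9):
        self.embed = embed              # text -> unit vector (any sentence encoder)
        self.cases = []                 # list of [vector, trajectory, q_value]
        self.lr = lr                    # assumed Q-update step size
        self.sim_threshold = sim_threshold  # assumed similarity cutoff

    def read(self, task, k=4):
        """Memory reading = policy improvement: retrieve cases scored by
        similarity to the task, weighted by their learned Q-value."""
        v = self.embed(task)
        scored = sorted(self.cases,
                        key=lambda c: float(np.dot(v, c[0])) * c[2],
                        reverse=True)
        return [traj for _, traj, _ in scored[:k]]

    def write(self, task, trajectory, reward):
        """Memory rewriting = credit assignment: store the new outcome and
        nudge Q-values of similar stored cases toward the observed reward."""
        v = self.embed(task)
        self.cases.append([v, trajectory, float(reward)])
        for case in self.cases:
            if float(np.dot(v, case[0])) > self.sim_threshold:
                case[2] += self.lr * (reward - case[2])
```

In this sketch, failed cases settle toward Q ≈ 0 and are naturally down-weighted at retrieval time, so the retrieval policy sharpens as outcomes accumulate: the learned-selection behavior the M-MDP framing describes.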


Source: Reasoning by Reflection; enriched from Memory


verbal reflection stored as episodic memory lets agents learn from trial and error without parameter updates — the environment is the teacher