Reinforcement Learning for LLMs · LLM Reasoning and Architecture · Agentic and Multi-Agent Systems

Can agents learn from failure without updating their weights?

Explores whether language models can improve through trial and error by storing reflections in memory rather than through gradient-based parameter updates. Tests whether environmental feedback alone can drive learning.

Note · 2026-02-22 · sourced from Reasoning by Reflection

Reflexion demonstrates a specific version of the external-feedback principle at system scale: when an agent has access to unambiguous binary feedback from the environment (success = 1, failure = 0), it can write verbal reflections summarizing what went wrong and how to avoid it. These reflections persist in episodic memory across episodes. The agent improves not through gradient descent but through memory accumulation.
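
A minimal sketch of this loop, assuming a hypothetical `llm(prompt) -> str` completion function and a gym-style `env` whose `step` returns a binary success flag; neither is the paper's actual interface:

```python
def run_episode(env, llm, reflections, max_actions=30):
    """One trial: the acting prompt is conditioned on reflections from past failures."""
    observation = env.reset()
    trajectory = []
    for _ in range(max_actions):
        prompt = (
            f"Task: {env.task}\n"
            f"Lessons from past failed attempts: {reflections}\n"
            f"History so far: {trajectory}\n"
            f"Current observation: {observation}\nNext action:"
        )
        action = llm(prompt)
        observation, success, done = env.step(action)
        trajectory.append((action, observation))
        if done:
            return trajectory, success
    return trajectory, False  # action budget exhausted is treated as failure


def reflexion(env, llm, episodes=10):
    """Outer loop: improvement comes from memory accumulation, never gradients."""
    reflections = []  # episodic memory, persists across episodes
    for _ in range(episodes):
        trajectory, success = run_episode(env, llm, reflections)
        if success:  # binary signal from the environment: success or failure
            return reflections, True
        reflections.append(llm(
            f"The attempt failed. Trajectory: {trajectory}\n"
            "Diagnose what went wrong and state a concrete plan to avoid it."
        ))  # "learning" = appending a verbal diagnosis to memory
    return reflections, False
```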

The binary reward design is deliberate and consequential. A richer reward model would allow the agent to rationalize partial performance — finding reasons why a partial failure was acceptable. The binary signal eliminates this: the environment says success or failure, with no room for self-serving gradations. The model must genuinely diagnose what went wrong to write a useful reflection.

Two hallucination types receive precise operational definitions: repeating the same action after the environment returned an identical observation (a stuck loop), and a trajectory exceeding 30 actions without reaching a successful state (inefficient planning). Both are detectable signatures that trigger termination and reflection instead of letting the episode continue indefinitely.
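
A sketch of those two signatures as termination checks; the two-step repetition window is an assumption (the note only specifies "consecutive identical actions"):

```python
def detect_hallucination(trajectory, max_actions=30):
    """Operational checks; `trajectory` is a list of (action, observation) pairs."""
    # Stuck loop: same action issued twice while the environment answered identically.
    if len(trajectory) >= 2 and trajectory[-1] == trajectory[-2]:
        return "stuck_loop"
    # Inefficient planning: more than 30 actions without reaching a success state.
    if len(trajectory) > max_actions:
        return "inefficient_planning"
    return None  # no signature detected: keep acting
```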

The method requires two components: a heuristic for when to terminate and trigger reflection, and a binary reward signal from the environment. This is a low-data-requirement architecture: no fine-tuning, no labeled training set, just a success/fail signal and the model's ability to generate natural language diagnoses.
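
Composing the two components, reusing `detect_hallucination` from the sketch above (this glue is my reading of the design, not the paper's code):

```python
def should_terminate_and_reflect(trajectory, done, success):
    """Terminate and trigger reflection on a binary failure from the
    environment, or on a detected hallucination signature mid-episode."""
    if done:
        return not success  # the environment's binary verdict
    return detect_hallucination(trajectory) is not None  # heuristic cutoff
```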

The key distinction from internal self-revision: Reflexion's reflection is grounded in actual environmental outcomes, not the model's assessment of its own outputs. This is why it works where internal self-assessment does not. The environment provides an independent ground truth the model cannot rationalize away.

AgentFly M-MDP formalization (arXiv:2508.16153): AgentFly extends episodic memory-based learning into a formal RL framework — the Memory-augmented Markov Decision Process (M-MDP). The agent stores past trajectories (successes and failures) in three specialized memory modules: case memory (vectorized prior trajectories with Q-values for retrieval), subtask memory (active tasks and results), and tool memory (per-subtask tool interaction logs). Credit assignment occurs via memory rewriting (updating case labels and Q-values based on outcomes), and policy improvement occurs via memory reading (retrieving relevant cases shifts the planning distribution). The Q-function over cases provides a principled retrieval policy that improves with experience — moving beyond Reflexion's simpler similarity-based episodic retrieval toward learned case selection. AgentFly achieves top-1 on GAIA validation (87.88% Pass@3) in the deep research setting, demonstrating that memory-based RL can match or exceed fine-tuning-based approaches. See Can agents learn continuously through memory without updating weights?.
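
A sketch of the case-memory half of this design, assuming a generic sentence encoder that returns unit vectors; the field names, similarity threshold, and soft Q-update rule are illustrative assumptions, not AgentFly's exact formulation:

```python
import numpy as np

class CaseMemory:
    """Sketch of M-MDP case memory: vectorized trajectories carrying Q-values."""

    def __init__(self, embed, lr=0.1, sim_threshold=0.9):
        self.embed = embed              # text -> unit vector (any sentence encoder)
        self.cases = []                 # list of [vector, trajectory, q_value]
        self.lr = lr                    # assumed Q-update step size
        self.sim_threshold = sim_threshold  # assumed similarity cutoff

    def read(self, task, k=4):
        """Memory reading = policy improvement: retrieve cases scored by
        similarity to the task, weighted by their learned Q-value."""
        v = self.embed(task)
        scored = sorted(self.cases,
                        key=lambda c: float(np.dot(v, c[0])) * c[2],
                        reverse=True)
        return [traj for _, traj, _ in scored[:k]]

    def write(self, task, trajectory, reward):
        """Memory rewriting = credit assignment: store the new outcome and
        nudge Q-values of similar stored cases toward the observed reward."""
        v = self.embed(task)
        self.cases.append([v, trajectory, float(reward)])
        for case in self.cases:
            if float(np.dot(v, case[0])) > self.sim_threshold:
                case[2] += self.lr * (reward - case[2])
```

In this sketch, failed cases settle toward Q ≈ 0 and are naturally down-weighted at retrieval time, so the retrieval policy sharpens as outcomes accumulate: the learned-selection behavior the M-MDP framing describes.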


Source: Reasoning by Reflection; enriched from Memory


verbal reflection stored as episodic memory lets agents learn from trial and error without parameter updates — the environment is the teacher