ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

Paper · arXiv 2509.25140
LLM Memory · LLM Agents · Reasoning Architectures · Reinforcement Learning

As large language model agents are increasingly adopted in persistent real-world roles, they naturally encounter continuous streams of tasks. A key limitation, however, is their failure to learn from the accumulated interaction history, forcing them to discard valuable insights and repeat past errors. We propose ReasoningBank, a novel memory framework that distills generalizable reasoning strategies from an agent's self-judged successful and failed experiences. At test time, an agent retrieves relevant memories from ReasoningBank to inform its interaction and then integrates new learnings back, enabling it to become more capable over time. Building on this powerful experience learner, we further introduce memory-aware test-time scaling (MaTTS), which accelerates and diversifies this learning process by scaling up the agent's interaction experience. By allocating more compute to each task, the agent generates abundant, diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory. The better memory in turn guides more effective scaling, establishing a powerful synergy between memory and test-time scaling. Across web browsing and software engineering benchmarks, ReasoningBank consistently outperforms existing memory mechanisms that store raw trajectories or only successful task routines, improving both effectiveness and efficiency; MaTTS further amplifies these gains. These findings establish memory-driven experience scaling as a new scaling dimension, enabling agents to self-evolve, with emergent behaviors arising naturally.

Memory has emerged as an essential module in modern agent systems, enhancing their performance by drawing on past information. Existing memory systems organize and store information in various forms, including plain text, latent embeddings, and structured graphs. To access and maintain memory content, these methods typically combine retrieval mechanisms (e.g., semantic search) with memory management strategies (e.g., updating). More recently, with the growing development of reinforcement learning (RL) in LLM agents, RL has also been leveraged for memory management in agent systems. While most efforts primarily emphasize personalization and long-context management, this paper belongs to the line of research on learning from past experiences as memory, a critical aspect of developing self-evolving agent systems. Unlike previous works that emphasize reusing successful trajectories, procedural workflows, or instance-level concepts, ReasoningBank stores high-level strategies and reasoning hints. By abstracting experiences into reusable reasoning units, ReasoningBank enables agents to generalize not only from successful cases but also from failures, providing richer guidance for test-time learning.

ReasoningBank distills and organizes memory items from both successful and failed experiences, as judged by the agent itself without ground-truth labels. It captures not only effective strategies from successes but also crucial preventative lessons from failures, abstracting them into a collection of actionable principles. This process operates in a closed loop: when facing a new task, the agent retrieves relevant memories from ReasoningBank to guide its actions. Afterward, the new experience is analyzed, distilled, and consolidated back into ReasoningBank, allowing the agent to continuously evolve and improve its strategic capabilities.
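The closed loop described above (retrieve, act, self-judge, distill, consolidate) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `MemoryItem`, `ReasoningBankSketch`, and `run_task`, the fields of a memory item, and the word-overlap retrieval (a stand-in for semantic search) are all assumptions made for illustration. The `act`, `judge`, and `distill` callables are hypothetical placeholders for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MemoryItem:
    # Hypothetical fields; the source only says items are strategy-level
    # principles drawn from successes and failures.
    title: str
    content: str
    from_success: bool


class ReasoningBankSketch:
    """Toy memory store; an assumption-laden illustration only."""

    def __init__(self) -> None:
        self.items: List[MemoryItem] = []

    def retrieve(self, task: str, k: int = 2) -> List[MemoryItem]:
        # Stand-in for semantic search: rank items by word overlap with the task.
        task_words = set(task.lower().split())

        def overlap(item: MemoryItem) -> int:
            text = (item.title + " " + item.content).lower()
            return len(task_words & set(text.split()))

        ranked = sorted(self.items, key=overlap, reverse=True)
        return [m for m in ranked[:k] if overlap(m) > 0]

    def consolidate(self, new_items: List[MemoryItem]) -> None:
        # Simplest possible consolidation: append new items.
        self.items.extend(new_items)


def run_task(
    bank: ReasoningBankSketch,
    task: str,
    act: Callable[[str, List[MemoryItem]], str],
    judge: Callable[[str, str], bool],
    distill: Callable[[str, str, bool], List[MemoryItem]],
) -> bool:
    memories = bank.retrieve(task)          # 1. retrieve relevant memories
    trajectory = act(task, memories)        # 2. act, guided by those memories
    success = judge(task, trajectory)       # 3. self-judge, no ground-truth label
    bank.consolidate(distill(task, trajectory, success))  # 4. distill and store
    return success
```

Note that `distill` is called for failed trajectories as well as successful ones, which reflects the point that lessons from failures are stored alongside strategies from successes.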

We introduce ReasoningBank, a memory framework that distills strategy-level reasoning signals from both successes and failures and integrates them with test-time scaling through MaTTS. Extensive experiments show that ReasoningBank consistently improves performance while reducing redundant exploration. Further results reveal a strong synergy between memory and scaling: ReasoningBank guides scaling toward more promising rollouts, while diverse rollouts enrich memory with valuable contrastive signals. We also provide analyses of individual components and emergent behaviors. Our findings suggest a practical pathway toward building adaptive and lifelong-learning agents; additional future directions and limitations are discussed in Appendices D and E.
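The interplay between scaling and memory can be sketched as follows, under stated assumptions. Sampling several rollouts for one task is used here as one simple way to allocate more compute per task; the function name `matts_sketch` and the `rollout`, `judge`, and `contrast` callables are hypothetical placeholders for LLM calls, not an API from the paper.

```python
from typing import Callable, List


def matts_sketch(
    task: str,
    memories: List[str],
    rollout: Callable[[str, List[str], int], str],
    judge: Callable[[str, str], bool],
    contrast: Callable[[str, List[str], List[str]], List[str]],
    k: int = 3,
) -> List[str]:
    """Generate k memory-guided rollouts and distill memory from their contrast."""
    # Memory guides every rollout; the seed stands in for sampling diversity.
    trajectories = [rollout(task, memories, seed) for seed in range(k)]

    # Self-judge each rollout (no ground-truth labels assumed).
    successes = [t for t in trajectories if judge(task, t)]
    failures = [t for t in trajectories if not judge(task, t)]

    # Comparing successes against failures is the contrastive signal
    # used to synthesize new memory items.
    return contrast(task, successes, failures)
```

The returned items would then be consolidated into the memory store, so that later tasks start their rollouts from better guidance.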