The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Paper · arXiv 2509.02547
Reinforcement Learning · LLM Agents · Task Planning · Reward Models · Action Models

The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM RL with the temporally extended Partially Observable Markov Decision Processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior.
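The contrast between the two formulations can be sketched in standard notation. The symbols below follow the usual MDP/POMDP conventions and are illustrative assumptions, not necessarily the survey's exact notation: s_0 is the prompt, a is a complete response, o_t is the observation at step t, and \gamma is a discount factor.

```latex
% Conventional LLM RL: a degenerate, single-step MDP (horizon T = 1).
% The policy emits one complete response a and receives one scalar reward.
\[
  J_{\text{LLM RL}}(\theta)
    = \mathbb{E}_{s_0 \sim \mathcal{D},\; a \sim \pi_\theta(\cdot \mid s_0)}
      \bigl[\, r(s_0, a) \,\bigr]
\]

% Agentic RL: a temporally extended POMDP
% \langle \mathcal{S}, \mathcal{A}, P, R, \gamma, \mathcal{O} \rangle.
% The agent conditions on its observation-action history, not the true state,
% and accumulates reward over T > 1 steps.
\[
  J_{\text{Agentic RL}}(\theta)
    = \mathbb{E}_{\tau \sim \pi_\theta}
      \Bigl[\, \sum_{t=0}^{T-1} \gamma^{t}\, R(s_t, a_t) \,\Bigr],
  \qquad
  a_t \sim \pi_\theta(\cdot \mid o_{\le t},\, a_{<t})
\]
```

Setting T = 1 and o_0 = s_0 in the second objective recovers the first, which is the sense in which the single-step case is degenerate.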

Agentic Reinforcement Learning (Agentic RL) refers to a paradigm in which LLMs are conceptualized as learnable policies embedded within sequential decision-making loops, rather than treated as static conditional generators optimized for single-turn output alignment or benchmark performance. In this paradigm, RL endows LLMs with autonomous agentic capabilities such as planning, reasoning, tool use, memory maintenance, and self-reflection, enabling long-horizon cognitive and interactive behaviors to emerge in partially observable, dynamic environments.

Two competing explanations have emerged for why RL appears to boost LLM reasoning. The "amplifier" view holds that RL with verifiable rewards (often instantiated via PPO-style variants such as GRPO) mainly reshapes the base model's output distribution: by sampling multiple trajectories and rewarding the verifiably correct ones, RL concentrates probability mass on already-reachable reasoning paths, improving pass@1 while leaving the support of solutions largely unchanged. By contrast, the "new-knowledge" view argues that RL after next-token prediction can install qualitatively new computation by leveraging sparse outcome-level signals and encouraging longer test-time computation. Theory shows that RL enables generalization on problems where next-token training alone is statistically or computationally prohibitive. Empirically, RL can improve generalization to out-of-distribution rule and visual variants; induce cognitive behaviors (verification, backtracking, subgoal setting) that were absent in the base model yet predict self-improvement; and, in under-exposed domains, even expand the base model's pass@k frontier. Whether RL can truly endow LLMs with abilities beyond those acquired during pre-training remains an open question, and its underlying learning mechanisms are not yet fully understood.
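The two quantities at the center of this debate can be made concrete with a minimal sketch. It assumes a GRPO-style group-normalized advantage (each sampled response is scored against the mean and standard deviation of its own group) and the widely used unbiased pass@k estimator. The function names and the example rewards are hypothetical and are not taken from the survey.

```python
from math import comb
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group.

    Assumption: the advantage is (r - group mean) / (group std + eps),
    the form commonly used in GRPO-style training.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical data: eight sampled answers to one prompt,
# scored by a verifier (1 = correct, 0 = incorrect).
rewards = [1, 0, 0, 0, 1, 0, 0, 0]
advantages = group_relative_advantages(rewards)

# Correct samples get positive advantages and incorrect ones negative,
# so the update shifts probability mass toward answers the model can
# already produce.
print(advantages)
print(pass_at_k(n=8, c=2, k=1), pass_at_k(n=8, c=2, k=4))
```

Under the "amplifier" view, training of this kind would be expected to raise pass@1 while leaving pass@k at large k roughly where the base model already was. Under the "new-knowledge" view, pass@k itself could rise in some domains.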

Agentic RL transforms memory modules from passive data stores into dynamic, RL-controlled subsystems that decide what to store, when to retrieve, and how to forget, much as humans do. Early systems treated memory as an external datastore; when RL was employed at all, it solely regulated when to perform queries. RL was then incorporated into the memory management pipeline as a functional component. Later work introduced models equipped with explicit, trainable memory controllers, enabling agents to regulate their own memory states without relying on fixed, external memory systems. The most advanced paradigm treats memory itself as an RL-optimizable component, where both the retrieval policy and the memory content are jointly trained to maximize long-horizon task performance.
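To illustrate what it means for memory operations to be part of the agent's action space, the following toy sketch exposes store, retrieve, and forget as discrete actions. Everything here (MemoryOp, MemoryEnv, rollout) is a hypothetical construction for illustration, not an interface from any system the survey covers. In a trained agent, the fixed action sequence would be replaced by the policy's own choices, and a task-level reward would drive learning.

```python
from dataclasses import dataclass, field
from enum import Enum


class MemoryOp(Enum):
    """Hypothetical memory actions available to the policy."""
    STORE = "store"
    RETRIEVE = "retrieve"
    FORGET = "forget"
    NOOP = "noop"


@dataclass
class MemoryEnv:
    """Toy memory whose contents change only through policy-chosen operations."""
    slots: dict = field(default_factory=dict)

    def step(self, op, key=None, value=None):
        """Apply one memory operation and return what the agent observes."""
        if op is MemoryOp.STORE:
            self.slots[key] = value
            return None
        if op is MemoryOp.RETRIEVE:
            return self.slots.get(key)
        if op is MemoryOp.FORGET:
            self.slots.pop(key, None)
            return None
        return None  # NOOP leaves memory untouched


def rollout(env, actions):
    """Replay a fixed action sequence; a trained policy would choose these."""
    return [env.step(*action) for action in actions]


env = MemoryEnv()
observations = rollout(env, [
    (MemoryOp.STORE, "user_name", "Ada"),
    (MemoryOp.RETRIEVE, "user_name"),
    (MemoryOp.FORGET, "user_name"),
    (MemoryOp.RETRIEVE, "user_name"),
])
print(observations)
```

The design point is that the stored content is determined entirely by the sequence of actions, so an outcome-level reward on the final task can in principle shape what gets written and deleted, not only when a query is issued.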

This survey has charted the emergence of Agentic Reinforcement Learning (Agentic RL), a paradigm that elevates LLMs from passive text generators to autonomous, decision-making agents situated in complex, dynamic worlds. Our journey began by formalizing this conceptual shift, distinguishing the temporally extended and partially observable MDPs (POMDPs) that characterize Agentic RL from the single-step decision processes of conventional RL for LLMs. From this foundation, we constructed a comprehensive, twofold taxonomy to systematically map the field: one centered on core agentic capabilities (planning, tool use, memory, reasoning, self-improvement, perception, etc.) and the other on their application across a diverse array of task domains. Throughout this analysis, our central thesis has been that RL provides the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. By consolidating the landscape of open-source environments, benchmarks, and frameworks, we have also provided a practical compendium to ground and accelerate future research in this burgeoning field.

Each training example consists of two scientific ideas represented by their titles and abstracts [56, 57], with a binary label indicating which one has higher relative citations. We refer to the resulting dataset as SciJudgeBench, which transforms community feedback into pairwise supervision signals, enabling scalable preference learning.