How does treating LLMs as multi-step agents change what we can optimize?

Instead of optimizing single prompt-response pairs, what happens when we model LLM agents as temporally-extended decision processes? The question matters because it shifts what becomes trainable.

Note · 2026-05-18 · sourced from Reinforcement Learning

The Agentic RL survey (2509.02547) names a shift the field had already been making implicitly. Conventional RL applied to LLMs treated each prompt-response pair as a degenerate single-step Markov Decision Process (MDP): one observation, one action, one terminal reward. This works for math problems and short code generation, but it mismatches what agents actually do: act over many turns in environments that respond, observe partial state, accumulate consequences, recover from errors, and refine plans.
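
The contrast between the two training objectives can be written roughly as follows. The notation is generic RL notation chosen for illustration and is an assumption here, not the survey's own symbols: x is a prompt, y a response, r a reward, \tau a multi-turn trajectory of length T, and \gamma a discount factor.

```latex
% Single-step setting: one prompt x, one response y, one terminal reward
J_{\text{single}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \big[ r(x, y) \big]

% Multi-turn setting: trajectory \tau = (o_0, a_0, \dots, o_{T-1}, a_{T-1})
J_{\text{agent}}(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \Big[ \sum_{t=0}^{T-1} \gamma^{t} r_t \Big]
```

Setting T = 1 (with o_0 = x and a_0 = y) reduces the second expression to the first, which is the sense in which the single-step case is degenerate.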

Agentic RL reframes the setup as a Partially Observable Markov Decision Process (POMDP). The agent observes only part of the environment state. Actions produce environmental responses that update the agent's belief about state. The reward is sparse and delayed across temporally-extended sequences. The LLM is no longer a static conditional generator — it is a learnable policy embedded within sequential decision-making loops.
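
To make the loop concrete, here is a minimal sketch, assuming a hypothetical toy environment (`HiddenNumberEnv`) and a hand-written policy (`IntervalBeliefPolicy`); neither comes from the survey. The environment hides a target number and returns only "higher" or "lower" hints (partial observation), reward arrives only on success (sparse and delayed), and the policy keeps a belief over the hidden state. The policy is scripted, not learned; the point is only the interaction loop that an RL method would optimize over.

```python
import random


class HiddenNumberEnv:
    """Hypothetical toy POMDP: the target is hidden; the agent only sees hints."""

    def __init__(self, n=16, max_turns=8):
        self.n, self.max_turns = n, max_turns

    def reset(self, seed=None):
        self.target = random.Random(seed).randrange(self.n)  # hidden state
        self.turns = 0
        return "start"  # the first observation reveals nothing about the state

    def step(self, guess):
        self.turns += 1
        if guess == self.target:
            return "correct", 1.0, True  # sparse reward, only on success
        obs = "higher" if guess < self.target else "lower"
        return obs, 0.0, self.turns >= self.max_turns


class IntervalBeliefPolicy:
    """Scripted policy: tracks an interval of still-possible targets and bisects it."""

    def __init__(self, n):
        self.lo, self.hi = 0, n - 1
        self.last = None

    def act(self, obs):
        # Belief update: each hint rules out part of the interval.
        if obs == "higher":
            self.lo = self.last + 1
        elif obs == "lower":
            self.hi = self.last - 1
        self.last = (self.lo + self.hi) // 2
        return self.last


def rollout(env, policy, seed=None, gamma=1.0):
    """Run one episode; return (discounted return, number of turns taken)."""
    obs, done, ret, t = env.reset(seed), False, 0.0, 0
    while not done:
        action = policy.act(obs)
        obs, reward, done = env.step(action)
        ret += (gamma ** t) * reward
        t += 1
    return ret, t


if __name__ == "__main__":
    print(rollout(HiddenNumberEnv(), IntervalBeliefPolicy(16), seed=0))
    print(rollout(HiddenNumberEnv(max_turns=1), IntervalBeliefPolicy(16), seed=0))
```

Setting `max_turns=1` collapses the loop to the single-step case: the policy gets one guess and no chance to use the environment's feedback.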

The downstream consequence is what makes the reframing load-bearing: all agentic capabilities become RL-optimizable subsystems rather than fixed heuristic modules. The survey makes this concrete for memory specifically. Early systems treated memory as an external datastore — when RL touched it at all, it only regulated query timing. Later, RL was incorporated as a functional component (deciding when to retrieve, when to write). Most recently, the memory itself becomes RL-optimizable: both the retrieval policy and the memory content are jointly trained to maximize long-horizon task performance. The same trajectory applies to planning, tool use, reasoning, self-improvement, and perception.
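
As a rough illustration of the middle stage only (RL deciding when to retrieve), the sketch below trains a retrieval gate with a REINFORCE-style update. Everything in it is an assumption made for illustration: the toy task, the `RETRIEVAL_COST` value, the function names, and the hyperparameters. It is not the survey's method, and it does not cover the final stage, where memory content is trained jointly with the retrieval policy. Each episode is also collapsed to a single decision to keep the example short.

```python
import math
import random

RETRIEVAL_COST = 0.2  # hypothetical cost of one memory lookup


def episode_reward(needs_memory: bool, retrieve: bool) -> float:
    """Toy task: the answer is correct if memory was consulted or was not needed."""
    correct = retrieve or not needs_memory
    return (1.0 if correct else 0.0) - (RETRIEVAL_COST if retrieve else 0.0)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_retrieval_gate(steps=5000, lr=0.2, seed=0):
    """Learn P(retrieve | context) for two context types via REINFORCE."""
    rng = random.Random(seed)
    theta = {True: 0.0, False: 0.0}     # one logit per context (needs_memory or not)
    baseline = {True: 0.0, False: 0.0}  # running mean reward, reduces variance
    for _ in range(steps):
        needs_memory = rng.random() < 0.5
        p = sigmoid(theta[needs_memory])
        retrieve = rng.random() < p
        reward = episode_reward(needs_memory, retrieve)
        advantage = reward - baseline[needs_memory]
        # For a Bernoulli policy with logit theta, d/dtheta log pi(a) = a - p.
        grad_log_pi = (1.0 if retrieve else 0.0) - p
        theta[needs_memory] += lr * advantage * grad_log_pi
        baseline[needs_memory] += 0.05 * (reward - baseline[needs_memory])
    return {ctx: sigmoid(t) for ctx, t in theta.items()}


if __name__ == "__main__":
    probs = train_retrieval_gate()
    print("P(retrieve | memory needed)     =", round(probs[True], 3))
    print("P(retrieve | memory not needed) =", round(probs[False], 3))
```

Under these assumptions the gate should drift toward retrieving when the context needs memory and skipping the lookup otherwise, because the reward alone carries the signal; no rule about when to retrieve is hard-coded.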

This survey-level framing is the structural complement to the empirical convergence visible elsewhere in the late-2025 literature. As discussed in "Can agents learn continuously from experience without updating weights?", AgentFly's M-MDP makes memory a learnable substrate; ReasoningBank distills strategies under RL; SkillRL evolves a skill library through recursive RL. Each is a specific instance of the general pattern the Agentic RL survey names: RL is the mechanism for transforming any capability from a static module into adaptive behavior.

The competing-explanations debate the survey surfaces is also worth carrying forward: does RL amplify already-reachable reasoning paths (the "amplifier view") or install qualitatively new computation (the "new-knowledge view")? Both have empirical support. The survey's resolution is domain-conditional: RL appears to amplify existing capability in standard reasoning settings but to expand it in under-exposed domains and complex planning tasks. The related note "Does reinforcement learning create new reasoning abilities or activate existing ones?" reaches a compatible conclusion.

For knowledge architecture: the POMDP framing is the lens through which every late-2025 agent-RL paper should be read.

Paper: The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Related concepts in this collection

Can agents learn continuously from experience without updating weights? This explores whether LLM agents can adapt to new tasks and failures by retrieving past experiences from memory alone, rather than requiring expensive parameter fine-tuning or rigid hardcoded rules.
AgentFly is the M-MDP instantiation of agentic-RL's memory-as-optimizable claim
Can agents learn better from their failures than successes? Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
ReasoningBank instantiates the strategic-memory-as-RL-target pattern
Should successful and failed episodes be processed differently? Explores whether asymmetric treatment of trajectories—preserving successes as full demonstrations while abstracting failures into lessons—could improve both the utility and efficiency of memory in reinforcement learning agents.
SkillRL instantiates the skill-library-as-RL-target pattern with differential processing
Does reinforcement learning create new reasoning abilities or activate existing ones? RL post-training might either unlock latent capabilities in base models or genuinely create novel strategies. Understanding which happens under what conditions clarifies how to invest in model training effectively.
resolution of the amplifier-vs-new-knowledge debate the Agentic RL survey foregrounds
Can three axes replace the short-term long-term memory split? Does breaking agent memory into forms, functions, and dynamics provide a clearer framework than the traditional short-term/long-term distinction? This matters because current agent-memory literature lacks a unified vocabulary, making comparison between systems nearly impossible.
the Memory survey provides the taxonomy that fits inside the Agentic RL POMDP frame

Original note title

agentic RL reframes LLMs from single-step degenerate MDPs to temporally-extended POMDPs — memory shifts from passive datastore to RL-optimizable subsystem
