How does treating LLMs as multi-step agents change what we can optimize?

Instead of optimizing single prompt-response pairs, what happens when we model LLM agents as temporally-extended decision processes? The question matters because it shifts what becomes trainable.

Note · 2026-05-18 · sourced from Reinforcement Learning

The Agentic RL survey (arXiv:2509.02547) gives a name to what the field had been doing implicitly. Conventional RL applied to LLMs treated each prompt-response pair as a degenerate single-step Markov Decision Process: one observation, one action, one terminal reward. This works for math problems and short code generation but fundamentally mismatches what agents actually do: act over many turns in environments that respond, observe partial state, accumulate consequences, recover from errors, and refine plans.
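In symbols, the single-step setup can be written roughly as below. The notation is an assumption chosen for illustration (a prompt distribution D, a policy with parameters theta, a scalar reward r) and is not necessarily the survey's own:

```latex
J_{\text{single}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \big[\, r(x, y) \,\big]
```

The horizon is one: nothing the model emits changes what it observes next, so there is no state transition and no intermediate consequence to learn from.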

Agentic RL reframes the setup as a Partially Observable Markov Decision Process (POMDP). The agent observes only part of the environment state. Actions produce environmental responses that update the agent's belief about state. The reward is sparse and delayed across temporally-extended sequences. The LLM is no longer a static conditional generator — it is a learnable policy embedded within sequential decision-making loops.
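A minimal sketch of what that loop looks like, assuming a toy environment. Every name here (`HiddenNumberEnv`, `bisect_policy`, `rollout`) is hypothetical and invented for illustration; the hand-written policy merely stands in for an LLM. The agent never sees the hidden state, only feedback on its actions, and reward arrives only at the end:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HiddenNumberEnv:
    """Toy partially observable environment (hypothetical).

    The hidden state is `secret`; the agent only receives
    'higher' / 'lower' / 'correct' observations, never the state itself.
    """
    secret: int
    max_turns: int = 8
    turn: int = 0

    def reset(self) -> str:
        self.turn = 0
        return "start"

    def step(self, guess: int) -> Tuple[str, float, bool]:
        self.turn += 1
        if guess == self.secret:
            return "correct", 1.0, True  # sparse reward: only on success
        obs = "higher" if guess < self.secret else "lower"
        return obs, 0.0, self.turn >= self.max_turns


def bisect_policy(history: List[Tuple[int, str]]) -> int:
    """Stand-in for an LLM policy: maps the action-observation history
    (a proxy for the agent's belief about the state) to the next action."""
    lo, hi = 0, 100
    for guess, obs in history:
        if obs == "higher":
            lo = guess + 1
        elif obs == "lower":
            hi = guess - 1
    return (lo + hi) // 2


def rollout(env, policy, gamma: float = 0.99):
    """Run one multi-turn episode and return its discounted return."""
    env.reset()
    history, rewards, done = [], [], False
    while not done:
        action = policy(history)              # act on belief, not on state
        obs, reward, done = env.step(action)  # the environment responds
        history.append((action, obs))         # belief update via history
        rewards.append(reward)
    discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
    return history, rewards, discounted_return
```

Calling `rollout(HiddenNumberEnv(secret=37), bisect_policy)` yields a trajectory in which every reward but the last is zero. An RL method would have to assign credit for that final reward back across the earlier turns, which is exactly what the single-step framing cannot express.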

The downstream consequence is what makes the reframing load-bearing: all agentic capabilities become RL-optimizable subsystems rather than fixed heuristic modules. The survey makes this concrete for memory specifically. Early systems treated memory as an external datastore — when RL touched it at all, it only regulated query timing. Later, RL was incorporated as a functional component (deciding when to retrieve, when to write). Most recently, the memory itself becomes RL-optimizable: both the retrieval policy and the memory content are jointly trained to maximize long-horizon task performance. The same trajectory applies to planning, tool use, reasoning, self-improvement, and perception.
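One way to picture the memory progression is to log memory operations as actions in the same trajectory as environment actions, so that a delayed task reward assigns credit to them too. The sketch below is an illustrative assumption, not the survey's or any cited system's implementation; the `Step` type, the payload strings, and the `returns_to_go` helper are all hypothetical:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str            # "retrieve", "write", or "act" (hypothetical action types)
    payload: str         # what was queried, stored, or executed
    reward: float = 0.0  # zero everywhere except where the task is scored


def returns_to_go(steps: List[Step], gamma: float = 1.0) -> List[float]:
    """Discounted return from each step onward.

    Memory steps earn no immediate reward, but they inherit credit from
    the eventual task outcome, which is the signal a policy-gradient
    method would use to train retrieval and write decisions.
    """
    running, out = 0.0, []
    for step in reversed(steps):
        running = step.reward + gamma * running
        out.append(running)
    return list(reversed(out))


# A made-up four-step episode mixing memory operations and tool actions.
trajectory = [
    Step("retrieve", "query: earlier failures on this task"),
    Step("act", "call a tool"),
    Step("write", "note: which tool call resolved the error"),
    Step("act", "submit answer", reward=1.0),
]
credit = returns_to_go(trajectory, gamma=0.5)
```

Here the `retrieve` and `write` steps receive nonzero credit even though their immediate reward is zero. In a system where memory is only an external datastore, those steps would sit outside the trajectory and receive no learning signal at all.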

This survey-level framing is the structural complement to the empirical convergence visible elsewhere in the late-2025 literature. As covered in "Can agents learn continuously from experience without updating weights?", AgentFly's M-MDP makes memory a learnable substrate; ReasoningBank distills strategies under RL; SkillRL evolves a skill library through recursive RL. Each is a specific instance of the general pattern the Agentic RL survey names: RL is the mechanism for transforming any capability from a static module into adaptive behavior.

The competing-explanations debate the survey surfaces is also worth carrying forward: does RL amplify already-reachable reasoning paths (the "amplifier view") or install qualitatively new computation (the "new-knowledge view")? Both have empirical support. The survey's resolution is domain-conditional: RL appears to amplify in standard reasoning settings but to expand capability in under-exposed domains and complex planning tasks. This aligns with "Does reinforcement learning create new reasoning abilities or activate existing ones?"

For knowledge architecture: the POMDP framing is the lens through which every late-2025 agent-RL paper should be read.


Paper: The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Original note title

agentic RL reframes LLMs from single-step degenerate MDPs to temporally-extended POMDPs — memory shifts from passive datastore to RL-optimizable subsystem