Reinforced Language Models for Sequential Decision Making

Paper · arXiv 2508.10839 · Published August 14, 2025
Reinforcement Learning · Task Planning · RLVR

Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited by a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TMSG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training a 3-billion-parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective at improving decision-making performance: our post-trained 3B-parameter model outperforms a 72B-parameter baseline by 50% on the Frozen Lake task.

Furthermore, a conceptual limitation arises when using LLMs as decision-making agents: the optimization occurs over sequences of tokens, which are communicative units rooted in natural language, whereas effective planning requires selecting actions grounded in the problem domain (e.g., navigation moves in a spatial environment). This discrepancy mirrors the distinction between communicative acts, such as speech acts in dialogue systems (Traum 1999), and the operational actions needed for sequential decision-making (Georgeff 1988). Bridging this gap calls for new methods that formally align the language-centric outputs of LLMs with the structured, domain-specific actions required for agent planning and control.
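To make the mismatch concrete, here is a minimal sketch of a text-to-action interface for a grid-navigation task. The action vocabulary, the `parse_action` helper, and the parsing rule are illustrative assumptions, not the paper's method:

```python
import re

# Illustrative action vocabulary for a grid-navigation task (an assumption,
# not taken from the paper): the agent's tokens must be mapped onto one of
# these domain-level moves before the environment can step.
ACTIONS = {"up", "down", "left", "right"}

def parse_action(llm_output: str) -> str | None:
    """Extract the first recognizable move from free-form LLM text.

    Returns None when no domain action can be recovered, which the caller
    must handle (e.g. by re-prompting or falling back to a default move).
    """
    for token in re.findall(r"[a-z]+", llm_output.lower()):
        if token in ACTIONS:
            return token
    return None

# The gap described above shows up concretely: the model emits a
# communicative string, but only the parsed symbol is operational.
assert parse_action("I think we should move LEFT to avoid the hole.") == "left"
assert parse_action("Hmm, not sure what to do.") is None
```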

Against this background, we are the first to:

  1. Define a formal framework connecting language-based agents and sequential decision-making environments, comprising the Text-Mediated Stochastic Game (TMSG), which models the environment with an explicit text interface, and the Language-Agent Policy (LAP), which defines the agent’s LLM-based policy (a minimal interface sketch follows this list).

  2. Introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), an algorithm adapting the GRPO method to sequential decision-making tasks by assigning the entire cumulative episode reward to each individual step (see the advantage sketch after this list). To improve efficiency, the optimization for each step uses only the current state as context.

  3. Propose a novel absolute-advantage-weighted (AAW) episode sampling strategy, which we demonstrate improves training performance (see the sampling sketch below).
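A reading of the TMSG/LAP split as Python protocols may help fix ideas. The class names, method signatures, and return types below are our assumptions for illustration; the paper defines these objects formally rather than as code:

```python
from typing import Protocol

class TextMediatedEnv(Protocol):
    """A TMSG-style environment: state and action cross an explicit text interface."""

    def render_state(self) -> str:
        """Serialize the current domain state into a textual observation."""
        ...

    def step(self, action_text: str) -> tuple[str, float, bool]:
        """Apply a text-encoded action; return (next observation, reward, done)."""
        ...

class LanguageAgentPolicy(Protocol):
    """A LAP-style policy: an LLM mapping textual observations to action text."""

    def act(self, observation: str) -> str:
        ...
```

The point of the split is that everything the agent sees and emits is text, while the environment's internal dynamics remain a standard stochastic game.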
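A minimal sketch of MS-GRPO's credit assignment, assuming a GRPO-style group-relative baseline (mean and standard deviation of returns over a group of sampled episodes); the function name and the epsilon in the denominator are illustrative choices:

```python
import numpy as np

def ms_grpo_step_advantages(
    episode_returns: np.ndarray,   # cumulative reward of each episode in the group
    episode_lengths: np.ndarray,   # number of steps in each episode
) -> list[np.ndarray]:
    """Group-relative advantages with MS-GRPO's credit assignment.

    The baseline is the mean return over the group of sampled episodes;
    every step of an episode inherits that episode's full normalized
    return, so no finer-grained per-step credit assignment is attempted.
    """
    mean, std = episode_returns.mean(), episode_returns.std() + 1e-8
    normalized = (episode_returns - mean) / std
    # Broadcast each episode-level advantage to all of that episode's steps.
    return [np.full(length, adv) for adv, length in zip(normalized, episode_lengths)]

# Example: a group of 4 rollouts on a Frozen Lake-style task (values illustrative).
returns = np.array([1.0, 0.0, 0.0, 1.0])
lengths = np.array([6, 3, 8, 5])
advantages = ms_grpo_step_advantages(returns, lengths)
```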
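Finally, a sketch of what absolute-advantage-weighted episode sampling could look like, assuming episodes are drawn with probability proportional to the magnitude of their group-relative advantage; the exact weighting scheme and the uniform fallback for tied returns are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def aaw_sample(episode_advantages: np.ndarray, k: int) -> np.ndarray:
    """Sample k episode indices with probability proportional to |advantage|.

    The intuition (our paraphrase of the strategy): episodes whose return
    sits far from the group mean carry the strongest learning signal, so
    they are replayed more often during post-training.
    """
    weights = np.abs(episode_advantages)
    if weights.sum() == 0.0:
        # Degenerate group where all returns tie: fall back to uniform sampling.
        probs = np.full_like(weights, 1.0 / len(weights))
    else:
        probs = weights / weights.sum()
    return rng.choice(len(episode_advantages), size=k, replace=True, p=probs)

# Example using episode-level advantages like those from the sketch above.
print(aaw_sample(np.array([1.0, -1.0, -1.0, 1.0]), k=8))
```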