Can cumulative rewards teach LLMs multi-step decision making?
Explores whether attributing the full episode reward to each step enables large language models to solve sequential tasks effectively. This matters because models post-trained with current RL methods show strong single-turn performance yet struggle with multi-turn reasoning.
Existing RL post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. MS-GRPO addresses this through two formal contributions: the Text-Mediated Stochastic Game (TSMG), which models the environment with an explicit text interface, and the Language-Agent Policy (LAP), which defines the agent's LLM-based policy.
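A minimal sketch of what such a text-mediated interaction loop could look like, assuming a generic environment that renders its state as text and an LLM policy that maps that text to a textual action; the `reset`/`step` interface, the `Transition` record, and the example Frozen Lake strings are illustrative assumptions, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    observation: str   # textual state description rendered by the environment
    action: str        # textual action emitted by the LLM policy
    reward: float      # scalar reward from the underlying game

def rollout(env, policy, max_steps=20):
    """Collect one episode through a text interface: the environment renders
    its state as text, the language-agent policy maps that text to a textual
    action, and the environment grounds the action back into the game."""
    obs = env.reset()              # e.g. "You are on tile (0,0). Holes at ..."
    episode, done, step = [], False, 0
    while not done and step < max_steps:
        action = policy(obs)       # e.g. "move right"
        next_obs, reward, done = env.step(action)
        episode.append(Transition(obs, action, reward))
        obs = next_obs
        step += 1
    return episode
```

Under this framing, the LLM's distribution over the action text plays the role of the agent's policy in the underlying game.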
The credit assignment solution is direct: attribute the entire cumulative episode reward to every step of the episode. This is supplemented by absolute-advantage-weighted episode sampling, which further improves training performance. Optimization at each step conditions only on the current state, keeping context length and computation manageable.
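A hedged sketch of how this credit assignment and sampling scheme might be implemented, reusing the `Transition` records from the rollout sketch above; the group-relative normalization by the standard deviation and the exact sampling weights are assumptions standing in for the paper's definitions:

```python
import random

def episode_return(episode):
    # Cumulative reward accumulated over the whole episode.
    return sum(t.reward for t in episode)

def step_level_targets(episodes):
    """Attribute the full episode return to every step, then normalize across
    the sampled group of episodes (GRPO-style), so each step in a successful
    episode is reinforced and each step in a failed one is discouraged."""
    returns = [episode_return(ep) for ep in episodes]
    mean = sum(returns) / len(returns)
    std = (sum((r - mean) ** 2 for r in returns) / len(returns)) ** 0.5 or 1.0
    targets = []
    for ep, r in zip(episodes, returns):
        advantage = (r - mean) / std          # same advantage for every step
        for t in ep:
            # Optimization for this step conditions only on t.observation,
            # not on the full episode history.
            targets.append((t.observation, t.action, advantage))
    return targets

def sample_episodes_by_abs_advantage(episodes, k):
    """Absolute-advantage-weighted episode sampling: episodes whose return
    deviates most from the group mean (in either direction) are more likely
    to be selected for the policy update."""
    returns = [episode_return(ep) for ep in episodes]
    mean = sum(returns) / len(returns)
    weights = [abs(r - mean) + 1e-6 for r in returns]
    return random.choices(episodes, weights=weights, k=k)
```

Because the same advantage is attached to every step, no per-step value estimate or learned critic is required; the trade-off is coarser credit assignment within an episode.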
The conceptual gap bridged here is between communicative acts and operational actions. LLM optimization occurs over sequences of tokens — communicative units rooted in natural language — but effective planning requires selection of actions grounded in the problem domain. This is the distinction between speech acts in dialogue systems and operational actions needed for sequential decision-making.
A 3B-parameter model post-trained with MS-GRPO outperforms a 72B-parameter baseline by 50% on Frozen Lake, demonstrating that the RL formalization enables massive efficiency gains: the right training framework matters more than model scale for sequential decision-making.
This connects to the broader multi-turn failure pattern. Read alongside "Why do language models lose performance in longer conversations?", MS-GRPO suggests the degradation is partly a training gap: models trained with single-turn RL naturally struggle at multi-turn tasks because their training never addressed sequential credit assignment.
Source: Reinforcement Learning
Related concepts in this collection
- Why do language models lose performance in longer conversations?
  Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.
  extends: multi-turn failures may also be a training formulation gap that MS-GRPO addresses
- Does limiting reasoning per turn improve multi-turn search quality?
  When language models engage in iterative search cycles, does capping reasoning at each turn—rather than just total compute—help preserve context for subsequent retrievals and improve overall search effectiveness?
  complements: MS-GRPO provides the training framework for what per-turn limiting addresses at inference
- Why do language models respond passively instead of asking clarifying questions?
  Explores whether the reward signals used to train language models might actively discourage them from seeking clarification or taking initiative in conversations, and what alternative training approaches might enable more collaborative dialogue.
  supports: MS-GRPO's cumulative episode reward is exactly the multi-turn-aware reward called for
- Why do language models fail in gradually revealed conversations?
  Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
  addresses the training root: models get lost because single-turn RL training never teaches sequential credit assignment; MS-GRPO's cumulative episode reward directly targets the premature-commitment failure by attributing multi-step outcomes to earlier decisions
Original note title
multi-step grpo with cumulative episode reward enables credit assignment in sequential llm decision-making