Reinforcement Learning for LLMs

Can cumulative rewards teach LLMs multi-step decision making?

Explores whether attributing full episode rewards to each step enables large language models to solve sequential tasks effectively. This matters because current RL methods fail at multi-turn reasoning despite strong single-turn performance.

Note · 2026-02-22 · sourced from Reinforcement Learning

Existing RL post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. MS-GRPO addresses this through two formal contributions: the Text-Mediated Stochastic Game (TMSG), which models the environment with an explicit text interface, and the Language-Agent Policy (LAP), which formalizes the agent as an LLM mapping textual states to textual actions.
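
A minimal sketch of how these two objects might fit together, assuming a plain text-in/text-out interface; the class and field names here are illustrative, not the paper's notation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TextMediatedGame:
    """Stochastic game whose states, actions, and rewards pass through
    a text interface (a sketch of the TMSG idea, names assumed)."""
    reset: Callable[[], str]                        # -> initial state, rendered as text
    step: Callable[[str], tuple[str, float, bool]]  # action text -> (next state text, reward, done)

@dataclass
class LanguageAgentPolicy:
    """The agent's policy realized by an LLM: textual state in, action text out."""
    generate: Callable[[str], str]

def rollout(env: TextMediatedGame, policy: LanguageAgentPolicy, max_steps: int = 32):
    """Collect one episode as (state, action, reward) triples."""
    state, trajectory = env.reset(), []
    for _ in range(max_steps):
        action = policy.generate(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```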

The credit-assignment solution is direct: attribute the entire cumulative episode reward to every individual step of the episode. This is supplemented by absolute-advantage-weighted episode sampling, which improves training performance. Each step is optimized using only the current state as context, keeping computation manageable.
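
A rough sketch of the resulting mechanics, assuming GRPO's group-normalized returns; the function names and the exact weighting rule are our reading of the note, not confirmed details from the paper:

```python
import random

def group_relative_advantages(episode_returns: list[float]) -> list[float]:
    """GRPO-style advantage: z-score each episode's total return against the
    group of sampled episodes. Under whole-episode credit assignment, this
    single value is attributed to every step of that episode."""
    n = len(episode_returns)
    mean = sum(episode_returns) / n
    std = (sum((r - mean) ** 2 for r in episode_returns) / n) ** 0.5 or 1.0  # guard zero variance
    return [(r - mean) / std for r in episode_returns]

def sample_training_episodes(episodes, advantages, k):
    """One plausible reading of absolute-advantage-weighted sampling:
    episodes whose outcomes deviate most from the group mean are
    more likely to enter the training batch."""
    return random.choices(episodes, weights=[abs(a) for a in advantages], k=k)
```

Each sampled step would then be optimized with the usual clipped policy-gradient objective, conditioned only on its current state rather than the whole trajectory, which is what keeps per-step computation manageable.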

The conceptual gap bridged here is between communicative acts and operational actions. LLM optimization occurs over sequences of tokens, communicative units rooted in natural language, whereas effective planning requires selecting actions grounded in the problem domain: the same distinction as between speech acts in dialogue systems and the operational actions needed for sequential decision-making.

A 3B-parameter model post-trained with MS-GRPO outperforms a 72B-parameter baseline by 50% on Frozen Lake, demonstrating that the RL formalization yields large efficiency gains: for sequential decision-making, the right training framework matters more than model scale.

This connects to the broader multi-turn failure pattern. Read alongside Why do language models lose performance in longer conversations?, MS-GRPO suggests the degradation is partly a training gap: models trained with single-turn RL naturally struggle at multi-turn tasks because their training never addressed sequential credit assignment.


Source: Reinforcement Learning

Original note title: multi-step grpo with cumulative episode reward enables credit assignment in sequential llm decision-making