Reinforcement Learning for LLMs

Can reinforcement learning scale beyond single-turn language tasks?

Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.

Note · 2026-02-22 · sourced from Reinforcement Learning

Most RL applications for LLMs have been limited to single-turn tasks — math reasoning, single-shot code generation — which are degenerate MDPs with no intermediate environmental feedback. Software engineering is categorically different: agents must manage stateful, multi-turn interactions across dozens of steps with context windows spanning hundreds of thousands of tokens, interpreting rich feedback (compiler traces, test logs) at each step.
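The contrast can be made concrete with a toy rollout loop. This is a minimal sketch, not the paper's setup: `ToyEnv` and the `good_edit` action are invented stand-ins. The point is structural: feedback (test logs) arrives at every step, but the reward is a single scalar at episode end, so credit must be assigned across all intermediate steps.

```python
# Toy multi-turn environment (hypothetical): the "bug" is fixed after
# three correct edits; reward is sparse and terminal.

class ToyEnv:
    def __init__(self):
        self.progress = 0

    def reset(self):
        self.progress = 0
        return "initial test failure log"

    def step(self, action):
        if action == "good_edit":
            self.progress += 1
        done = self.progress >= 3
        obs = f"tests failing: {3 - self.progress}"  # rich per-step feedback
        return obs, done

    def final_reward(self):
        return 1.0 if self.progress >= 3 else 0.0    # sparse terminal reward


def run_episode(agent_policy, env, max_steps=10):
    """Roll out a multi-turn episode; only the terminal reward is observed."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = agent_policy(obs)       # e.g. an edit or shell command
        obs, done = env.step(action)     # obs: compiler traces, test logs
        trajectory.append((action, obs))
        if done:
            break
    return trajectory, env.final_reward()


traj, reward = run_episode(lambda obs: "good_edit", ToyEnv())
print(len(traj), reward)  # 3 1.0
```

A single-turn task collapses this loop to one step with an immediate reward; the multi-turn case is where long-horizon credit assignment becomes the central difficulty.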

Using a modified DAPO algorithm, training Qwen2.5-72B-Instruct roughly doubles its SWE-bench Verified success rate, from a 20% rejection-finetuning baseline to 39%, matching or surpassing larger models such as DeepSeek-V3 and Qwen3-235B. The key challenges addressed are long-horizon credit assignment under sparse, delayed rewards; interpreting complex but informative environmental feedback; and expensive, noisy evaluation.
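The core mechanism in the GRPO/DAPO family can be sketched briefly. This is a hedged illustration, not the paper's exact modification: several trajectories are sampled per task, each scored by the sparse terminal reward, and advantages are normalized within the group; DAPO additionally filters out groups whose rewards are all identical, since they carry no gradient signal ("dynamic sampling").

```python
# Group-relative advantage estimation, GRPO/DAPO-style (illustrative sketch).
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-6):
    """Per-trajectory advantages, normalized within one group of rollouts."""
    if max(rewards) == min(rewards):
        return None  # dynamic sampling: drop uninformative groups
    mu, sigma = mean(rewards), stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One SWE task, four rollouts: one success, three failures.
print(group_advantages([1.0, 0.0, 0.0, 0.0]))  # positive for the success
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # None: all failed, skipped
```

With binary terminal rewards, the successful trajectory gets a positive advantage applied to all its tokens, which is how a single end-of-episode signal propagates across a dozens-of-steps interaction.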

This matters because it validates that RL's benefits extend beyond the "token-level MDP" framing in which most current work operates. Building on "Can cumulative rewards teach LLMs multi-step decision making?", the SWE result confirms that multi-step credit assignment is not just theoretically sound but practically achievable at scale. And relative to "Does limiting reasoning per turn improve multi-turn search quality?", it suggests that RL training can learn the step-level discipline that inference-time limiting imposes.

The interaction structure of SWE — actions producing observable transitions and verifiable outcomes — may be what makes RL feasible here, whereas domains without such structure may remain harder to train.
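What "verifiable outcomes" buys in practice is that reward can be computed mechanically from test results, with no learned reward model to game. A minimal sketch (the test names and dict-of-booleans representation are illustrative assumptions, not the paper's harness):

```python
# Outcome-based reward from test results: sparse but unambiguous.
def outcome_reward(test_results: dict[str, bool]) -> float:
    """1.0 only if every test passes; 0.0 otherwise (including no results)."""
    return 1.0 if test_results and all(test_results.values()) else 0.0

print(outcome_reward({"test_parse": True, "test_edge_case": True}))   # 1.0
print(outcome_reward({"test_parse": True, "test_edge_case": False}))  # 0.0
```

Domains without such a mechanical check would need learned or proxy rewards, which is plausibly why they remain harder to train.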



Original note title: rl successfully scales to long-horizon multi-turn software engineering tasks doubling baseline performance