Can reinforcement learning scale beyond single-turn language tasks?
Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.
Most RL applications for LLMs have been limited to single-turn tasks — math reasoning, single-shot code generation — which are degenerate MDPs with no intermediate environmental feedback. Software engineering is categorically different: agents must manage stateful, multi-turn interactions across dozens of steps with context windows spanning hundreds of thousands of tokens, interpreting rich feedback (compiler traces, test logs) at each step.
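To make that structural contrast concrete, here is a minimal sketch of one such multi-turn episode. The `env` interface (`reset`/`step`/`evaluate`) and the `policy` callable are hypothetical stand-ins, not the paper's actual API; the point is only that informative feedback arrives at every step while reward arrives once, at the end.

```python
# Minimal sketch of the multi-turn interaction structure described above.
# `env` and `policy` are hypothetical stand-ins, not the paper's interface:
# feedback (compiler traces, test logs) arrives every step, reward only at the end.

def rollout(policy, env, max_turns=50):
    """Run one episode; only the final test-based evaluation yields reward."""
    observation = env.reset()                 # issue description + repository state
    trajectory = []
    for _ in range(max_turns):
        action = policy(observation)          # e.g. an edit or shell command
        observation, done = env.step(action)  # rich environmental feedback
        trajectory.append((action, observation))
        if done:
            break
    reward = env.evaluate()                   # sparse, delayed: did the tests pass?
    return trajectory, reward
```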
Using a modified DAPO algorithm, RL training lifts Qwen2.5-72B-Instruct's SWE-bench Verified success rate from a 20% rejection-finetuned baseline to 39%, roughly doubling it and matching or surpassing larger models such as DeepSeek-V3 and Qwen3-235B. The key challenges addressed are long-horizon credit assignment under sparse, delayed rewards, interpretation of complex but informative feedback, and expensive, noisy evaluation.
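The specific DAPO modifications are not detailed in this note, so the sketch below shows only the group-relative advantage step that DAPO shares with GRPO, applied to sparse trajectory-level rewards; the function name, group size, and normalization constant are illustrative assumptions.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """
    Group-relative advantages in the GRPO/DAPO family: each trajectory's sparse
    terminal reward is normalized against the other rollouts sampled for the same
    task, and that single scalar is then credited to every turn (and token) of
    the trajectory. `rewards` holds one group of episode outcomes for one issue.
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 rollouts of the same issue, only two of which resolve it.
print(group_advantages([0, 0, 1, 0, 0, 0, 1, 0]))
```

Broadcasting one normalized scalar across a whole trajectory is exactly what makes long-horizon credit assignment hard here: every intermediate step inherits the same signal regardless of whether it helped.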
This matters because it validates that RL's benefits extend beyond the "token-level MDP" framing where most current work operates. If cumulative rewards can teach LLMs multi-step decision making (see "Can cumulative rewards teach LLMs multi-step decision making?"), the SWE result confirms that multi-step credit assignment is not just theoretically sound but practically achievable at scale. And relative to "Does limiting reasoning per turn improve multi-turn search quality?", it suggests that RL training can learn the step-level discipline that inference-time limiting imposes.
The interaction structure of SWE — actions producing observable transitions and verifiable outcomes — may be what makes RL feasible here, whereas domains without such structure may remain harder to train.
Source: Reinforcement Learning
Related concepts in this collection
- Can cumulative rewards teach LLMs multi-step decision making?
  Explores whether attributing full episode rewards to each step enables large language models to solve sequential tasks effectively. This matters because current RL methods fail at multi-turn reasoning despite strong single-turn performance.
  complements: MS-GRPO formalizes sequential credit assignment; SWE validates it at scale
- Does limiting reasoning per turn improve multi-turn search quality?
  When language models engage in iterative search cycles, does capping reasoning at each turn, rather than just total compute, help preserve context for subsequent retrievals and improve overall search effectiveness?
  connects: SWE RL learns per-turn discipline through training rather than inference-time limiting
- Can RL training run while generation continues without waiting?
  Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?
  enables: AReaL's infrastructure makes this scale of multi-turn RL training practical
- Why do correct code trajectories teach models to tolerate errors?
  Explores why standard outcome-based RL fails for code tool use: when models receive reward for correct final answers despite intermediate code errors, they learn that mistakes are acceptable, producing poor reasoning quality.
  complementary agentic RL challenge: SWE-RL addresses long-horizon credit assignment with sparse rewards, while rStar2-Agent addresses trajectory quality in code-tool environments; both tackle the noise that tool-using RL introduces (SWE-RL through modified DAPO, rStar2 through GRPO-RoC asymmetric filtering)
- Can AI systems improve themselves through trial and error?
  Explores whether replacing formal proof requirements with empirical benchmark testing enables AI systems to successfully modify and improve their own code iteratively, and what mechanisms prevent compounding failures.
  alternative path to SWE capability: DGM achieves 50% SWE-bench via evolutionary self-modification without RL, while SWE-RL achieves 39% via RL training; DGM's evolutionary archive enables open-ended capability discovery that RL's reward optimization may not explore
Original note title
rl successfully scales to long-horizon multi-turn software engineering tasks, doubling baseline performance