Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning

Paper · arXiv 2508.03501 · Published August 5, 2025

Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction in which the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a nontrivial observation.
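To make the contrast concrete, the sketch below juxtaposes the two regimes. The `generate`, `grade`, `env_reset`, and `env_step` interfaces are hypothetical stand-ins introduced for illustration, not the paper's actual scaffolding:

```python
from typing import Callable, List, Tuple

def single_turn(generate: Callable[[str], str],
                grade: Callable[[str], float],
                prompt: str) -> float:
    """Degenerate MDP: one action, no intermediate feedback, terminal reward."""
    return grade(generate(prompt))

def multi_turn(generate: Callable[[str], str],
               env_reset: Callable[[], str],
               env_step: Callable[[str], Tuple[str, float, bool]],
               max_steps: int = 50) -> float:
    """Stateful multi-turn MDP: each action yields a nontrivial observation
    that is appended to the context before the next decision."""
    context: List[str] = [env_reset()]         # task statement / repo state
    for _ in range(max_steps):
        action = generate("\n".join(context))  # policy conditions on full history
        obs, reward, done = env_step(action)   # e.g., compiler trace, test log
        context += [action, obs]               # feedback accumulates in context
        if done:
            return reward                      # sparse, delayed success signal
    return 0.0                                 # horizon exhausted counts as failure
```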

To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach raises the agent’s success rate on the SWE-bench Verified benchmark from 20%, achieved by a rejection fine-tuned baseline, to 39%, without relying on any teacher models.
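The paper's specific modifications are detailed later; as orientation, the sketch below shows the standard DAPO ingredients from Yu et al. (2025): decoupled clip ranges, group-relative advantages, and token-level loss averaging. Tensor shapes, hyperparameter values, and function names are illustrative assumptions, not the authors' exact implementation:

```python
import torch

def dapo_loss(logp_new, logp_old, advantages, mask,
              eps_low: float = 0.2, eps_high: float = 0.28):
    """Minimal sketch of a DAPO-style clipped surrogate (Yu et al. 2025).

    logp_new, logp_old: per-token log-probs under the current / behavior
    policy, shape (batch, seq). advantages: per-token advantages, here the
    group-normalized trajectory reward broadcast to every generated token.
    mask: 1 for agent-generated tokens, 0 for environment/context tokens.
    """
    ratio = torch.exp(logp_new - logp_old)
    # Decoupled clip ranges ("clip-higher"): a larger upper bound keeps
    # low-probability exploratory tokens trainable on positive advantages.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = -torch.minimum(ratio * advantages, clipped * advantages)
    # Token-level averaging across the whole batch (not per sequence),
    # so long trajectories are not down-weighted.
    return (per_token * mask).sum() / mask.sum()

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantage: normalize rewards within a group of rollouts
    sampled for the same task. Groups with zero reward variance carry no
    learning signal and are filtered out by DAPO's dynamic sampling."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```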

The interactive, structured nature of SWE, where actions produce observable transitions and verifiable outcomes, makes it an ideal domain for RL. Yet, to date, most RL applications for LLMs have been limited to single-turn tasks, such as math reasoning or single-shot code generation, which can be trivially modeled as multi-armed bandits or degenerate MDPs with no intermediate environmental feedback (Figure 2).

In contrast, SWE scenarios require agents to manage stateful, multi-turn interactions. Successfully applying RL in this context involves several key challenges:

• Long-horizon, multi-turn interaction: Agents must maintain coherence across dozens of steps with context windows spanning hundreds of thousands of tokens.

• Complex, informative feedback: Actions elicit rich outputs (e.g., compiler traces, test logs) that must be interpreted to guide subsequent decisions effectively.

• Data scalability and fidelity: Generating high-quality trajectories requires the reproduction of specific repository states in controlled environments, which limits dataset scale. Large-scale datasets such as SWE-smith (Yang et al. 2025) and SWE-rebench (Badertdinov et al. 2025) begin to address this gap; we primarily build on the latter.

• Sparse, delayed rewards: Success signals typically emerge only at the end of long action sequences, complicating credit assignment.

• Expensive and noisy evaluation: Unrolling trajectories and evaluating their outcomes are costly, and test flakiness injects noise into the reward signal (a sketch of a flakiness-robust reward follows the contribution list below).

In this paper, we address these challenges by developing a complete RL pipeline tailored explicitly to interactive SWE tasks. Our core contributions include:

• A scalable RL framework, based on a modified DAPO algorithm (Yu et al. 2025), specifically adapted to handle the demands of long-horizon, multi-turn SWE scenarios.

• Empirical demonstration of RL effectiveness by training a Qwen2.5-72B-Instruct agent that achieves approximately 39% success on the SWE-bench Verified benchmark, doubling the baseline performance of a rejection-fine-tuned agent. Moreover, our agent matches or surpasses top open-weight models, such as DeepSeek-V3-0324 and Qwen3-235B-A22B, on the SWE-rebench benchmark (Table 1).

• Detailed analysis of our RL training methodology, including algorithmic modifications, hyperparameter settings, key findings, and a discussion of promising future directions for applying RL to LLM-based agents in interactive, stateful environments.
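As referenced in the challenge list above, a natural way to combine sparse terminal rewards with noisy evaluation is to score only the trajectory's final repository state and vote over repeated test runs. The sketch below illustrates this idea; `run_tests` is a hypothetical helper, and the scheme is an illustrative assumption rather than the paper's exact reward:

```python
from typing import Callable

def trajectory_reward(run_tests: Callable[[], bool],
                      n_runs: int = 3) -> float:
    """Sparse, end-of-trajectory reward: the agent's final patch is rewarded
    only if the repository's tests pass, and repeated runs vote to damp
    flaky-test noise. Returns a binary reward in {0.0, 1.0}."""
    passes = sum(run_tests() for _ in range(n_runs))  # True counts as 1
    return 1.0 if passes > n_runs // 2 else 0.0       # majority vote
```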

Our results show that RL can be successfully applied to general, interactive environments, advancing the capabilities of open-weight LLM agents beyond single-turn benchmarks and toward the real-world settings in which future autonomous agents are expected to operate.