Can multi-turn reinforcement learning improve tool use in language models?
This explores whether reinforcement learning that optimizes across many conversational turns — rather than one response at a time — can make models better at the multi-step work of calling tools, running searches, and acting in an environment.
The corpus answers with a qualified yes, and the qualification is where the interesting story lives. The most direct evidence is that RL scales past single-turn tasks: a modified DAPO training run roughly doubled SWE-bench Verified performance on Qwen2.5-72B, from 20% to 39%, showing that RL works in stateful, multi-step environments where a model has to edit files, run tests, and act on delayed, messy feedback (see "Can reinforcement learning scale beyond single-turn language tasks?"). Tool use is exactly this kind of long-horizon problem, so the headline answer is encouraging.
But the corpus also pinpoints *why* naive RL quietly sabotages multi-turn behavior. Standard RLHF optimizes for the immediate next-turn reward (be maximally helpful right now), which trains models to respond passively instead of asking clarifying questions or planning ahead. Swapping in rewards that estimate the long-term value of an interaction is what unlocks active intent discovery and genuine collaboration (see "Why do language models respond passively instead of asking clarifying questions?"). The lesson transfers directly to tools: a model rewarded only for its next action will grab the first plausible tool call rather than reasoning about a sequence of them. That failure has a measured cost: across 200,000+ conversations, models lock into premature wrong assumptions and show an average 39% performance drop in multi-turn settings, with agent-style mitigations recovering only 15-20% of that loss (see "Why do language models fail in gradually revealed conversations?").
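The gap between a next-turn reward and a long-horizon reward can be made concrete with a toy calculation. This is a minimal sketch, not the method of any cited work: it uses a plain discounted sum as a stand-in for "long-term value of an interaction", and the per-turn reward numbers and both function names are hypothetical.

```python
from typing import List


def next_turn_reward(turn_rewards: List[float], t: int) -> float:
    """Myopic objective: turn t is credited only with its own immediate reward."""
    return turn_rewards[t]


def long_horizon_return(turn_rewards: List[float], t: int, gamma: float = 1.0) -> float:
    """Trajectory-aware objective: discounted sum of rewards from turn t onward."""
    return sum(gamma ** (k - t) * r for k, r in enumerate(turn_rewards) if k >= t)


# Hypothetical per-turn rewards for two three-turn interactions.
# answer_now: guesses immediately, looks helpful, then the task fails.
# clarify_first: asks a question (no immediate payoff), then succeeds.
answer_now = [0.3, 0.0, 0.0]
clarify_first = [0.0, 0.2, 1.0]

if __name__ == "__main__":
    for name, rewards in [("answer_now", answer_now), ("clarify_first", clarify_first)]:
        print(name, next_turn_reward(rewards, 0), long_horizon_return(rewards, 0))
```

Under the myopic objective the first turn of `answer_now` scores higher, while under the long-horizon return `clarify_first` does. The same ordering flip is what one would expect between "grab the first plausible tool call" and "plan a sequence of calls".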
What actually helps, beyond the reward horizon, is managing the model's *budget* across turns. Letting a model reason without limit inside a single search turn eats the context it needs for later retrieval rounds; capping reasoning per turn, not just overall, preserves the ability to absorb new evidence across an iterative tool-using loop (see "Does limiting reasoning per turn improve multi-turn search quality?"). So multi-turn competence is partly a reward-design problem and partly a context-management problem, and good tool-use training has to solve both.
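The context-erosion argument is easy to see with simple token accounting. The sketch below is illustrative only: the function name, the fixed evidence size per retrieval round, and all token counts are assumptions, not figures from the cited note.

```python
from typing import List, Optional


def retrieval_rounds_that_fit(
    context_limit: int,
    reasoning_demand: List[int],
    evidence_tokens: int,
    per_turn_cap: Optional[int] = None,
) -> int:
    """Count how many search turns fit in the context window.

    Each turn spends reasoning tokens (optionally capped per turn) and must
    still leave room for the evidence retrieved in that turn.
    """
    used = 0
    rounds = 0
    for demand in reasoning_demand:
        reasoning = demand if per_turn_cap is None else min(demand, per_turn_cap)
        if used + reasoning + evidence_tokens > context_limit:
            break  # no room left to absorb this round's evidence
        used += reasoning + evidence_tokens
        rounds += 1
    return rounds


if __name__ == "__main__":
    demand = [5000, 3000, 2000, 2000]  # hypothetical uncapped reasoning per turn
    print(retrieval_rounds_that_fit(8000, demand, evidence_tokens=1000))
    print(retrieval_rounds_that_fit(8000, demand, evidence_tokens=1000, per_turn_cap=1000))
```

With these made-up numbers, the uncapped agent exhausts the window after one round, while the capped agent completes all four. An overall limit alone would not help, because the first turn can still consume most of it.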
There's also a non-RL cousin worth knowing about. Reflexion shows agents can improve at multi-step tasks *without* updating weights at all, by storing verbal self-critiques in episodic memory and learning from clean binary success/failure signals across episodes (see "Can agents learn from failure without updating their weights?"). That's a useful contrast: where multi-turn RL bakes the lesson into parameters, reflective memory keeps it external, and the binary feedback that powers Reflexion is the same kind of unambiguous environmental signal that makes RL on tool-use tractable in the first place.
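The control flow of this kind of reflective loop can be sketched without any model at all. The toy below is an assumption-heavy illustration, not Reflexion's implementation: `choose_action` and `reflect` are hypothetical stand-ins for an LLM policy and an LLM-written self-critique, and the environment is reduced to a function returning a binary success signal.

```python
from typing import Callable, List, Optional, Tuple


def choose_action(candidates: List[str], memory: List[str]) -> Optional[str]:
    """Stand-in for a policy conditioned on memory: skip actions a reflection rules out."""
    for action in candidates:
        if not any(f"'{action}'" in note for note in memory):
            return action
    return None


def reflect(action: str, episode: int) -> str:
    """Stand-in for a verbal self-critique written after a failed episode."""
    return f"Episode {episode}: '{action}' failed; do not repeat it."


def run_episodes(
    candidates: List[str],
    env_succeeds: Callable[[str], bool],
    max_episodes: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Retry across episodes, learning only through the external memory."""
    memory: List[str] = []  # reflections are stored verbatim, not compressed
    for episode in range(max_episodes):
        action = choose_action(candidates, memory)
        if action is None:
            break  # every candidate has been ruled out
        if env_succeeds(action):  # binary success/failure signal
            return action, memory
        memory.append(reflect(action, episode))
    return None, memory


if __name__ == "__main__":
    action, memory = run_episodes(["grep", "find", "ls"], lambda a: a == "ls")
    print(action)
    print(memory)
```

Nothing resembling a parameter changes between episodes; the only state that improves behavior is the list of reflections, which is the contrast with RL drawn above.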
Finally, the corpus hints at where the reward signals themselves come from. RL can train directly on rule-based, environment-native metrics, such as recommendation scores like NDCG and Recall used as black-box rewards with no human-labeled SFT data required (see "Can recommendation metrics train language models directly?"), and RL that rewards reasoning quality, not just final-token correctness, internalizes coherent knowledge better than supervised fine-tuning (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"). Read together, these suggest the real frontier isn't whether multi-turn RL *can* improve tool use, but whether you can design a reward that values the whole trajectory (the right sequence of tool calls and the reasoning between them) rather than the next move alone.
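To make "rule-based metrics as rewards" concrete, here is a minimal sketch of Recall@k and binary-relevance NDCG@k computed over a ranked list. The function names are hypothetical, the ranked list is assumed to contain no duplicate items, and the weighted combination in `retrieval_reward` is an illustrative assumption rather than the reward used by any cited system.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def retrieval_reward(ranked: Sequence[str], relevant: Set[str], k: int = 10, w: float = 0.5) -> float:
    """Hypothetical scalar reward: a weighted mix of NDCG@k and Recall@k."""
    return w * ndcg_at_k(ranked, relevant, k) + (1.0 - w) * recall_at_k(ranked, relevant, k)


if __name__ == "__main__":
    relevant = {"a", "c"}
    print(retrieval_reward(["a", "c", "b"], relevant, k=3))  # ideal ordering
    print(retrieval_reward(["a", "b", "c"], relevant, k=3))  # one relevant item ranked lower
```

Because the reward is computed from the retrieved ranking alone, it needs no labeled demonstrations of the model's output, only the relevance judgments the metric already relies on.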
Sources (7 notes)
Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings because they lock into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.