Why does multi-turn RL generate orders of magnitude more tokens than single-turn?

This explores why training an LLM with reinforcement learning across a back-and-forth task (multi-turn) consumes far more generated tokens than training it on one-shot prompts. The corpus answers this less as a single fact than as three compounding mechanisms.

This reads the question as: where does the token blowup actually come from when RL runs over conversations or agent loops instead of single prompts? The corpus doesn't have one note that says 'here's the multiplier,' but three pieces fit together to explain it.

The first is simply horizon. Single-turn RL treats a task as one prompt and one graded answer; multi-turn RL operates in stateful, multi-step environments where reward arrives only after a long chain of actions. The work showing RL roughly doubling SWE-bench Verified performance, from 20% to 39% (Can reinforcement learning scale beyond single-turn language tasks?), is explicitly about long-horizon tasks with delayed rewards: every training episode is a whole trajectory of reads, edits, and tool calls, not a single generation. Episode length alone can account for the first order of magnitude.
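The horizon effect can be made concrete with a little arithmetic. The sketch below is illustrative only: the turn count and per-turn token figures are hypothetical and do not come from any of the cited notes.

```python
# All numbers below are hypothetical, chosen only to show the shape of the comparison.

def single_turn_tokens(answer_tokens: int) -> int:
    """One prompt, one graded answer: generated tokens equal the answer length."""
    return answer_tokens


def multi_turn_tokens(tokens_per_turn: list[int]) -> int:
    """One episode is a whole trajectory: sum generation over every turn."""
    return sum(tokens_per_turn)


single = single_turn_tokens(answer_tokens=500)
# Assume a 40-turn agent episode generating about 500 tokens per turn.
multi = multi_turn_tokens([500] * 40)
ratio = multi / single
print(single, multi, ratio)
```

Under these assumed numbers a single episode already costs 40 times the single-turn generation, before any of the other mechanisms apply.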

The second mechanism is context accumulation that compounds turn over turn. Each turn doesn't start fresh: it carries the growing transcript of everything generated before it, and the model reasons on top of that. The research on per-turn reasoning budgets (Does limiting reasoning per turn improve multi-turn search quality?) shows that unrestricted reasoning inside a single turn eats the context the agent needs for later retrieval rounds. The flip side of that finding is the cost story: if you don't cap per-turn reasoning, each turn's generation can balloon, and because every later turn carries that output forward in its context, the cost recurs at each subsequent turn instead of being paid once. Single-turn RL has no later turns to feed, so it never pays this compounding tax.
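A rough way to see the compounding is to track the transcript as it grows. The sketch below is a simplified accounting model, not a measurement: the function names, the 32,000-token window, and all token counts are assumptions, and it ignores caching and truncation.

```python
def episode_cost(turns: int, prompt_tokens: int, gen_per_turn: int, obs_per_turn: int):
    """Return (generated, processed) token totals for one episode.

    generated: tokens the policy samples.
    processed: context tokens conditioned on, summed over turns,
               assuming no caching or truncation (a simplification).
    """
    context = prompt_tokens
    generated = 0
    processed = 0
    for _ in range(turns):
        processed += context                     # each turn reads the full transcript so far
        generated += gen_per_turn
        context += gen_per_turn + obs_per_turn   # transcript grows by generation plus observation
    return generated, processed


def turns_that_fit(window: int, prompt_tokens: int, gen_per_turn: int, obs_per_turn: int) -> int:
    """How many full turns fit before the transcript exceeds the context window."""
    return (window - prompt_tokens) // (gen_per_turn + obs_per_turn)


# Hypothetical numbers: 1,000-token prompt, 300-token observations per turn.
capped_10 = episode_cost(turns=10, prompt_tokens=1000, gen_per_turn=200, obs_per_turn=300)
capped_20 = episode_cost(turns=20, prompt_tokens=1000, gen_per_turn=200, obs_per_turn=300)
fit_capped = turns_that_fit(32000, 1000, gen_per_turn=200, obs_per_turn=300)
fit_uncapped = turns_that_fit(32000, 1000, gen_per_turn=2000, obs_per_turn=300)
print(capped_10, capped_20, fit_capped, fit_uncapped)
```

In this toy model, doubling the turn count doubles generated tokens but more than triples the context processed, and letting per-turn reasoning grow from 200 to 2,000 tokens cuts the number of turns that fit in the window from 62 to 13.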

The third is sampling structure. RL doesn't generate one trajectory per training example — it samples many rollouts to estimate which actions were good. The shared-prefix tree work (Can shared-prefix trees reduce redundancy in agent rollouts?) exists precisely because naive multi-turn rollouts are so token-expensive: independent chains re-generate shared prefixes over and over, and the fix is to branch from common prefixes to get more distinct trajectories per token budget. That this optimization was worth building tells you how steep the baseline cost is — long horizon times wide sampling is multiplicative, which is exactly how you get 'orders of magnitude' rather than 'a bit more.'
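The multiplication of horizon by sampling width, and the saving from sharing a prefix, can be sketched as follows. This is a deliberately simplified tree with a single branch point, not the algorithm from the cited note, and every number is hypothetical.

```python
def chain_tokens(n_trajectories: int, turns: int, tokens_per_turn: int) -> int:
    """Independent chains: every trajectory generates all of its turns from scratch."""
    return n_trajectories * turns * tokens_per_turn


def tree_tokens(n_trajectories: int, turns: int, tokens_per_turn: int, shared_turns: int) -> int:
    """Single-branch-point tree (a simplification): the first `shared_turns`
    are generated once, then each trajectory generates only its own suffix."""
    prefix = shared_turns * tokens_per_turn
    suffixes = n_trajectories * (turns - shared_turns) * tokens_per_turn
    return prefix + suffixes


# Hypothetical setup: 8 rollouts per task, 20 turns, 500 tokens per turn.
single_turn = chain_tokens(n_trajectories=8, turns=1, tokens_per_turn=500)
chains = chain_tokens(n_trajectories=8, turns=20, tokens_per_turn=500)
tree = tree_tokens(n_trajectories=8, turns=20, tokens_per_turn=500, shared_turns=10)
print(single_turn, chains, tree)
```

With these assumed values, independent multi-turn chains cost 80,000 generated tokens against 4,000 for single-turn sampling at the same width, and branching halfway through brings the multi-turn figure down to 45,000 for the same eight trajectories.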

There's a quieter implication worth surfacing: most of those tokens aren't where the learning happens. The RLVR work on high-entropy tokens (Do high-entropy tokens drive reasoning model improvements?) finds that only ~20% of tokens are high-entropy, pivotal decision points carrying the training signal, and that training on just those matches or exceeds full-gradient updates. So multi-turn RL spends its enormous token budget largely on filler around a small number of forking moments, which is why the efficiency frontier in this area is all about cutting redundant generation (tree rollouts) or limiting it (per-turn budgets) without losing those decisive tokens.
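As an illustration of what selecting the forking tokens could look like, the sketch below computes per-position entropy and keeps the top fraction. It is a minimal sketch under stated assumptions: the function names, the toy distributions, and the top-k selection rule are all hypothetical, and the cited work's actual selection and loss masking may differ.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def high_entropy_mask(distributions: list[list[float]], keep_fraction: float = 0.2) -> list[bool]:
    """Mark the positions whose entropy is in the top `keep_fraction`."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(entropies))]


# Toy next-token distributions over a 4-token vocabulary (hypothetical values).
dists = [
    [1.0, 0.0, 0.0, 0.0],      # fully determined token
    [0.97, 0.01, 0.01, 0.01],  # nearly determined
    [0.25, 0.25, 0.25, 0.25],  # a genuine fork
    [0.9, 0.1, 0.0, 0.0],
    [0.99, 0.01, 0.0, 0.0],
]
mask = high_entropy_mask(dists, keep_fraction=0.2)
print(mask)
```

A mask like this could be used to restrict the policy-gradient loss to the selected positions; the generation cost of the other positions is still paid, which is the point the paragraph above makes.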

The thing you may not have known you wanted to know: the token explosion isn't a flaw to be eliminated, it's the price of exploration in a long, stateful task — and nearly every recent technique here is a different bet on which of those tokens you can safely stop generating.

Sources 4 notes

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Can shared-prefix trees reduce redundancy in agent rollouts?

Tree-structured rollouts that branch from shared prefixes produce more distinct trajectories within a fixed token budget than independent chain sampling. This improves advantage estimation statistics and enables longer-horizon tasks within the same compute constraint.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
