Can shared-prefix trees reduce redundancy in agent rollouts?
Independent rollouts waste tokens regenerating similar early-turn sequences. Can structuring rollouts as shared-prefix trees instead preserve early computation across samples while maintaining statistical diversity for advantage estimation?
Agent rollouts are expensive. Multi-turn agentic tasks produce trajectories with thousands of tokens and many tool calls per rollout. Group-based RL methods like GRPO sample multiple independent trajectories per task and use the group statistics for advantage estimation. The standard implementation samples each trajectory independently, from the same starting prompt — meaning every trajectory begins by re-generating the same early-turn context.
The redundancy is substantial. If task setup, initial planning, and the first few tool calls are similar across rollouts (often the case, because they all start from the same prompt), then each independent rollout pays the token cost for the early turns again, even though the model would produce nearly the same early sequence each time. The compute is real; the information added per rollout is small in the early turns.
Tree-GRPO restructures this. Rollouts share common prefixes by design — the tree starts as a single trunk and branches at decision points. Compute spent on the trunk is amortized across all leaf trajectories. The same total token budget that produces N independent chain-based rollouts can produce more than N leaf trajectories under tree sampling, because the branches diverge late while sharing the early context.
The empirical consequence is twofold. First, more distinct trajectories per fixed cost means better statistics for advantage estimation — the noise in group-relative comparisons decreases as the effective N grows. Second, the same budget can train on harder tasks where the trajectory length itself was the bottleneck — long trajectories with shared early planning fit into budgets that independent rollouts cannot accommodate.
The pattern generalizes beyond Tree-GRPO. Anywhere RL training samples multiple trajectories from a shared starting point, shared-prefix sampling saves compute. Speculative decoding has the analog at the inference layer. The unifying principle: when starting state is shared, compute up to the divergence point is amortizable.
Related concepts in this collection
-
Can tree structure alone convert outcome rewards into process supervision?
Tree-based rollouts naturally create step-level preference signals by comparing sibling subtrees. Can this structural approach replace separate process reward models without explicit step-level annotation?
same paper, the parent mechanism that makes the budget gain useful
-
Does tree depth automatically produce supervision at multiple granularities?
Tree-search rollouts branch at different depths, potentially creating supervision signals ranging from coarse strategy-level to fine-grained detail-level choices. Does this depth variation naturally yield multi-granular process supervision without explicit annotation design?
same paper, the orthogonal property
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
Original note title
shared-prefix tree rollouts dramatically expand the effective sample budget for agent RL — same token cost yields more distinct trajectories than independent chain-based rollouts