Can tree structure alone convert outcome rewards into process supervision?
Tree-based rollouts naturally create step-level preference signals by comparing sibling subtrees. Can this structural approach replace separate process reward models without explicit step-level annotation?
Agent RL with outcome-only rewards faces a sparse-supervision problem at long horizons. Multi-turn trajectories with thousands of tokens and many tool calls produce trajectory-level reward signals that cannot identify which specific steps contributed to success or failure. The standard responses — process reward models trained separately, dense intermediate rewards from human annotation — each have costs that limit deployment.
Tree-based Group Relative Policy Optimization (Tree-GRPO) takes a third path: it uses the tree structure itself as the source of process supervision. Tree nodes represent complete agent interaction steps. Rollouts branch at decision points and share common prefixes. When outcome rewards arrive at the leaves, they are propagated back up the tree. At each branching point, the difference between sibling subtrees yields a preference-learning objective: if sibling A's subtree did better than sibling B's, the action choice that led to A is reinforced over the one that led to B.
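The back-propagation and sibling comparison described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact estimator: the `Node` class, the choice of averaging child values, and the mean-of-siblings baseline are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class Node:
    """One complete agent interaction step (hypothetical structure)."""
    name: str
    children: list["Node"] = field(default_factory=list)
    reward: Optional[float] = None  # outcome reward; must be set on every leaf
    value: float = 0.0              # filled in by backup()


def backup(node: Node) -> float:
    """Propagate leaf outcome rewards upward.

    Assumption: an internal node's value is the mean of its children's values.
    """
    if not node.children:
        node.value = float(node.reward)
    else:
        node.value = mean(backup(child) for child in node.children)
    return node.value


def sibling_advantages(node: Node, out: Optional[dict] = None) -> dict:
    """At each branching point, score every child against the sibling mean."""
    if out is None:
        out = {}
    if len(node.children) > 1:
        baseline = mean(child.value for child in node.children)
        for child in node.children:
            out[child.name] = child.value - baseline
    for child in node.children:
        sibling_advantages(child, out)
    return out


# Toy tree: the root branches into A and B; A branches again into A1 and A2.
tree = Node("root", [
    Node("A", [Node("A1", reward=1.0), Node("A2", reward=0.0)]),
    Node("B", [Node("B1", reward=0.0)]),
])
backup(tree)
adv = sibling_advantages(tree)
print(adv)
```

In this toy tree, A's subtree averages 0.5 and B's averages 0.0, so the step leading to A receives a positive relative score and the step leading to B a negative one, even though only the leaves carried rewards.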
The key insight: process supervision does not require process-level reward design. The tree structure transforms trajectory-level outcome rewards into step-level preference signals automatically. The depth at which a branching point sits determines the granularity of the preference signal — shallow branches give coarse step-level supervision, deep branches give fine-grained sub-step supervision. Random tree expansion yields process signals of varying granularity without any annotation effort.
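To make the link between branch depth and granularity concrete, the sketch below grows a tree by repeatedly picking a random existing node and rolling out a new continuation from it. The function name `expand_randomly`, the uniform choice of branching node, and the fixed `max_steps` horizon are illustrative assumptions, not details taken from the paper.

```python
import random


def expand_randomly(max_steps: int, n_expansions: int, seed: int = 0):
    """Grow a rollout tree and record the depth of every branching point."""
    rng = random.Random(seed)
    depth = {0: 0}       # node id -> number of steps taken to reach it
    parent = {0: None}   # node id -> parent id (records the tree shape)
    next_id = 1

    def rollout_from(node: int) -> None:
        """Extend a single chain from `node` until the step horizon."""
        nonlocal next_id
        while depth[node] < max_steps:
            child = next_id
            next_id += 1
            parent[child] = node
            depth[child] = depth[node] + 1
            node = child

    rollout_from(0)  # initial trajectory from the root

    branch_depths = []
    for _ in range(n_expansions):
        # Any node that is not already a leaf can host a new branch.
        candidates = [n for n in depth if depth[n] < max_steps]
        node = rng.choice(candidates)
        branch_depths.append(depth[node])
        rollout_from(node)
    return branch_depths, parent, depth


branch_depths, parent, depth = expand_randomly(max_steps=4, n_expansions=5)
for d in branch_depths:
    # Siblings created at depth d share d steps and differ over the rest.
    print(f"branch at depth {d}: shared prefix {d} steps, divergent suffix {4 - d} steps")
```

A branch near the root compares continuations that diverge for most of the trajectory, while a branch near the leaves isolates a choice made over the last few steps, so a single randomly expanded tree mixes both kinds of comparison.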
This is mechanically distinct from process reward models. PRMs train a separate scoring model on annotated intermediate steps, then use it as a reward signal during agent RL. Tree-GRPO does not train a separate model and does not require step-level annotations. The same outcome rewards that already exist for the task, combined with the structural information in the tree, suffice. The supervision quality differs — PRMs can encode richer notions of "good intermediate step," while Tree-GRPO only knows "this subtree did better than that one" — but the deployment cost is dramatically lower.
For agent-RL deployments where step-level annotation is impractical and outcome rewards are noisy, Tree-GRPO offers a plug-and-play path to process supervision that scales with rollout budget rather than annotator effort.
Related concepts in this collection
- Can shared-prefix trees reduce redundancy in agent rollouts?
  Independent rollouts waste tokens regenerating similar early-turn sequences. Can structuring rollouts as shared-prefix trees instead preserve early computation across samples while maintaining statistical diversity for advantage estimation?
  same paper, the efficiency mechanism
- Does tree depth automatically produce supervision at multiple granularities?
  Tree-search rollouts branch at different depths, potentially creating supervision signals ranging from coarse strategy-level to fine-grained detail-level choices. Does this depth variation naturally yield multi-granular process supervision without explicit annotation design?
  same paper, the granularity property
- Does supervising retrieval steps outperform final answer rewards?
  Can intermediate feedback on retrieval decisions (which documents to fetch, when to stop) train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.
  adjacent: the broader finding that process supervision matters in agent RL
- Can step-wise expert rewards help small models learn hard reasoning?
  When small models fail on hard multi-step problems, can training them to match expert reasoning steps rather than final answers provide useful learning signals? This explores whether intermediate-step alignment might overcome the limitations of both supervised fine-tuning and outcome-based reinforcement learning.
  adjacent: another way to convert sparse signals into dense step-level rewards
Original note title
tree-search rollouts in agent RL convert outcome rewards into step-wise process supervision — back-propagating from subtree leaves creates intra-tree advantage estimation