Does tree depth automatically produce supervision at multiple granularities?

Tree-search rollouts branch at different depths, potentially creating supervision signals ranging from coarse strategy-level to fine-grained detail-level choices. Does this depth variation naturally yield multi-granular process supervision without explicit annotation design?

Note · 2026-05-18 · sourced from Tasks Planning

Tree-search rollouts have a subtle but useful property: the depth at which branches diverge determines the granularity of the resulting process-supervision signal. Because Tree-GRPO expands nodes at random, a single training run yields signals at several granularities.

Sibling subtrees share every step up to their branch point, so any difference in their outcomes is attributable to what happens after the divergence. When the divergence happens early in the tree, sibling subtrees differ in their high-level approach: different opening moves, different strategic choices, different initial plans. The preference signal at this branching point is coarse: it tells the agent that one strategy worked better than another. When the divergence happens late, sibling subtrees differ only in fine-grained choices: different wording in an output, different argument values in a tool call, different subgoals within a fixed plan. The preference signal at this branching point is fine-grained: it credits individual choices that outcome-only RL cannot isolate, because a single trajectory-level reward is spread over every step.
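The depth-to-granularity mapping can be illustrated with a toy tree. The sketch below is not Tree-GRPO's actual advantage computation. It assumes a hypothetical `Node` type, scores each subtree by the mean reward of its leaves, and compares each child against the mean of its siblings. All labels and rewards are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical rollout-tree node; only leaves carry an outcome reward."""
    label: str
    reward: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def subtree_value(node: Node) -> float:
    """Mean outcome reward over the leaves below this node (assumed scoring rule)."""
    if not node.children:
        return node.reward
    values = [subtree_value(child) for child in node.children]
    return sum(values) / len(values)


def sibling_signals(node: Node, depth: int = 0) -> List[Tuple[int, str, float]]:
    """Collect (branch depth, child label, sibling-relative advantage) triples."""
    signals = []
    if len(node.children) >= 2:  # only a real branch point yields a comparison
        values = [subtree_value(child) for child in node.children]
        baseline = sum(values) / len(values)
        for child, value in zip(node.children, values):
            signals.append((depth, child.label, value - baseline))
    for child in node.children:
        signals.extend(sibling_signals(child, depth + 1))
    return signals


# Toy tree: an early branch between two plans, a late branch between tool arguments.
root = Node("task", children=[
    Node("plan_A", children=[
        Node("call_tool", children=[
            Node("args_v1", reward=1.0),
            Node("args_v2", reward=0.0),
        ]),
    ]),
    Node("plan_B", reward=0.0),
])

signals = sibling_signals(root)
for depth, label, advantage in signals:
    print(f"depth={depth} {label}: {advantage:+.2f}")
```

The depth-0 comparison ranks `plan_A` against `plan_B` (a strategy-level signal), while the depth-2 comparison ranks two argument choices inside the same plan (a detail-level signal). Both come from the same tree and the same leaf rewards.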

The random-expansion strategy is what produces the multi-granularity property. Tree-GRPO does not require predetermined branching depths or hand-designed granularity schedules. The sampling process naturally yields some early branches and some late branches per task, and the resulting supervision signal spans the granularity range automatically.
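A minimal simulation shows why random expansion spreads branch points across depths. It is a sketch under simplifying assumptions that the note does not specify: every rollout runs to a fixed `horizon`, and each expansion picks one existing non-terminal node uniformly at random. The function name and parameters are hypothetical.

```python
import random
from collections import Counter


def random_expansion_depths(horizon: int, num_expansions: int, seed: int = 0) -> list:
    """Return the depth of each branch point created by random expansion."""
    rng = random.Random(seed)
    # Depths of expandable (non-terminal) nodes; the initial rollout
    # contributes exactly one node at each depth 0 .. horizon-1.
    expandable = list(range(horizon))
    branch_depths = []
    for _ in range(num_expansions):
        depth = rng.choice(expandable)  # uniform over all existing nodes
        branch_depths.append(depth)
        # The new branch rolls out to the horizon, adding fresh
        # non-terminal nodes at depths depth+1 .. horizon-1.
        expandable.extend(range(depth + 1, horizon))
    return branch_depths


depths = random_expansion_depths(horizon=6, num_expansions=200)
histogram = Counter(depths)
for depth in sorted(histogram):
    print(f"branch depth {depth}: {histogram[depth]} expansions")
```

No depth schedule appears anywhere in the code; the mix of early and late branch points falls out of sampling over existing nodes. Under these particular assumptions deeper nodes accumulate faster, so the mix is not uniform across depths. A real sampler may weight nodes differently.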

This contrasts with process-reward-model approaches that require explicit decisions about what granularity to supervise at. PRM training data has to be collected at a chosen step-level granularity — too coarse and the model cannot learn fine choices, too fine and annotation cost explodes. The granularity question is itself a design problem that Tree-GRPO sidesteps.

For RL trainers, this means a single Tree-GRPO run yields supervision at multiple resolutions as a side effect of sampling, without the per-granularity annotation effort that PRM-based training requires. The technique scales with compute budget rather than annotation budget, which is typically the easier axis to scale in production agent training.

Original note title

random tree expansion depth maps to process-supervision granularity — Tree-GRPO yields signals at varying granularity without annotation effort