Why does random tree expansion avoid the granularity design problem of process-reward models?

This explores why Tree-GRPO's random branching sidesteps a hard design choice baked into process-reward models (PRMs): deciding how finely to chop a reasoning trajectory into 'steps' worth scoring.

With a conventional PRM, someone has to decide what counts as a 'step,' annotate at that resolution, and live with the consequences: too coarse and you miss the local mistake, too fine and you drown in annotation cost and noise. The granularity, meaning how finely a trajectory is sliced before each piece is scored, is a hand-set knob.

The key move is that tree expansion makes granularity *emerge from sampling structure* rather than from design. In Tree-GRPO, branches near the root naturally produce coarse, strategy-level distinctions, while branches deeper in the tree distinguish fine-grained details. A single random expansion therefore yields supervision at multiple resolutions at once, with no annotation effort and no granularity schedule to tune (see "Does tree depth automatically produce supervision at multiple granularities?"). The step signal itself needs no extra annotation: comparing sibling subtrees converts a trajectory-level outcome reward into step-level preferences, so you never need a separately trained PRM or step labels at all (see "Can tree structure alone convert outcome rewards into process supervision?").
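To make the sibling-comparison idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Tree-GRPO's actual estimator: the `Node` class, the helper names, and the choice of "subtree mean reward minus the mean over siblings" as the preference signal are all hypothetical, as are the toy rewards. What it shows is that one tree with outcome rewards only at the leaves yields a coarse comparison at depth 0 and finer comparisons at depth 1, with no step labels anywhere.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node of a sampled rollout tree (hypothetical structure)."""
    name: str
    reward: Optional[float] = None  # outcome reward, set only on leaves
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of every leaf under this node."""
    if not node.children:
        return [float(node.reward)]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def subtree_value(node: Node) -> float:
    """Assumed value of a branch: mean outcome reward over its leaves."""
    return mean(leaf_rewards(node))


def sibling_advantages(node: Node, depth: int = 0) -> List[Tuple[int, str, float]]:
    """Compare sibling subtrees at every branching point.

    Returns (depth, child name, advantage) triples, where advantage is the
    child's subtree value minus the mean value across its siblings.
    """
    signals: List[Tuple[int, str, float]] = []
    if len(node.children) >= 2:
        values = [subtree_value(child) for child in node.children]
        baseline = mean(values)
        for child, value in zip(node.children, values):
            signals.append((depth, child.name, value - baseline))
    for child in node.children:
        signals.extend(sibling_advantages(child, depth + 1))
    return signals


# Toy tree: two strategies branch at the root, each branches once more.
tree = Node("root", children=[
    Node("A", children=[Node("A1", reward=1.0), Node("A2", reward=0.0)]),
    Node("B", children=[Node("B1", reward=0.0), Node("B2", reward=0.0)]),
])

signals = sibling_advantages(tree)
for depth, name, advantage in signals:
    print(f"depth={depth} branch={name} advantage={advantage:+.2f}")
```

In this toy tree, the depth-0 comparison separates strategy A from strategy B, while the depth-1 comparison under A separates two continuations of the same strategy. The resolution of each signal is set by where the branch happened to occur, not by a chosen step size.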

That's an instance of a broader pattern in the corpus: process supervision can be *derived from the structure of a trajectory* instead of trained as a separate model. Different methods exploit different structural features (tree topology, expert-aligned actions, tool-call positions) to turn sparse outcome rewards into dense step signals (see "Can trajectory structure replace hand-annotated process rewards?"). MCTS-based self-improvement makes the same bet from a different angle: tree search naturally ranks solution paths by success, generating quality signals that stand in for the human annotation oracle RLHF normally needs (see "Can tree search replace human feedback in LLM training?").

What's worth noticing is the contrast with the other branch of PRM research, which doesn't try to escape granularity design but to make the judge *smarter* at it. There, the trend is to have reward models reason before they score. Generative step-wise judges that meta-reason about each reasoning step outperform classifier-style PRMs with far less training data (see "Can judges that reason about reasoning outperform classifier rewards?"), and adding chain-of-thought before scoring lets reward models scale their judgment at test time (see "Can reward models benefit from reasoning before scoring?"). Those approaches still presuppose a defined step to evaluate; they invest in evaluating it well.

So the deeper answer is that there are two ways out of the granularity problem. One is to build a better evaluator. The other — Tree-GRPO's — is to change where the signal comes from: let the geometry of how you sampled the answers carry the resolution information, so 'what's a step?' stops being a knob you set and becomes a byproduct of how deep you happened to branch.

Sources (6 notes)

Does tree depth automatically produce supervision at multiple granularities?

Tree-GRPO's random expansion strategy naturally produces supervision at varying granularities—early branches provide coarse strategy-level signals while late branches provide fine-grained detail supervision. This multi-resolution signal emerges from sampling structure alone, without annotation effort or granularity scheduling.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
