INQUIRING LINE

How does early branch divergence differ from late branch divergence in supervision signals?

This explores a finding from tree-structured RL: when a model branches its reasoning early vs. late in a trajectory, the resulting training signals carry different kinds of information — coarse strategy vs. fine detail.


This reads the question as being about *where* a reasoning tree splits and what that split teaches the model. The cleanest answer in the corpus comes from Tree-GRPO: when branches diverge early in a trajectory, they fork on big strategic choices, so comparing them yields coarse, strategy-level supervision ("was this whole approach right?"). When branches diverge late, they share most of their reasoning and differ only in the final steps, so comparing them yields fine-grained, detail-level supervision ("was this particular move right?"). The striking part is that nobody schedules this. The multi-resolution signal falls out of Tree-GRPO's random expansion of the tree, with no annotation or granularity tuning required (see "Does tree depth automatically produce supervision at multiple granularities?").
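To make "early" and "late" concrete, here is a minimal sketch that measures how much context two branches share before they split. The helper names (`shared_prefix_len`, `divergence_fraction`) and the toy trajectories are illustrative assumptions, not anything taken from Tree-GRPO itself.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the leading steps that two trajectories have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def divergence_fraction(a: Sequence[str], b: Sequence[str]) -> float:
    """0.0 means the branches split at the first step; values near 1.0 mean a late split."""
    longest = max(len(a), len(b))
    return shared_prefix_len(a, b) / longest if longest else 0.0


# Hypothetical trajectories: one pair forks on strategy, the other on the last step.
base = ["plan: algebra", "expand", "solve", "answer 4"]
early_sibling = ["plan: guess-and-check", "try 3", "try 4", "answer 4"]
late_sibling = ["plan: algebra", "expand", "solve", "answer 5"]

early_split = divergence_fraction(base, early_sibling)  # strategy-level comparison
late_split = divergence_fraction(base, late_sibling)    # detail-level comparison
```

A pair with a low fraction differs in almost everything, so a reward gap between them says something about the overall approach. A pair with a high fraction differs only at the tail, so the same gap points at one specific late step.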

What makes this matter is the trick underneath it: tree branching converts a single end-of-trajectory reward into step-by-step process supervision. By comparing sibling subtrees that share a common prefix, the method localizes credit to the steps where the siblings actually diverged. One outcome score becomes many step-level preferences, without training a separate process reward model or paying for human step annotation (see "Can tree structure alone convert outcome rewards into process supervision?"). Early divergence localizes credit to early decisions; late divergence localizes it to late ones. So "early vs. late" isn't two different algorithms. It's the same comparison operating at different depths of shared context.
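The sibling comparison can be sketched in a few lines. This is an illustrative simplification, not Tree-GRPO's exact estimator: it assumes a node's value is the mean outcome reward of the leaves beneath it, and that a branch's advantage is its value minus the mean value of its siblings. The `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str
    reward: Optional[float] = None  # outcome reward, set only on leaves
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of every leaf under this node."""
    if not node.children:
        return [node.reward]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def value(node: Node) -> float:
    """Assumed value estimate: mean outcome reward over the subtree's leaves."""
    return mean(leaf_rewards(node))


def sibling_advantages(node: Node, depth: int = 0) -> List[Tuple[int, str, float]]:
    """At each branch point, score every child against the mean of its siblings."""
    results = []
    if len(node.children) >= 2:
        values = [value(c) for c in node.children]
        baseline = mean(values)
        for child, v in zip(node.children, values):
            results.append((depth, child.label, v - baseline))
    for child in node.children:
        results.extend(sibling_advantages(child, depth + 1))
    return results


# Toy tree: an early fork on strategy (A vs. B), then late forks on details.
tree = Node("root", children=[
    Node("A", children=[Node("A1", reward=1.0), Node("A2", reward=0.0)]),
    Node("B", children=[Node("B1", reward=0.0), Node("B2", reward=0.0)]),
])
advantages = sibling_advantages(tree)
```

In this toy tree the depth-0 entries compare whole strategies (A beats B), while the depth-1 entries under A compare individual final moves (A1 beats A2). Only leaf-level outcome rewards were supplied; the step-level signal comes entirely from where the branches share a prefix.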

Worth knowing: this is one of several ways the field approximates expensive process supervision with cheap outcome signals. Reverse-curriculum RL (R3) gets there from the opposite direction. It slides the reasoning start point progressively backward from near-completion, so each curriculum stage exposes failures at a different distance from the answer (see "Can curriculum learning approximate expensive process supervision?"). Tree branching exposes granularity *spatially* (where siblings split); reverse curriculum exposes it *temporally* (how far from the goal you start). Both manufacture multi-resolution feedback from outcome rewards alone.
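For contrast, the reverse-curriculum idea can be sketched as a schedule of start states. This is a minimal illustration under the assumption that a reference solution is available as a list of steps; the function name and the one-step-per-stage schedule are hypothetical, and details such as how R3 mixes or samples stages are not covered.

```python
from typing import List, Tuple


def reverse_curriculum(demo_steps: List[str]) -> List[Tuple[List[str], int]]:
    """Return (given_prefix, steps_left_to_generate) stages, easiest first.

    Stage 0 hands the policy almost the whole solution; each later stage
    moves the start point one step further back toward the beginning.
    """
    n = len(demo_steps)
    stages = []
    for start in range(n - 1, -1, -1):
        stages.append((demo_steps[:start], n - start))
    return stages


# Hypothetical three-step reference solution.
demo = ["set up equation", "isolate x", "state answer"]
stages = reverse_curriculum(demo)
```

Because the policy only receives an outcome reward at the end of each rollout, a failure at stage k but not at stage k-1 points to the step that stage k newly exposed.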

There's a broader current here too. A lot of recent work is dissolving the separate reward model entirely and deriving the training signal from the policy's own computations: pairwise self-judgment, internal belief shifts, self-distilled feedback (see "Can language models replace reward models with internal signals?"). Tree-structured supervision belongs to that family: the "reward model" is just the geometry of where your own rollouts agree and disagree. If you want to keep pulling this thread, the related question is whether agents can treat the consequences of their own actions as supervision with no external reward at all (see "Can agents learn from their own actions without external rewards?"). That is the logical endpoint of letting structure, rather than a labeler, tell you which steps were good.


Sources (5 notes)

Does tree depth automatically produce supervision at multiple granularities?

Tree-GRPO's random expansion strategy naturally produces supervision at varying granularities—early branches provide coarse strategy-level signals while late branches provide fine-grained detail supervision. This multi-resolution signal emerges from sampling structure alone, without annotation effort or granularity scheduling.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.

Can language models replace reward models with internal signals?

Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.
