INQUIRING LINE

How does relative progress estimation reduce dependence on hard labels for process supervision?

This explores how methods that judge 'how far along a solution is' — rather than 'is this exact step correct' — manage to train step-by-step reasoning without human-annotated step labels.


The shift here is from absolute step labels to relative judgments of progress as the supervision signal. The expensive part of process supervision has always been the hard label: a human marking each reasoning step correct or wrong, which is what a process reward model normally trains on. The corpus shows several routes around that bottleneck. The common move is to stop asking 'is this step right?' and start asking 'is this state closer to a solution than that one?', a relative comparison the system can answer using signals it already has.

The clearest example of relative progress estimation is reverse-curriculum RL (Can curriculum learning approximate expensive process supervision?). R3 starts the model near the finished answer and slides the start point backward step by step. Because each start position is a known distance from completion, a failure that first appears when the start moves back by one step points at the newly exposed step. The model thereby recovers step-level granularity using only the final outcome reward. No one labels the steps; the *position along the trajectory* supplies the step-level signal.
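To make the position argument concrete, here is a minimal sketch of how sliding the start point backward can localize a failure using only a pass/fail outcome. It is an illustration under stated assumptions, not R3's training loop: `reverse_curriculum_prefixes`, `first_failing_step`, and the `solves_from` callback are hypothetical names, and the toy policy stands in for a model that is rolled out from a given prefix and scored on the final answer.

```python
from typing import Callable, List, Optional


def reverse_curriculum_prefixes(demo_steps: List[str]) -> List[List[str]]:
    """Prefixes of a known-good solution, longest first.

    The first prefix leaves only the last step for the model to produce;
    the final prefix is empty, so the model must produce everything.
    """
    n = len(demo_steps)
    return [demo_steps[:k] for k in range(n - 1, -1, -1)]


def first_failing_step(
    demo_steps: List[str],
    solves_from: Callable[[List[str]], bool],  # hypothetical outcome check
) -> Optional[int]:
    """Index of the step whose exposure first turns success into failure."""
    for prefix in reverse_curriculum_prefixes(demo_steps):
        if not solves_from(prefix):
            # The step just handed to the model sits at index len(prefix).
            return len(prefix)
    return None  # the model succeeds from every start position


demo = ["parse", "set up equation", "solve", "report answer"]
# Toy stand-in for a model: it succeeds only if given the first two steps.
toy_policy = lambda prefix: len(prefix) >= 2
print(first_failing_step(demo, toy_policy))  # prints 1
```

The only feedback used is the boolean outcome, yet the result is a step index, which is the sense in which start position substitutes for a step label.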

Tree-structured methods reach the same place through comparison rather than position. Tree-GRPO converts a single trajectory-level outcome into step-level preferences by pitting sibling branches against each other (Can tree structure alone convert outcome rewards into process supervision?): if two branches diverge at a step and one leads to better outcomes, that *relative* difference becomes the step signal. The branching depth even hands you multiple resolutions for free, with coarse strategy-level signal near the root and fine-grained signal in late branches, without anyone scheduling granularity (Does tree depth automatically produce supervision at multiple granularities?). More broadly, several systems show that the *structure* of a trajectory (tree topology, expert-aligned actions, tool-call positions) can be mined for the dense step signals that hand-annotation used to provide (Can trajectory structure replace hand-annotated process rewards?).
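The sibling comparison can be sketched in a few lines. This is a simplified illustration, not Tree-GRPO's actual objective: it assumes each leaf carries a scalar outcome reward, scores an internal node by the mean outcome of the leaves beneath it, and emits a preference wherever two siblings differ. The `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    step: str
    outcome: Optional[float] = None  # set only on leaves (final reward)
    children: List["Node"] = field(default_factory=list)


def leaf_outcomes(node: Node) -> List[float]:
    """All final outcome rewards reachable from this node."""
    if not node.children:
        return [node.outcome]
    return [r for child in node.children for r in leaf_outcomes(child)]


def value(node: Node) -> float:
    """Mean outcome under a node: its estimated progress toward a solution."""
    rewards = leaf_outcomes(node)
    return sum(rewards) / len(rewards)


def sibling_preferences(node: Node, depth: int = 0) -> List[Tuple[int, str, str]]:
    """(depth, preferred step, rejected step) for every unequal sibling pair."""
    prefs = []
    kids = node.children
    for i in range(len(kids)):
        for j in range(i + 1, len(kids)):
            vi, vj = value(kids[i]), value(kids[j])
            if vi != vj:
                better, worse = (kids[i], kids[j]) if vi > vj else (kids[j], kids[i])
                prefs.append((depth, better.step, worse.step))
    for kid in kids:
        prefs.extend(sibling_preferences(kid, depth + 1))
    return prefs


tree = Node("root", children=[
    Node("A", children=[Node("A1", outcome=1.0), Node("A2", outcome=0.0)]),
    Node("B", outcome=0.0),
])
print(sibling_preferences(tree))  # prints [(0, 'A', 'B'), (1, 'A1', 'A2')]
```

The depth field shows the multi-resolution effect: the depth-0 pair compares whole strategies, while the depth-1 pair compares a late detail within one strategy.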

The deepest version of this is purely statistical. Self-supervised process reward models replace human step labels with dynamically weighted pseudo-labels and still match strong baselines (Can self-supervised process rewards replace human annotation?). And DRO shows that a single relative statistic, the variance across rollouts of the same query, can do double duty as both a token-level reward and a query filter, with no verifier at all (Can one statistical measure serve dual purposes in RL training?). Relative progress, in other words, is often something you can read off the spread of the model's own attempts.
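A rough sketch of the "one statistic, two uses" idea follows. It should be read as an assumption-laden illustration rather than DRO's actual formulation: the input layout `per_token_scores[r][t]` (a score for token `t` under rollout `r`), the function names, and the `min_spread` threshold are all hypothetical. The same variance is computed per token to weight dense rewards and per query to decide whether the rollouts are distinguishable at all.

```python
from statistics import pvariance
from typing import List


def token_weights(per_token_scores: List[List[float]]) -> List[float]:
    """Per-token variance across rollouts: tokens the rollouts disagree on weigh more."""
    n_tokens = len(per_token_scores[0])
    return [
        pvariance([rollout[t] for rollout in per_token_scores])
        for t in range(n_tokens)
    ]


def keep_query(per_token_scores: List[List[float]], min_spread: float = 1e-6) -> bool:
    """Query-level use of the same statistic: drop queries with no spread.

    If every rollout scores the same in aggregate, comparing them carries
    no signal, so the query is filtered out (min_spread is an assumed knob).
    """
    totals = [sum(rollout) for rollout in per_token_scores]
    return pvariance(totals) > min_spread


informative = [[0.5, 0.1], [0.5, 0.9]]  # rollouts disagree on the second token
degenerate = [[0.5, 0.5], [0.5, 0.5]]   # rollouts are indistinguishable

print(token_weights(informative))
print(keep_query(informative), keep_query(degenerate))  # prints True False
```

No verifier appears anywhere in the sketch; both signals come from how much the model's own attempts differ from one another.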

One caveat worth knowing: relative signals aren't free of risk. Group-relative normalization, the same mechanism that lets these methods compare rollouts cheaply, can backfire on near-impossible problems, treating a rare lucky success as a high-advantage trajectory and reinforcing shortcuts instead of reasoning (Do overly hard RLVR samples actually harm model capabilities?). So escaping hard labels trades an annotation cost for a calibration problem: the comparisons only teach good process when the problems sit in a range where relative progress actually tracks real progress.
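The arithmetic behind that failure mode is easy to check. The sketch below assumes the common group-relative form, advantage = (reward - group mean) / group standard deviation, with binary rewards and the population standard deviation; implementations differ on details such as the epsilon term, and the function name here is illustrative.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Near-impossible problem: 1 lucky success out of 16 rollouts.
hard = group_relative_advantages([1] + [0] * 15)
# Well-calibrated problem: 8 successes out of 16 rollouts.
balanced = group_relative_advantages([1] * 8 + [0] * 8)

print(round(hard[0], 3))      # prints 3.873
print(round(balanced[0], 3))  # prints 1.0
```

With binary rewards and success rate p, a successful rollout's advantage is sqrt((1 - p) / p), so it grows without bound as p approaches zero. The lone success on the hard problem gets nearly four times the push of a success on the balanced one, regardless of whether it was reached by sound reasoning or by luck.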


Sources (7 notes)

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Does tree depth automatically produce supervision at multiple granularities?

Tree-GRPO's random expansion strategy naturally produces supervision at varying granularities—early branches provide coarse strategy-level signals while late branches provide fine-grained detail supervision. This multi-resolution signal emerges from sampling structure alone, without annotation effort or granularity scheduling.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can self-supervised process rewards replace human annotation?

MetaStone-S1's SPRM achieves o3-mini-level results using dynamic weighting of pseudo-labels instead of human-annotated steps. This eliminates the annotation bottleneck for process supervision, though generalization to fuzzy-outcome domains remains unproven.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Next inquiring lines