Can trajectory structure replace hand-annotated process rewards?
Recent methods extract step-level supervision directly from how agent trajectories are structured—trees, expert alignments, tool calls—rather than training separate reward models. Can this structural approach consistently avoid annotation costs?
A pattern recurs across three 2026 methods that solve the same problem from different angles. Each converts sparse trajectory-level outcome rewards into dense step-level supervision without a separately trained process reward model and without step-level human annotations. Each does so by exploiting a structural feature of the trajectory or of the training setup itself.
The first is Tree-GRPO (Can tree structure alone convert outcome rewards into process supervision?). The structural feature is tree topology. Rollouts branch at decision points. When outcome rewards arrive at the leaves, they are propagated back up the tree. At each branching point, differences between sibling subtrees yield a preference-learning signal: sibling A did better than sibling B, so the action that led to A is reinforced over the action that led to B. A companion note (Does tree depth automatically produce supervision at multiple granularities?) adds that the depth at which divergence occurs determines the granularity of the resulting signal, and that random expansion naturally produces multi-granularity supervision in a single training run.
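The sibling comparison can be sketched in a few lines. This is a minimal illustration, not the Tree-GRPO algorithm itself: the `Node` class, the mean-of-leaves value estimate, and the mean-of-siblings baseline are all assumptions chosen for clarity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    action: str                      # hypothetical label for the step that led here
    reward: Optional[float] = None   # outcome reward, set only on leaves
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of all leaves under this node."""
    if not node.children:
        return [node.reward]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def value(node: Node) -> float:
    """Assumed value estimate: the mean outcome reward over the subtree's leaves."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards)


def sibling_advantages(node: Node, depth: int = 0) -> List[Tuple[int, str, float]]:
    """Return (depth, action, advantage) for each child of every branching node."""
    out: List[Tuple[int, str, float]] = []
    if len(node.children) > 1:
        values = [value(c) for c in node.children]
        baseline = sum(values) / len(values)   # assumed baseline: mean over siblings
        for child, v in zip(node.children, values):
            out.append((depth, child.action, v - baseline))
    for child in node.children:
        out.extend(sibling_advantages(child, depth + 1))
    return out


# Toy tree: only the leaves carry rewards.
tree = Node("root", children=[
    Node("A", children=[Node("A1", reward=1.0), Node("A2", reward=0.0)]),
    Node("B", reward=0.0),
])
signals = sibling_advantages(tree)
for depth, action, adv in signals:
    print(depth, action, adv)
```

In this toy tree the branch at depth 0 separates A from B (a coarse choice), while the branch at depth 1 separates A1 from A2 (a finer one), so a single set of leaf rewards yields signals at two granularities.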
The second is Supervised RL, or SRL (Can step-wise expert rewards help small models learn hard reasoning?). The structural feature is step-level alignment with expert demonstrations. The model is trained to produce reasoning actions, and the reward is the similarity between its actions and expert actions extracted from an SFT dataset, computed step by step. This provides dense, smooth supervision even when every rollout produces a wrong final answer, which is the regime where outcome-only RL fails entirely.
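A step-wise similarity reward might look like the following sketch. The note only says "similarity", so the use of Python's `difflib.SequenceMatcher` as the metric, the positional pairing of steps, and the function name `step_rewards` are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def step_rewards(model_steps: List[str], expert_steps: List[str]) -> List[float]:
    """Score each expert step against the model step at the same position.

    Assumption: steps are paired by index, and a missing model step is
    treated as an empty string, which scores 0.0 against any non-empty step.
    """
    rewards = []
    for i, expert in enumerate(expert_steps):
        model = model_steps[i] if i < len(model_steps) else ""
        rewards.append(SequenceMatcher(None, model, expert).ratio())
    return rewards


# The model gets the first step right, slips on the second, and never
# reaches the third, so its final answer is wrong.
expert = ["x = 2 + 3", "y = x * 4", "answer = 20"]
model = ["x = 2 + 3", "y = x * 5"]
rewards = step_rewards(model, expert)
print(rewards)
```

An outcome-only reward would give this rollout zero signal, whereas the step-wise reward still distinguishes the correct first step from the near-miss second step and the missing third.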
The third is ToolPO (Can simulated APIs and token-level credit assignment train better tool-using agents?). The structural feature is tool-call position. Rather than spreading outcome rewards uniformly across the trajectory, ToolPO attributes advantage specifically to the tokens that constitute tool invocations. A correct tool call in an ultimately successful trajectory gets positive credit; an incorrect tool call is still penalized even when the trajectory succeeds despite it.
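The credit-assignment idea can be illustrated with a per-token advantage vector. This is a sketch under stated assumptions, not ToolPO's actual objective: the additive combination of an outcome advantage with a per-call term, the `call_weight` parameter, and the span format `(start, end, correct)` are all hypothetical.

```python
from typing import List, Tuple


def token_advantages(
    num_tokens: int,
    tool_spans: List[Tuple[int, int, bool]],
    outcome_adv: float,
    call_weight: float = 1.0,
) -> List[float]:
    """Build a per-token advantage vector.

    Every token receives the trajectory-level outcome advantage. Tokens inside
    a tool-call span [start, end) additionally receive +call_weight if the call
    was correct and -call_weight if it was not (an assumed combination rule).
    """
    adv = [outcome_adv] * num_tokens
    for start, end, correct in tool_spans:
        delta = call_weight if correct else -call_weight
        for t in range(start, end):
            adv[t] += delta
    return adv


# A successful trajectory (positive outcome advantage) containing one correct
# tool call at tokens 1-2 and one incorrect call at tokens 5-6.
adv = token_advantages(
    num_tokens=8,
    tool_spans=[(1, 3, True), (5, 7, False)],
    outcome_adv=0.5,
)
print(adv)
```

With these numbers the tokens of the incorrect call end up with a negative advantage even though the trajectory as a whole succeeded, which matches the behavior described above.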
These are three implementations of one design principle: structural features of the trajectory can substitute for separately-trained or hand-annotated process supervision.
The principle matters because process supervision has been the expensive part of agent RL. Process reward models (PRMs) require step-level annotated training data — costly to collect and brittle to construct. Annotation-heavy alternatives have the same problem. The methods catalogued here demonstrate that for at least three trajectory structures (tree topology, expert-aligned action sequences, tool-call positions), the supervision signal is already present in the structure — it just needs to be read out correctly.
The principle generalizes beyond the three methods. Wherever a trajectory has identifiable structural features that correlate with intermediate decision quality, those features can serve as supervision. Action segmentation, attention pattern variance, retrieval call patterns, plan-execution branching — all are candidates. The design space has barely been explored.
Two related earlier notes complete the cluster. Does supervising retrieval steps outperform final answer rewards? establishes empirically that process supervision wins over outcome-only RL for agentic systems — the motivating result that makes this synthesis matter. Why do standard process reward models fail on thinking traces? shows that traditional PRMs degrade when trajectory structure becomes non-linear — exactly the regime where structural-feature methods like Tree-GRPO win.
The methodological lesson: when annotation is the bottleneck, look for structural substitutes. Trajectory geometry is information, and reading it out requires no additional annotation.
Source: synthesis across Tasks Planning, Training Fine Tuning, Deep Research
Related concepts in this collection
- Can tree structure alone convert outcome rewards into process supervision?
  Tree-based rollouts naturally create step-level preference signals by comparing sibling subtrees. Can this structural approach replace separate process reward models without explicit step-level annotation?
  instance 1: tree topology as supervision source
- Does tree depth automatically produce supervision at multiple granularities?
  Tree-search rollouts branch at different depths, potentially creating supervision signals ranging from coarse strategy-level to fine-grained detail-level choices. Does this depth variation naturally yield multi-granular process supervision without explicit annotation design?
  sharpens instance 1: multi-granularity emerges from sampling structure
- Can shared-prefix trees reduce redundancy in agent rollouts?
  Independent rollouts waste tokens regenerating similar early-turn sequences. Can structuring rollouts as shared-prefix trees instead preserve early computation across samples while maintaining statistical diversity for advantage estimation?
  secondary property of Tree-GRPO that makes the supervision viable in production
- Can step-wise expert rewards help small models learn hard reasoning?
  When small models fail on hard multi-step problems, can training them to match expert reasoning steps rather than final answers provide useful learning signals? This explores whether intermediate-step alignment might overcome the limitations of both supervised fine-tuning and outcome-based reinforcement learning.
  instance 2: expert-step alignment as supervision source
- Can simulated APIs and token-level credit assignment train better tool-using agents?
  Training agents to use real APIs is expensive and unstable, and sparse rewards make it hard to credit the right tool calls. Can combining LLM simulators with fine-grained advantage attribution solve both problems?
  instance 3: tool-call positions as supervision source
- Does supervising retrieval steps outperform final answer rewards?
  Can intermediate feedback on retrieval decisions (which documents to fetch, when to stop) train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.
  motivating empirical result: process supervision wins over outcome-only RL
- Why do standard process reward models fail on thinking traces?
  Existing PRMs assume clean, sequential steps but reasoning models produce messy trajectories with branching and backtracking. Understanding this mismatch could improve how we supervise and evaluate exploratory reasoning.
  why traditional PRMs fail in exactly the regime where structural methods win
- Can RL agents learn to reason better, not just succeed?
  Standard outcome-only RL rewards agents for any successful trajectory, even flawed ones. Can we instead train agents to demonstrate genuine reasoning quality by rewarding the metacognitive process itself?
  adjacent: yet another route to process supervision via verifiable meta-reasoning tags
- Can optimizing attention patterns improve multimodal RL better than optimizing tokens?
  Standard RL training optimizes token outputs in multimodal models, but the real bottleneck may be where the model attends to visual information. Does steering attention directly outperform indirect optimization through final outputs?
  adjacent: process-vs-outcome principle applied to attention rather than to step-level actions
Original note title
process supervision can be derived from structural features of agent trajectories — sidestepping the annotation cost of process reward models