
Can trajectory structure replace hand-annotated process rewards?

Recent methods extract step-level supervision directly from how agent trajectories are structured—trees, expert alignments, tool calls—rather than training separate reward models. Can this structural approach consistently avoid annotation costs?

Note · 2026-05-18

A pattern recurs across three 2026 methods that attack the same problem from different angles. Each converts sparse trajectory-level outcome rewards into dense step-level supervision without a separately trained process reward model and without step-level human annotations. Each does so by exploiting a structural feature of the trajectory or of the training setup itself.

The first is Tree-GRPO (see "Can tree structure alone convert outcome rewards into process supervision?"). The structural feature is tree topology. Rollouts branch at decision points. When outcome rewards arrive at the leaves, they propagate back up the tree. At each branching point, differences between sibling subtrees yield a preference-learning signal: if sibling A did better than sibling B, the action that led to A is reinforced over the one that led to B. As discussed in "Does tree depth automatically produce supervision at multiple granularities?", the depth at which divergence occurs determines the granularity of the resulting signal, and random expansion naturally produces multi-granularity supervision in a single training run.
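The backup-and-compare idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Tree-GRPO's actual objective: the `Node` class, the use of a mean over leaf rewards as a node's value, and the choice of the sibling mean as a baseline are all hypothetical simplifications chosen to make the mechanism visible.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One step in a rollout tree (hypothetical structure); leaves carry an outcome reward."""
    name: str
    reward: Optional[float] = None  # set only on leaves
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of all leaves under `node`."""
    if not node.children:
        return [node.reward if node.reward is not None else 0.0]
    collected: List[float] = []
    for child in node.children:
        collected.extend(leaf_rewards(child))
    return collected


def value(node: Node) -> float:
    """Back up outcome rewards: assume a node's value is the mean reward of its leaves."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards)


def sibling_advantages(root: Node) -> Dict[str, float]:
    """At every branching point, score each child against the mean value of its siblings."""
    advantages: Dict[str, float] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.children) > 1:  # only branching points produce a signal
            values = [value(child) for child in node.children]
            baseline = sum(values) / len(values)
            for child, child_value in zip(node.children, values):
                advantages[child.name] = child_value - baseline
        stack.extend(node.children)
    return advantages


# Two branches from the root; only one leaf under A succeeds.
tree = Node("root", children=[
    Node("A", children=[Node("A1", reward=1.0), Node("A2", reward=0.0)]),
    Node("B", children=[Node("B1", reward=0.0), Node("B2", reward=0.0)]),
])
advantages = sibling_advantages(tree)
```

In this toy tree the branch at the root gives a coarse signal (A over B) and the branch under A gives a finer one (A1 over A2), which is the sense in which branching depth sets the granularity of supervision. The branch under B yields zero advantage for both children because neither leaf succeeds.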

The second is Supervised RL, or SRL (see "Can step-wise expert rewards help small models learn hard reasoning?"). The structural feature is step-level alignment with expert demonstrations. The model is trained to produce reasoning actions, and the reward is the similarity between its actions and expert actions extracted from an SFT dataset, computed step by step. This provides dense, smooth supervision even when every rollout produces a wrong final answer, the regime where outcome-only RL fails entirely.
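A step-wise similarity reward of this kind can be sketched as follows. The note does not say which similarity measure SRL uses, so `difflib.SequenceMatcher` from the Python standard library is an assumed stand-in, and the function names and the position-by-position alignment of steps are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def step_reward(model_action: str, expert_action: str) -> float:
    """Similarity in [0, 1] between one model action and the aligned expert action."""
    return SequenceMatcher(None, model_action, expert_action).ratio()


def stepwise_rewards(model_steps: List[str], expert_steps: List[str]) -> List[float]:
    """One reward per expert step; a step the model never produced scores 0.0.

    Assumes steps are aligned by position, which is a simplification.
    """
    rewards = []
    for i, expert_action in enumerate(expert_steps):
        model_action = model_steps[i] if i < len(model_steps) else ""
        rewards.append(step_reward(model_action, expert_action))
    return rewards


expert_steps = ["expand (x+1)^2", "collect terms", "solve for x"]
model_steps = ["expand (x+1)^2", "collect like terms"]  # stops early, no final answer
rewards = stepwise_rewards(model_steps, expert_steps)
```

Here the rollout never reaches a final answer, so an outcome-only reward would be zero everywhere, yet the first step earns full credit, the second earns partial credit, and only the missing third step earns none.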

The third is ToolPO (see "Can simulated APIs and token-level credit assignment train better tool-using agents?"). The structural feature is tool-call position. Rather than spreading outcome rewards uniformly across the trajectory, ToolPO attributes advantage specifically to the tokens that constitute tool invocations. A correct tool call in an ultimately successful trajectory gets positive credit; an incorrect tool call is still penalized even when the trajectory succeeds despite it.
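Token-level credit assignment of this kind might look like the sketch below. It is an illustration under assumptions rather than ToolPO's actual formula: the `ToolCall` record, the `call_weight` parameter, and the choice to overwrite (rather than add to) the outcome advantage on tool-call tokens are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolCall:
    """Token span of one tool invocation (hypothetical record)."""
    start: int      # index of the first token in the call (inclusive)
    end: int        # index one past the last token (exclusive)
    correct: bool   # whether the call itself was judged correct


def token_advantages(
    num_tokens: int,
    outcome_advantage: float,
    tool_calls: List[ToolCall],
    call_weight: float = 1.0,
) -> List[float]:
    """Per-token advantages: outcome credit by default, call-specific credit on tool tokens."""
    advantages = [outcome_advantage] * num_tokens
    for call in tool_calls:
        if not (0 <= call.start < call.end <= num_tokens):
            raise ValueError(f"tool-call span {call.start}:{call.end} is out of range")
        # Assumed rule: a call's own correctness decides the sign on its tokens,
        # regardless of how the trajectory as a whole turned out.
        call_advantage = call_weight if call.correct else -call_weight
        for t in range(call.start, call.end):
            advantages[t] = call_advantage
    return advantages


# A successful trajectory (positive outcome advantage) containing one good and one bad call.
advantages = token_advantages(
    num_tokens=8,
    outcome_advantage=0.5,
    tool_calls=[ToolCall(1, 3, correct=True), ToolCall(5, 7, correct=False)],
)
```

The tokens of the incorrect call end up with a negative advantage even though the trajectory as a whole succeeded, which matches the behaviour described above.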

These are three implementations of one design principle: structural features of the trajectory can substitute for separately trained or hand-annotated process supervision.

The principle matters because process supervision has been the expensive part of agent RL. Process reward models (PRMs) require step-level annotated training data — costly to collect and brittle to construct. Annotation-heavy alternatives have the same problem. The methods catalogued here demonstrate that for at least three trajectory structures (tree topology, expert-aligned action sequences, tool-call positions), the supervision signal is already present in the structure — it just needs to be read out correctly.

The principle should generalize beyond these three methods. Wherever a trajectory has identifiable structural features that correlate with the quality of intermediate decisions, those features can serve as supervision. Action segmentation, attention-pattern variance, retrieval-call patterns, and plan-execution branching are all candidates. The design space has barely been explored.

Two related earlier notes complete the cluster. "Does supervising retrieval steps outperform final answer rewards?" establishes empirically that process supervision beats outcome-only RL for agentic systems, which is the motivating result that makes this synthesis matter. "Why do standard process reward models fail on thinking traces?" shows that traditional PRMs degrade when trajectory structure becomes non-linear, exactly the regime where structural-feature methods like Tree-GRPO win.

The methodological lesson: when annotation is the bottleneck, look for structural substitutes. Trajectory geometry is information, and reading it out requires no step-level annotation.


Source: synthesis across Tasks Planning, Training Fine Tuning, Deep Research

Original note title

process supervision can be derived from structural features of agent trajectories — sidestepping the annotation cost of process reward models