INQUIRING LINE

Can trajectory structure replace hand-annotated process reward models entirely?

This explores whether the *shape* of an agent's solution path — its branches, tool calls, expert-matched actions — can manufacture step-by-step reward signals on its own, retiring the costly human-labeled process reward models (PRMs) that grade each reasoning step.


This asks whether trajectory structure (the branching, tool-call positions, and action sequences inside an agent's solution path) can produce step-level reward signals by itself, making hand-annotated process reward models unnecessary. The corpus says largely yes: the annotation bottleneck is breaking, but "entirely" overstates it. The strongest evidence is that several independent methods (Tree-GRPO, Supervised RL, and ToolPO) now convert sparse outcome rewards into dense step signals purely from structure (see "Can trajectory structure replace hand-annotated process rewards?"). The cleanest case is tree search: comparing sibling subtrees that branch from the same point shows which steps mattered without anyone labeling them, so outcome rewards become process supervision automatically, and the signal gets richer the more compute you spend (see "Can tree structure alone convert outcome rewards into process supervision?"). Other structural hooks work too. LongTraceRL mines what a search agent reads but doesn't cite as the hardest distractors, and it applies rubric rewards only to correct answers, which structurally blocks the model from fabricating its own reward (see "Can search agent behavior yield reliable process rewards for reasoning?").
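The sibling-subtree comparison can be sketched in a few lines. This is a minimal illustration, not the Tree-GRPO objective itself: the `Node` class, the `step_advantages` helper, and the choice of "subtree value minus the sibling-group mean" as the step signal are all assumptions made for the example. It only shows how leaf-level outcome rewards can be turned into a per-step signal with no step labels.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Node:
    """One step in a rollout tree (hypothetical structure for this sketch)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    outcome: Optional[float] = None  # outcome reward, set only on leaves


def leaf_outcomes(node: Node) -> List[float]:
    """Collect the outcome rewards of every leaf below (or at) this node."""
    if not node.children:
        return [float(node.outcome)]
    return [r for child in node.children for r in leaf_outcomes(child)]


def subtree_value(node: Node) -> float:
    """Average outcome reward reachable from this step."""
    return mean(leaf_outcomes(node))


def step_advantages(root: Node) -> Dict[str, float]:
    """Score each step against its siblings: value minus the sibling-group mean."""
    advantages: Dict[str, float] = {}
    stack = [root]
    while stack:
        parent = stack.pop()
        if not parent.children:
            continue
        values = [subtree_value(child) for child in parent.children]
        baseline = mean(values)
        for child, value in zip(parent.children, values):
            advantages[child.name] = value - baseline
            stack.append(child)
    return advantages


# Toy tree: two first steps, each expanded into two completions.
root = Node("root", [
    Node("A", [Node("a1", outcome=1.0), Node("a2", outcome=1.0)]),
    Node("B", [Node("b1", outcome=0.0), Node("b2", outcome=1.0)]),
])
advantages = step_advantages(root)
# Step A scores above its sibling B because every completion under A succeeds.
```

Only the leaves carry a reward here, yet every intermediate step ends up with a signed score. Widening the tree adds more sibling comparisons, which is the sense in which the signal scales with compute.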

But the replacement isn't only structural; it's also happening through self-supervision and generation. Self-supervised PRMs reach o3-mini-level performance using dynamically weighted pseudo-labels instead of human step annotations, though generalization to domains with fuzzy outcomes remains unproven (see "Can self-supervised process rewards replace human annotation?"). A parallel line replaces *discriminative* PRMs (classify each step as good or bad) with *generative* judges that reason about the reasoning. Multiple teams find these win with orders of magnitude less training data (see "Can judges that reason about reasoning outperform classifier rewards?"), echoing the broader finding that letting a reward model think before it scores raises its ceiling (see "Can reward models benefit from reasoning before scoring?"). So "replace the PRM" is really two moves at once: drop the human annotations, and often drop the classifier-style PRM too.

The honest qualifier is that trajectory format is harder to score than it looks. Thinking traces branch, backtrack, and are less coherent than polished answers, so a PRM ported naively onto them degrades. You need one built to treat a failed step as informative exploration rather than an error (see "Why do standard process reward models fail on thinking traces?"). That cuts both ways: the same structural richness that lets you derive rewards is what makes structure-blind scoring fail.

The more interesting catch is what pure structure leaves on the table. Agent feedback decomposes into two orthogonal channels, *evaluative* (how good was this) and *directive* (how should it change), and any scalar-from-structure signal captures the first while discarding the second (see "Can scalar rewards capture all the information in agent feedback?"). That is why models plateau on numerical rewards and then jump when handed natural-language critiques explaining *why* they failed (see "Can natural language feedback overcome numerical reward plateaus?"). There are also failure modes structure won't fix on its own: binary outcome rewards quietly wreck calibration, because a confident wrong answer costs no more than a hesitant one (see "Does binary reward training hurt model calibration?"), and treating successes and failures identically wastes the asymmetry between them (see "Should successful and failed episodes be processed differently?").
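The calibration point can be made concrete with a small sketch of a binary reward with a Brier-score penalty added, as the calibration note in the sources describes. The function names and the equal weighting of the two terms are assumptions for illustration, not a reproduction of any particular paper's setup.

```python
def calibrated_reward(correct: bool, confidence: float) -> float:
    """Binary correctness reward minus the Brier penalty (confidence - y)^2."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected calibrated reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


# A plain binary reward gives 0 to both wrong answers below; the Brier term
# separates them, so a confident wrong answer is punished much harder.
confident_wrong = calibrated_reward(False, 0.9)   # about -0.81
hesitant_wrong = calibrated_reward(False, 0.1)    # about -0.01

# For a question the model gets right 30% of the time, scan stated confidences
# on a 0.01 grid and keep the one with the highest expected reward.
grid = [i / 100 for i in range(101)]
best_confidence = max(grid, key=lambda q: expected_reward(0.3, q))
```

Because the Brier score is a proper scoring rule, the expected reward peaks when the stated confidence equals the true success probability, so `best_confidence` lands at 0.3 rather than at 1.0. Under the plain binary reward, stated confidence has no effect on the reward at all.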

So the surprising takeaway: the field isn't choosing between hand-annotation and trajectory structure — it's discovering that *the supervision was latent in the rollouts all along*, recoverable through tree topology, expert alignment, or read-but-uncited distractors. Hand annotation is being replaced. But "entirely" is the wrong frame, because the richest signals (directive feedback, calibration-aware scoring, language critiques) aren't structural at all — they're a different axis of information that lives alongside structure, not inside it. The likely endgame is hybrid: structure for free dense rewards, generative or language feedback for the *why*.


Sources (11 notes)

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can search agent behavior yield reliable process rewards for reasoning?

LongTraceRL mines entity-level reasoning signals from what search agents read but don't cite—the hardest distractors—and applies rubric rewards only to correct answers, structurally blocking reward fabrication while capturing intermediate reasoning quality.

Can self-supervised process rewards replace human annotation?

MetaStone-S1's SPRM achieves o3-mini-level results using dynamic weighting of pseudo-labels instead of human-annotated steps. This eliminates the annotation bottleneck for process supervision, though generalization to fuzzy-outcome domains remains unproven.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Why do standard process reward models fail on thinking traces?

Standard PRMs degrade on trajectory format because thinking traces include branching, backtracking, and weaker coherence than polished responses. ReasonFlux-PRM addresses this by supervising both trajectories and responses, treating failed steps as informative exploration rather than errors.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Next inquiring lines