Reinforcement Learning for LLMs

Can self-supervised process rewards replace human annotation?

Self-supervised PRMs learn from outcome labels alone, avoiding expensive step-level annotation. The key question is whether this approach generalizes beyond math and code to domains with ambiguous correctness.

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

Process Reward Models (PRMs) provide step-level feedback that outperforms outcome-level evaluation for test-time scaling. But training them requires expensive step-level human annotations — a bottleneck that limits scale.
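
To make the test-time role of a PRM concrete, here is a minimal sketch of PRM-guided best-of-N selection. The sampler `generate_candidates` and the scorer `prm_step_score` are hypothetical placeholders (not from the source), and min-over-steps aggregation is one common choice rather than a prescribed one.

```python
# Minimal sketch of PRM-guided best-of-N selection at inference time.
# `generate_candidates` and `prm_step_score` are hypothetical stand-ins for
# a sampler and a trained process reward model.
from typing import Callable

def split_steps(solution: str) -> list[str]:
    """Split a chain-of-thought solution into steps (here: one step per non-empty line)."""
    return [s for s in solution.splitlines() if s.strip()]

def score_solution(solution: str, prm_step_score: Callable[[str, str], float]) -> float:
    """Aggregate step-level PRM scores; min-over-steps is a common choice
    because one bad step can invalidate the whole chain."""
    steps = split_steps(solution)
    if not steps:
        return 0.0
    return min(prm_step_score(solution, step) for step in steps)

def best_of_n(prompt: str,
              generate_candidates: Callable[[str, int], list[str]],
              prm_step_score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate solutions and return the one the PRM ranks highest."""
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=lambda c: score_solution(c, prm_step_score))
```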

MetaStone-S1's Self-supervised PRM (SPRM) addresses this: it learns process evaluation from outcome labels alone, using a self-supervised dynamic-weighting scheme that upweights steps whose pseudo-labels (the SPRM's own step predictions) agree with the correctness of the final answer. No human annotation of intermediate steps is required.
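
A sketch of how that dynamic weighting could look, assuming the simplest reading of the description above: every step inherits the outcome label as its pseudo-label, and steps whose own prediction disagrees with the outcome are down-weighted. The 0.5 threshold and hard 0/1 weights are illustrative assumptions, not the published MetaStone-S1 recipe.

```python
# Illustrative SPRM-style self-supervised weighting, sketched in PyTorch.
# The exact MetaStone-S1 formulation may differ; threshold and hard masking
# are assumptions for illustration.
import torch
import torch.nn.functional as F

def sprm_loss(step_scores: torch.Tensor, outcome_label: float,
              threshold: float = 0.5) -> torch.Tensor:
    """step_scores: (num_steps,) SPRM probabilities that each step is 'good'.
    outcome_label: 1.0 if the final answer was correct, else 0.0.
    Every step inherits the outcome label as its target; steps whose own
    prediction disagrees with that target are down-weighted (here: masked out)."""
    targets = torch.full_like(step_scores, outcome_label)
    # Self-consistency check: does the SPRM's own step prediction agree with the outcome?
    pseudo = (step_scores > threshold).float()
    weights = (pseudo == targets).float()            # 1 where consistent, 0 otherwise
    per_step = F.binary_cross_entropy(step_scores, targets, reduction="none")
    denom = weights.sum().clamp(min=1.0)             # avoid dividing by zero
    return (weights * per_step).sum() / denom
```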

The result matches OpenAI o3-mini performance with a 32B parameter model — evidence that self-supervised process supervision can work. But the open question is breadth: math and code have clear, verifiable outcomes (right/wrong is unambiguous). Can the same approach work in domains where outcome correctness is fuzzy — reasoning about complex social situations, medical diagnosis, open-ended writing?

The scale argument for SPRMs is strong: if you can eliminate step-level annotation, you can train PRMs on any domain where outcome labels exist. That's a massive expansion of the training data available for process supervision. The question is whether the quality holds.

Supporting evidence for AI evaluation quality from domain summarization: persona-based summarization of healthcare documents (doctor, patient, general public personas) evaluated with GPT-4 as critic achieved good concordance with human-based critiquing of the same summaries. The finding is domain-specific but points in the same direction — AI evaluation can match human judgment quality in structured evaluation tasks, at least when the evaluation criteria are sufficiently well-defined. This suggests the domain generalization question for SPRMs may be more tractable than the open question implies.

Trajectory-aware PRMs: ReasonFlux-PRM identifies a new requirement as reasoning models adopt the trajectory-response output format (a lengthy exploratory thinking trajectory followed by a polished final response). Standard PRMs, trained on final responses, fail to supervise intermediate thinking trajectories because: (1) thinking trajectories contain branching and self-revision that linear final responses don't; (2) thinking trajectories have weaker global coherence across steps. ReasonFlux-PRM adds trajectory-level supervision alongside step-level supervision to handle both components. The upshot: as R1-style models become standard, the PRM training problem bifurcates. You need a PRM that can evaluate both the exploratory trace AND the polished response, not just the latter. Self-supervised approaches must be extended to handle the trajectory-response format explicitly.
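
A hedged sketch of what trajectory-aware training could add on top of step-level supervision, assuming separate reward heads for the steps of the final response and for the whole thinking trajectory. The blending weight `alpha` and both heads are assumptions for illustration; the actual ReasonFlux-PRM objective may differ.

```python
# Sketch of blending step-level and trajectory-level supervision for
# trajectory-response outputs, in PyTorch. Both heads and `alpha` are
# illustrative assumptions, not the published ReasonFlux-PRM objective.
import torch
import torch.nn.functional as F

def trajectory_prm_loss(step_scores: torch.Tensor,   # (num_steps,) per-step probabilities
                        traj_score: torch.Tensor,    # 0-dim probability for the whole trajectory
                        step_labels: torch.Tensor,   # (num_steps,) 0/1 targets
                        traj_label: torch.Tensor,    # 0-dim 0/1 target
                        alpha: float = 0.5) -> torch.Tensor:
    """Blend per-step supervision (final-response quality) with trajectory-level
    supervision (coherence of the exploratory thinking trace)."""
    step_loss = F.binary_cross_entropy(step_scores, step_labels)
    traj_loss = F.binary_cross_entropy(traj_score, traj_label)
    return alpha * step_loss + (1.0 - alpha) * traj_loss
```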


Source: Test Time Compute; enriched from Domain Specialization, Reasoning Methods CoT ToT

Original note title: self-supervised process reward models could replace human-annotated PRMs at scale