Reinforcement Learning for LLMs

Can judges that reason about reasoning outperform step classifiers?

Does framing step-level reward as a reasoning task rather than classification improve how well models evaluate intermediate steps in chains of thought? This matters because current process reward models lack transparency and struggle to generalize.

Note · 2026-02-22 · sourced from Reinforcement Learning

Current process reward models (PRMs) have two major limitations: they function as black-box classifiers that provide scores without explanations, and their reliance on supervised fine-tuning (SFT) over static datasets limits generalization. StepWiser addresses both by reframing stepwise reward as a reasoning task rather than a classification task.

The architecture has three components. First, self-segmentation: the base policy model learns to segment its own chains-of-thought into coherent "chunks of thought" — each representing a complete logical leap rather than arbitrary step boundaries. This reduces total segments and produces more informative units. Second, chunk annotation: each chunk receives a binary label by comparing outcomes of rollouts starting before and after the chunk. Third, RL training: the judge model is trained via GRPO to produce judgment reasoning chains (reasoning about reasoning) before delivering a verdict.
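A minimal sketch of the chunk-annotation step, under the assumption that a chunk is labeled positive when continuations sampled after it succeed at least as often as continuations branching from just before it. `sample_rollouts` and `is_correct` are hypothetical helpers standing in for the policy's sampler and the task's answer checker.

```python
def estimate_success(policy, question, prefix, answer, n=8):
    """Fraction of n sampled continuations from `prefix` that reach the correct answer."""
    rollouts = sample_rollouts(policy, question, prefix, n=n)  # hypothetical sampler
    return sum(is_correct(r, answer) for r in rollouts) / n    # hypothetical checker

def annotate_chunks(policy, question, answer, chunks, n=8):
    """Binary label per chunk: does keeping the chunk help versus branching before it?"""
    labels, prefix = [], ""
    for chunk in chunks:
        q_before = estimate_success(policy, question, prefix, answer, n)
        q_after = estimate_success(policy, question, prefix + chunk, answer, n)
        # Assumption: a chunk is "good" if rollouts continuing after it succeed
        # at least as often as rollouts that restart from just before it.
        labels.append(int(q_after >= q_before))
        prefix += chunk
    return labels
```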

The self-segmentation is critical. Current methods segment at "Step 1, Step 2" markers or double line breaks, producing fragments that are neither logically complete nor self-contained. StepWiser's segments each serve a single clear objective — setting up an equation, executing a calculation, stating a conclusion. This gives the judge model meaningful units to evaluate.
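To make the contrast concrete, here is a hedged sketch of both segmentation styles; the `<chunk>` delimiter is a hypothetical stand-in for whatever boundary marker the policy learns to emit between complete logical leaps.

```python
import re

def naive_segments(cot: str) -> list[str]:
    # Baseline segmentation: split on "Step N" markers or blank lines,
    # which often yields fragments that are neither complete nor self-contained.
    return [s.strip() for s in re.split(r"\n\s*\n|Step \d+[:.]", cot) if s.strip()]

def chunk_segments(cot: str, delimiter: str = "<chunk>") -> list[str]:
    # Self-segmentation sketch: the policy is assumed to have learned to emit an
    # explicit delimiter between chunks, each serving a single clear objective.
    return [s.strip() for s in cot.split(delimiter) if s.strip()]
```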

The meta-reasoning aspect — the judge reasoning about the policy model's reasoning — is what distinguishes this from traditional PRMs. The judge doesn't just classify steps as correct/incorrect; it articulates WHY a step is correct or flawed. Building on the direction explored in Can self-supervised process rewards replace human annotation?, StepWiser advances this further by making the reward model generative and explainable.
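A rough sketch of what a generative judge call could look like: the judge emits a critique of the chunk and ends with a verdict that is parsed into a binary signal. The prompt wording, the "Verdict:" convention, and `judge.generate` are all assumptions for illustration, not the paper's exact interface.

```python
JUDGE_PROMPT = """\
Problem: {question}
Solution so far: {prefix}
Candidate chunk: {chunk}

Analyze whether the candidate chunk is a correct and useful next step.
Reason it through, then end with "Verdict: GOOD" or "Verdict: BAD"."""

def judge_chunk(judge, question, prefix, chunk):
    # The judge generates a reasoning chain about the chunk, then a verdict.
    critique = judge.generate(JUDGE_PROMPT.format(
        question=question, prefix=prefix, chunk=chunk))  # hypothetical generation call
    verdict = critique.rstrip().rsplit("Verdict:", 1)[-1].strip().upper()
    return critique, verdict.startswith("GOOD")
```

During GRPO training the parsed verdict can be rewarded for matching the rollout-derived chunk label; at inference time the same call doubles as an explainable step score.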

The practical results: better judgment accuracy on intermediate steps, improved policy model training, and better inference-time search. The approach also connects to an emerging pattern: given the concerns raised in Does chain of thought reasoning actually explain model decisions?, having a dedicated judge that explicitly reasons about reasoning quality may be more reliable than relying on the reasoning trace itself.
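One way the judge can drive inference-time search is chunk-level best-of-n, sketched below under stated assumptions: `policy.next_chunk` and `policy.is_finished` are hypothetical methods, and `judge_chunk` is the sketch above. The point is that flawed steps get pruned as they are generated rather than only rejecting whole solutions at the end.

```python
def chunkwise_best_of_n(policy, judge, question, max_chunks=16, n=4):
    # At each step, sample several candidate chunks and keep one the judge accepts.
    prefix = ""
    for _ in range(max_chunks):
        candidates = [policy.next_chunk(question, prefix) for _ in range(n)]
        accepted = next((c for c in candidates
                         if judge_chunk(judge, question, prefix, c)[1]),
                        candidates[0])  # fall back to the first sample if all are rejected
        prefix += accepted
        if policy.is_finished(prefix):
            break
    return prefix
```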

Dual confirmation from GenPRM and ThinkPRM: Two independent papers reinforce the generative-over-discriminative advantage with striking data efficiency results. GenPRM shows that a 1.5B generative PRM outperforms GPT-4o as a discriminative verifier — the generation objective forces the model to understand why a step is correct or flawed, not just classify it. ThinkPRM demonstrates even more extreme efficiency: using only 1% of the PRM800K dataset beats full-dataset discriminative PRMs, because the reasoning-before-judging approach extracts more signal per training example. Both confirm that process verification benefits from the same "think before judging" principle that makes generative approaches more data-efficient across domains. See Can generative reasoning improve process reward model efficiency?.


Source: Reinforcement Learning, RLVR

Original note title: generative stepwise judges that meta-reason about reasoning steps outperform classifier-based process reward models