StepWiser: Stepwise Generative Judges for Wiser Reasoning

Paper · arXiv 2508.19229 · Published August 26, 2025
Reinforcement Learning · Reward Models · RLVR

As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model’s reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, STEPWISER, is trained by reinforcement learning using relative outcomes of rollouts. We show that it (i) provides better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.

As large language models (LLMs) increasingly tackle complex problems, they rely on multi-step reasoning strategies like Chain-of-Thought (CoT) (Wei et al., 2022) and ReAct (Yao et al., 2022) to decompose tasks and formulate better solutions. Consequently, ensuring these intermediate reasoning steps possess logical validity has become a critical research challenge. Process Reward Models (PRMs) have emerged as a potential tool to meet this need, providing step-by-step feedback to supervise learning, instead of relying on a single, often sparse, outcome-based reward (Lightman et al., 2023; Wang et al., 2023). However, this approach suffers from two major drawbacks. First, current PRMs typically function as “black-box” classifiers, providing a score or label without explaining why a step is correct or flawed. Second, their reliance on supervised fine-tuning (SFT) with static datasets can limit their ability to generalize to new reasoning patterns (Lightman et al., 2023; Luo et al., 2024; Wang et al., 2023; Xiong et al., 2024b; Zhang et al., 2024a). In contrast, reasoning models themselves are trained to produce CoTs with reinforcement learning (RL) for best performance (DeepSeek-AI et al., 2025).

In this paper we propose to reward intermediate reasoning steps by first reasoning about those reasoning steps, before making a judgment – a meta-reasoning process which itself is trained by RL. Our overall method (as shown in Figure 1) to build such a stepwise generative judge involves three components: (1) a new self-segmentation technique to equip the base policy model with the ability to produce coherent and informative reasoning chunks (chunks-of-thought); (2) assignment of target rewards to chunks via relative outcomes of rollouts; and (3) online training of judgment reasoning chains (i.e., reasoning about reasoning) and final reward judgments via RL. Our stepwise judge, termed STEPWISER, can then be used to provide rewards either at training time or inference time in order to improve the reasoning ability of the policy model.

As depicted in Figure 1, our overall method STEPWISER consists of three components:

• We equip the base policy model with the ability to self-segment Chains-of-Thought into coherent and informative reasoning chunks, called Chunks-of-Thought. This is done by creating SFT data with informative segments, so that the model can be trained to self-segment. We show that this causes no loss in performance for the base model.

• Given the chunks generated by the policy model, we annotate each chunk with a binary target label to create training data for our generative stepwise judge. This is done by comparing the outcome rewards of rollouts launched before and after the given chunk (see the sketch after this list).

• We perform online RL training using GRPO (Group Relative Policy Optimization), which trains our stepwise judge model to produce a judgment reasoning chain (i.e., reasoning about reasoning) followed by a final verdict, and rewards verdicts that match the chunk labels from the previous step.
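To make the second and third components concrete, below is a minimal Python sketch of rollout-based chunk labeling and the judge's RL reward. Everything here is an assumption for illustration: `rollout_from` is a hypothetical helper that samples completions from the policy given a prefix and returns their outcome rewards, the decision rule `q_after >= q_before` is just one plausible instantiation of "relative outcomes of rollouts", and the Yes/No verdict-parsing convention is invented; the paper's exact estimators and formats may differ.

```python
import statistics

def label_chunks(problem, chunks, rollout_from, n_rollouts=8):
    """Assign each chunk a binary target label by comparing outcome
    rewards of rollouts launched before vs. after that chunk.

    `rollout_from(prefix, n)` is a hypothetical helper: it samples n
    completions from the policy given `prefix` and returns their
    outcome rewards (e.g., 1.0 if the final answer is correct).
    """
    labels = []
    prefix = problem
    for chunk in chunks:
        # Estimated success rate from the state *before* the chunk...
        q_before = statistics.mean(rollout_from(prefix, n_rollouts))
        # ...versus the state *after* committing to the chunk.
        q_after = statistics.mean(rollout_from(prefix + chunk, n_rollouts))
        # One plausible decision rule: a chunk is labeled good if it
        # does not lower the estimated chance of a correct final answer.
        labels.append(1 if q_after >= q_before else 0)
        prefix += chunk
    return labels

def judge_reward(judge_output: str, target_label: int) -> float:
    """RL reward for the judge: 1.0 when its final verdict (emitted
    after its own reasoning tokens) matches the rollout-derived label.
    The Yes/No verdict format here is an invented convention."""
    verdict = 1 if judge_output.strip().endswith("Yes") else 0
    return 1.0 if verdict == target_label else 0.0
```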

Rules for CoT Trajectory Segmentation

Segmentation Principles

  1. Unified Purpose: A chunk should serve a single, clear objective. For example: setting up an initial equation, executing a self-contained calculation (like integration by parts), or stating a final/intermediate conclusion. All content within the chunk must directly serve this one core goal.

  2. Logical Cohesion: All lines within a chunk must form a continuous and uninterrupted logical flow. A new chunk should begin as soon as the focus or purpose of the reasoning shifts.

  3. Clear Transition: A new chunk must begin when the problem-solving process enters a new phase. This includes transitioning from “solving for a variable” to “verifying the answer,” or inserting an “explanatory side-note” into the main workflow.

Format Rules

  1. Use ... tags to mark the beginning and end of each segment. The text and newlines inside the tags must not be altered.

  2. The final output should only contain the tagged content, without any additional text, titles, or blank lines.

  3. You must preserve all original text and newlines exactly as they appear within the tags.

Current methods often segment reasoning trajectories using pre-defined tokens, like “Step 1, Step 2”, or simply using double line breaks as delimiters. However, these heuristics frequently result in segments that are neither logically complete nor self-contained. Each segment contains only limited information, making it unsuitable as a standalone unit for a judge model to evaluate effectively. We present a representative example in Table 2 (left), where the model tends to insert double line breaks before and after a mathematical equation. This splits an intuitively unified logical step into multiple chunks, where one chunk contains a textual explanation and the next contains the corresponding equation.
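To illustrate the heuristic being criticized (this is the baseline, not our method), a segmenter that treats every blank line as a step boundary can be written in a few lines. On an example in the style of Table 2 it splits one logical step into an explanation chunk and a separate equation chunk:

```python
def naive_segment(cot: str) -> list[str]:
    """Baseline heuristic: treat every blank line as a step boundary."""
    return [seg.strip() for seg in cot.split("\n\n") if seg.strip()]

# One logical step gets split into an explanation chunk and an
# equation chunk, mirroring the failure mode in Table 2 (left):
example = "We substitute x = 2 into the expression:\n\nf(2) = 4 + 3 = 7"
print(naive_segment(example))
# ['We substitute x = 2 into the expression:', 'f(2) = 4 + 3 = 7']
```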

Achieving better step definition via self-segmentation

To mitigate this issue, we propose a method to teach the model to generate its own reasoning chains and simultaneously self-segment them into more meaningful steps. First, we define the criteria for a high-quality reasoning step. The core idea is that each step should represent a complete logical leap or a self-contained part of the problem-solving process. Our definitions are given in Table 1. We then create our training data in two steps (a sketch follows the list):

  1. Generating a set of initial reasoning trajectories from the base model.

  2. Using an LLM prompted with our rules to automatically segment these trajectories into logically coherent steps.
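A minimal sketch of this two-step data-creation loop, under stated assumptions: `base_model.generate` and `segmenter_llm.segment` are hypothetical wrappers around the base policy and the rule-prompted LLM, and `SEGMENTATION_RULES` stands for the principles and format rules given above.

```python
SEGMENTATION_RULES = "..."  # the segmentation principles and format rules above

def build_self_segmentation_sft_data(problems, base_model, segmenter_llm):
    """Build SFT examples that teach the base model to self-segment.

    `base_model.generate` and `segmenter_llm.segment` are hypothetical
    wrappers: the first samples a CoT trajectory from the base policy,
    the second re-emits it with segment tags inserted according to
    SEGMENTATION_RULES, preserving the original text verbatim.
    """
    sft_examples = []
    for problem in problems:
        trajectory = base_model.generate(problem)              # step 1
        segmented = segmenter_llm.segment(SEGMENTATION_RULES,
                                          trajectory)          # step 2
        sft_examples.append({"prompt": problem,
                             "completion": segmented})
    return sft_examples
```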

We fine-tune our base model on this data, teaching it to generate its own reasoning chains and simultaneously self-segment them. This self-segmentation ability is crucial for two main reasons.

First, it produces more informative and logically complete steps, which gives our judge model better context and improves its evaluation accuracy. Second, this method significantly reduces the total number of steps per trajectory, which matters because, as we will show, annotating each step with a quality label is computationally expensive.
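For downstream use, the self-segmented output must be split back into chunks before each one is passed to the judge. A small sketch follows, assuming a hypothetical `<chunk>` tag; the actual delimiters are elided in the format rules above.

```python
import re

def extract_chunks(generation: str, tag: str = "chunk") -> list[str]:
    """Split a self-segmented trajectory into chunks for the judge.

    The tag name "chunk" is hypothetical; it stands in for whatever
    delimiters the segmentation SFT data actually uses. Text inside
    the tags is returned verbatim, per the format rules.
    """
    return re.findall(rf"<{tag}>(.*?)</{tag}>", generation, re.DOTALL)
```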