RM-R1: Reward Modeling as Reasoning

Paper · arXiv 2505.02387 · Published May 5, 2025

Reward modeling is essential for aligning large language models with human preferences through reinforcement learning from human feedback. To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. Inspired by recent advances in long chain-of-thought on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances the RM’s interpretability and performance. To this end, we introduce a new class of generative reward models – Reasoning Reward Models (ReasRMs) – which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. RM-R1 features a chain-of-rubrics (CoR) mechanism – self-generating sample-level chat rubrics or math/code solutions, and evaluating candidate responses against them. The training of RM-R1 consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. Empirically, our models achieve state-of-the-art performance on average across three reward model benchmarks, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%.

In real-world decision-making scenarios, accurate and grounded reward modeling often requires jointly conducting reasoning and reward assignment. This is because preference judgments inherently involve multifaceted cognitive considerations, such as inferring a judge’s latent evaluation criteria [5], navigating trade-offs among multiple criteria [23], and simulating potential consequences [33], all of which necessitate extensive reasoning. Figure 1 illustrates such a case, where a correct preference judgment requires accurate perception of the question and an understanding of the corresponding evaluation rubrics, supported by convincing arguments – closely mirroring how humans approach grading tasks. Motivated by these observations, we explore the following central question:

Can we cast reward modeling as a reasoning task?

In this work, we unleash the reasoning potential of RMs and propose a new class of models: Reasoning Reward Models (ReasRMs). Different from standard GenRMs, ReasRMs emphasize leveraging long and coherent reasoning chains during the judging process to enhance the model’s ability to assess and distinguish complex outputs accurately. We validate that integrating such long reasoning chains significantly enhances downstream reward model performance. We explore several strategies for adapting instruction-tuned language models into logically coherent ReasRMs. Notably, we find that solely applying reinforcement learning with verifiable rewards (RLVR) [12] in reward modeling does not fully realize the model’s reasoning capabilities. We also observe that plain chain-of-thought (CoT) reasoning falls short in perceiving the fine-grained distinctions across different question types.
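As a rough illustration of what a verifiable reward looks like in this reward-modeling setting, the sketch below (in Python) scores a judge rollout solely by whether its final verdict matches the ground-truth preference label. The `verifiable_reward` name, the `[[A]]`/`[[B]]` verdict format, and the ±1 reward values are illustrative assumptions, not the paper's exact implementation.

```python
import re

def verifiable_reward(rollout: str, gold_label: str) -> float:
    """Score one judge rollout against the ground-truth preference label.

    rollout:    the reward model's full generated reasoning plus final verdict
    gold_label: "A" or "B", the human-preferred candidate in this pair
    """
    # Assumption: the rollout ends with a verdict such as "[[A]]" or "[[B]]";
    # the tag format here is illustrative, not RM-R1's verbatim template.
    match = re.search(r"\[\[([AB])\]\]", rollout)
    if match is None:
        # Malformed or missing verdicts receive the lowest reward.
        return -1.0
    return 1.0 if match.group(1) == gold_label else -1.0
```

Because the reward depends only on agreement with a known preference label, it can be computed automatically for every sampled rollout during RL training.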

Through a series of studies, we design a training pipeline that introduces reasoning distillation prior to RLVR, ultimately resulting in the development of RM-R1. To fully elicit the reasoning capability of RM-R1 for reward modeling, we design a Chain-of-Rubrics (CoR) process. Specifically, the model categorizes the input sample into one of two categories: chat or reasoning. For chat tasks, the model generates a set of evaluation rubrics tailored to the specific question, justifications for those rubrics, and evaluations of the candidate responses against them. For reasoning tasks, correctness is the most important and generally preferred rubric, so we directly let the model first solve the problem itself before evaluating and picking the preferred response. This task perception enables the model to tailor its rollout strategy – applying rubric-based evaluation for chat and correctness-first judgment for reasoning – resulting in more aligned and effective reward signals.

In addition, we explore how to directly adapt existing reasoning models into reward models. Since these models have already undergone substantial reasoning-focused distillation, we fine-tune them using RLVR without additional distillation stages. Based on our training recipes, we produce RM-R1 models ranging from 7B to 32B.
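To make the CoR rollout described above concrete, below is a minimal sketch of how such a prompt might be structured, assuming a pairwise (Response A vs. Response B) judging format. The prompt wording, the verdict format, and the `build_cor_prompt` helper are illustrative assumptions, not the paper's released template.

```python
# Hypothetical single-turn prompt sketching a Chain-of-Rubrics rollout: the judge
# first classifies the task, then follows the matching branch. Wording and tags
# are illustrative, not RM-R1's verbatim template.
COR_PROMPT = """You are comparing two candidate responses to the same question.

Step 1. Classify the question as either a chat task or a reasoning (math/code) task.

Step 2a. If it is a chat task: write evaluation rubrics tailored to the question,
justify why each rubric matters, then grade Response A and Response B against
every rubric.

Step 2b. If it is a reasoning task: solve the problem yourself first, then compare
each response against your own solution, prioritizing correctness over all other
criteria.

Step 3. State your final verdict as [[A]] or [[B]].

Question: {question}
Response A: {response_a}
Response B: {response_b}
"""

def build_cor_prompt(question: str, response_a: str, response_b: str) -> str:
    """Fill the Chain-of-Rubrics template for one preference pair."""
    return COR_PROMPT.format(
        question=question, response_a=response_a, response_b=response_b
    )
```

During the RLVR stage, rollouts sampled from such a prompt can then be scored with a verdict-matching reward like the one sketched earlier.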