Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Paper · arXiv 2507.17746 · Published July 23, 2025
RLVR · Reward Models · Domain Specialization

Extending Reinforcement Learning with Verifiable Rewards (RLVR) to real-world tasks often requires balancing objective and subjective evaluation criteria. However, many such tasks lack a single, unambiguous ground truth, making it difficult to define reliable reward signals for post-training language models. While traditional preference-based methods offer a workaround, they rely on opaque reward functions that are difficult to interpret and prone to spurious correlations. We introduce Rubrics as Rewards (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a 28% relative improvement on HealthBench-1k compared to simple Likert-based approaches, while matching or surpassing the performance of reward signals derived from expert-written references. By treating rubrics as structured reward signals, we show that RaR enables smaller-scale judge models to better align with human preferences and sustain robust performance across model scales.

However, many real-world tasks lack such explicit verifiable answers, leaving models without direct reward feedback. In practice, researchers often turn to RLHF via preference ranking, collecting human judgments over pairs or lists of model outputs to fill this gap. While preference-based reward models can bootstrap performance, they tend to overfit superficial artifacts (e.g., response length, formatting quirks, annotator biases) (Singhal et al., 2023; Wang et al., 2024; Chen et al., 2024b; Ye et al., 2024; Gudibande et al., 2023) and require large volumes of pairwise comparisons (Ouyang et al., 2022), making them brittle and costly to scale.

To address this gap, we introduce Rubrics as Rewards (RaR), an approach to on-policy reinforcement learning that treats structured criteria, or rubrics, as the core reward mechanism. We use rubrics as checklists (Arora et al., 2025; Sirdeshmukh et al., 2025) composed of independent subgoals, allowing for automatable feedback aligned with expert intent. By decomposing “what makes a good response” into tangible, human-interpretable criteria, rubrics offer a middle ground between binary correctness signals and coarse preference rankings.
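
To make the checklist idea concrete, the sketch below shows one way a rubric could be collapsed into a scalar reward: each criterion is judged independently and the weighted pass rate becomes the reward. The `RubricCriterion` and `rubric_reward` names, the weighting scheme, and the `judge` callable are illustrative assumptions, not the exact interface used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricCriterion:
    """One independent subgoal of a checklist-style rubric (illustrative)."""
    description: str   # e.g. "advises consulting a clinician"
    weight: float      # relative importance assigned by the rubric author

def rubric_reward(
    response: str,
    rubric: List[RubricCriterion],
    judge: Callable[[str, str], bool],
) -> float:
    """Aggregate per-criterion judgments into a scalar reward in [0, 1].

    `judge` stands in for whatever checker decides if `response` satisfies
    a criterion (an LLM judge prompt, a regex, a domain verifier); the
    paper does not prescribe this interface.
    """
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if judge(response, c.description))
    return earned / total if total > 0 else 0.0

# Toy usage with a trivial substring "judge" (illustration only):
rubric = [RubricCriterion("dosage", 2.0), RubricCriterion("see a doctor", 1.0)]
print(rubric_reward("Take 200 mg; see a doctor if it persists.", rubric,
                    lambda resp, crit: crit in resp))
```

Keeping each criterion independent is what makes the signal both interpretable and automatable: individual checks can be audited or swapped without retraining an opaque reward model.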

Previous works train generative reward models that learn to evaluate reasoning or final outputs with interpretable scores (Chen et al., 2025; Whitehouse et al., 2025; Anugraha et al., 2025; Guo et al., 2025b), and some have even used a model’s internal confidence estimates as a proxy for reward (Zhao et al., 2025). Concurrently, recent efforts have extended verifiable datasets beyond STEM domains, broadening the applicability of RLVR methods to a wider range of tasks (Su et al., 2025b; Ma et al., 2025). However, developing a general-purpose approach for specifying reliable and scalable rewards remains an open challenge, particularly in settings where no single ground truth exists, or where both subjective and objective criteria must be considered to evaluate response quality. The Rubrics as Rewards strategy offers a flexible solution by repurposing structured evaluation criteria into multidimensional reward signals. Figure 1 illustrates our approach for generating rubrics and leveraging them as reward signals for on-policy training.
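
As a rough illustration of how such rubric scores could feed an on-policy objective like GRPO, the snippet below normalizes the rubric rewards of a group of sampled responses into group-relative advantages; this mean/std normalization is a common GRPO formulation and is assumed here rather than quoted from the paper.

```python
import numpy as np

def group_relative_advantages(rubric_scores: list[float]) -> np.ndarray:
    """Convert per-response rubric rewards for one prompt's sampled group
    into group-relative advantages (reward minus group mean, divided by
    group std), as is typical in GRPO-style training."""
    scores = np.asarray(rubric_scores, dtype=np.float64)
    return (scores - scores.mean()) / (scores.std() + 1e-8)

# Example: four responses to the same prompt, scored by rubric_reward above.
print(group_relative_advantages([0.25, 0.75, 1.0, 0.5]))
```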