Reinforcement Learning with Rubric Anchors
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI’s o-series. In RLVR, rewards are derived from deterministic, programmatically verifiable signals, such as passing unit tests in code generation or matching the correct numerical answer in mathematical reasoning. While effective, this requirement for unambiguous correctness largely confines RLVR to domains with clear, automatically checkable outcomes.
To overcome this limitation, we extend the RLVR paradigm beyond strictly verifiable domains by integrating open-ended tasks into the framework through rubric-based reward. In this approach, carefully designed rubrics serve as structured, model-interpretable criteria, enabling the automatic scoring of tasks with inherently subjective or multidimensional outputs. We construct, to our knowledge, the largest rubric reward system to date, comprising over 10,000 rubrics generated by humans, by various LLMs, or via a hybrid human–LLM collaboration.
Extending RLVR to open-ended tasks raises a fundamental challenge: how to construct reward signals that are both reliable and scalable in the absence of explicit ground truth.
Rubric-based reward offers a promising path forward: by defining structured, interpretable criteria for assessment, it can capture multi-dimensional aspects of response quality beyond binary correctness (Bai et al., 2022; Sun et al., 2023; Mu et al., 2024; Wang et al., 2025). While several concurrent works (Guan et al., 2024; Gunjal et al., 2025; Viswanathan et al., 2025; Li et al., 2025) have begun to explore this idea, our work systematically identifies the key components required for rubric-based rewards to be effective in RL training. Unsurprisingly, relying on a single rubric risks reward exploitation, whereas indiscriminately scaling the number of rubrics, whether generated by humans or LLMs, yields only marginal gains. To assess the full potential of our rubric-based training framework, we construct the largest rubric reward bank to date, containing over 10,000 rubrics. Throughout this process, we perform extensive empirical testing and find that success is not trivial: it hinges on the diversity, granularity, and quantity of the rubrics themselves, as well as on a proper training routine and meticulous data curation.
Our training routine adopts a two-stage RL process to progressively enhance model capabilities. The first stage builds a strong constraint-handling foundation through reliable instruction-following and high-quality critic development, using verifiable checks and static, multi-dimensional rubrics. The second stage targets more open-ended, socially grounded, and creative tasks, evaluated via high-quality references and instance-specific rubrics generated by stronger agentic workflows, fostering adaptability and richer expression.
We discover that there is no silver bullet for rubric construction. We perform careful ablation studies for every set of rubrics before integrating them into the training pipeline. The resulting rubrics span multiple scopes: some are grounded in a specific dataset, others are defined at the task level, and some are attached to individual data points, similar to the approach used in the HealthBench (Arora et al., 2025) evaluation. These rubrics are generated by human experts, by LLMs (either a self-critique model, Qwen3-30B-A3B (Yang et al., 2025), or the Gemini 2.5 Pro API (DeepMind, 2025)), or through an iterative combination of both.
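To make this organization concrete, the following sketch shows one plausible way to represent a rubric entry; the schema and field names are illustrative assumptions rather than the actual format of our rubric bank.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical schema for a single rubric entry; the fields are illustrative,
# not the actual format used in our reward bank.
@dataclass
class Rubric:
    criterion: str                                 # natural-language criterion the critic scores against
    scope: Literal["dataset", "task", "instance"]  # level at which the rubric is attached
    source: Literal["human", "llm", "hybrid"]      # how the rubric was authored
    weight: float = 1.0                            # relative importance during aggregation
    is_veto: bool = False                          # critical rubrics can nullify the whole reward

example = Rubric(
    criterion="The response does not fabricate citations or statistics.",
    scope="instance",
    source="hybrid",
    weight=1.5,
    is_veto=True,
)
```

Scores assigned to individual rubrics are then combined into a single reward through several mechanisms: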
• Veto Mechanisms: Failure on a critical, non-negotiable dimension (e.g., a reward-hacking detection rubric) can preemptively nullify rewards from all other dimensions, acting as a hard constraint.
• Saturation-Aware Aggregation: We use saturation functions to model the diminishing marginal returns of excelling in a single dimension beyond a certain threshold, encouraging balanced, multifaceted improvements.
• Pairwise Interaction Modeling: The framework can explicitly model synergistic or antagonistic effects between criteria, capturing complex relationships that a simple sum would ignore.
• Targeted Reward Shaping: We employ non-linear mapping functions to selectively amplify score differentials in high-performance regions. This enhances the discriminative power of the reward signal, where scores might otherwise be compressed, providing a more granular gradient for fine-grained optimization.
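To illustrate how these mechanisms could compose, the sketch below combines per-rubric scores with a veto check, a saturating transform, bilinear interaction terms, and a convex shaping map above a threshold; the specific functional forms and constants are our assumptions for illustration, not the exact formulas used in training.

```python
import math

def aggregate_reward(scores, weights, veto_flags, interactions=(),
                     sat_scale=0.8, shape_threshold=0.7, shape_gamma=2.0):
    """Combine per-rubric scores in [0, 1] into a scalar reward.

    scores / weights / veto_flags are parallel lists; `interactions` holds
    (i, j, coefficient) triples. All functional forms here (tanh saturation,
    bilinear interactions, convex shaping) are illustrative assumptions.
    """
    # Veto mechanism: failing a critical rubric nullifies the entire reward.
    if any(flag and s < 0.5 for s, flag in zip(scores, veto_flags)):
        return 0.0

    # Saturation-aware aggregation: diminishing returns beyond sat_scale
    # discourage over-optimizing a single dimension.
    saturated = [math.tanh(s / sat_scale) for s in scores]
    base = sum(w * s for w, s in zip(weights, saturated)) / sum(weights)

    # Pairwise interaction modeling: synergistic (positive) or antagonistic
    # (negative) coefficients between pairs of criteria.
    base += sum(c * saturated[i] * saturated[j] for i, j, c in interactions)
    base = min(max(base, 0.0), 1.0)

    # Targeted reward shaping: a convex map above the threshold steepens the
    # gradient where saturated scores would otherwise bunch together.
    if base > shape_threshold:
        frac = (base - shape_threshold) / (1.0 - shape_threshold)
        base = shape_threshold + (1.0 - shape_threshold) * frac ** shape_gamma
    return base
```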
Our training methodology is a multi-stage reinforcement learning (RL) protocol designed to progressively cultivate a spectrum of capabilities, from precise instruction-following to sophisticated creative and social reasoning. This sequential approach significantly reduces computational overhead while preserving scalability. All data employed in this framework is derived from a proprietary 900K+ instance corpus, curated from diverse sources including community Q&A forums, high-quality examinations, and general conversational datasets, with strategic sampling to ensure broad topical coverage.
Offline Data Filtering. A filtering protocol is applied prior to and between RL stages to ensure high-quality training data. For each candidate pool of instruction–rubric pairs, the base model generates responses, which are then scored by our critic models to obtain a full score distribution. We retain only the pairs whose scores fall within a calibrated central quantile—excluding overly high-scoring instances, which offer limited learning signal, and very low-scoring ones, which may be noisy or low-quality. This yields a balanced, high-potential subset, whose composition is further adjusted between stages to target specific capabilities.
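A minimal sketch of this quantile-based filter is shown below, assuming each candidate already carries a critic score for the base model's response; the quantile bounds are placeholders rather than the calibrated values used in practice.

```python
import numpy as np

def filter_by_quantile(candidates, scores, lo_q=0.2, hi_q=0.8):
    """Keep instruction-rubric pairs whose base-model response score falls
    in a central quantile band; lo_q / hi_q are illustrative placeholders."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = np.quantile(scores, [lo_q, hi_q])
    # Drop near-ceiling items (little learning signal) and near-floor items
    # (likely noisy or low-quality prompts/rubrics).
    keep = (scores >= lo) & (scores <= hi)
    return [c for c, k in zip(candidates, keep) if k]
```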
Stage-wise RL Training. During our experiments, we observe a “seesaw effect”: jointly training on different task types (e.g., strict constraint-following vs. open-ended creativity) often reduces overall performance, likely due to conflicting optimization objectives. As a pragmatic mitigation, we adopt a simple stage-wise RL schedule, without claiming it is a definitive solution.
In the first phase, we emphasize reliable instruction-following and multi-dimensional evaluation alignment, using programmatically verifiable checks and static rubrics to build a strong constraint-handling foundation. In the subsequent phase, we extend to more open-ended, socially grounded, and creative tasks, leveraging reference-based rubrics and instance-specific criteria generated via stronger agentic workflows to promote adaptability.
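The two-stage schedule can be pictured as a simple configuration like the sketch below; the stage names, keys, and reward-source labels are hypothetical and serve only to summarize the prose above.

```python
# Hypothetical two-stage schedule; names and keys are illustrative,
# not the actual training configuration.
STAGES = [
    {
        "name": "stage1_constraint_following",
        "reward_sources": ["verifiable_checks", "static_multidim_rubrics"],
    },
    {
        "name": "stage2_open_ended",
        "reward_sources": ["reference_based_rubrics", "instance_specific_rubrics"],
    },
]

for stage in STAGES:
    # Placeholder for launching the RL training loop for this stage.
    print(f"Running {stage['name']} with rewards from {stage['reward_sources']}")
```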
3.2 Adaptive Defense Against Reward Hacking
A significant challenge encountered during our experiments is the emergence of reward hacking, particularly in the initial RL stages, which focus on a small number of capabilities. We observe that the model can rapidly learn to exploit specific rubric criteria, resulting in spurious reward maximization without genuine capability improvement. To address this, we implement an adaptive defense strategy.
The process begins with an offline analysis of rollout data from these initial training runs. By examining instances where the reward signal is anomalously high, we systematically identify and categorize recurrent, high-level patterns of reward-hacking behavior. This empirical analysis informs the development of a dedicated Reward Hacking Defense Rubric (shown in Section A.1). This new rubric is not part of the initial training but is synthesized from the observed failure modes and integrated as a supervisory constraint in all subsequent, more complex RL stages.
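One plausible way to surface the anomalously high-reward rollouts that seed this analysis is a simple statistical outlier filter over logged rewards, sketched below; the z-score threshold and rollout fields are assumptions for illustration.

```python
import statistics

def flag_suspicious_rollouts(rollouts, z_threshold=2.5):
    """Flag rollouts whose reward is anomalously high relative to the batch.

    Each rollout is assumed to be a dict with a 'reward' key; the z-score
    threshold is an illustrative choice. Flagged rollouts are then inspected
    to categorize recurring reward-hacking patterns.
    """
    rewards = [r["reward"] for r in rollouts]
    mu, sigma = statistics.mean(rewards), statistics.pstdev(rewards)
    if sigma == 0:
        return []
    return [r for r in rollouts if (r["reward"] - mu) / sigma > z_threshold]
```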
The inclusion of this defense mechanism yields substantial improvements in training dynamics. It acts as a critical guardrail, preventing the policy from collapsing into reward-hacking states. This is evidenced by a marked increase in training stability: we are able to conduct longer and more productive training epochs, as the defense rubric mitigates the catastrophic reward spikes that previously rendered continued optimization ineffective. By actively penalizing the exploitation of scoring artifacts, this iterative refinement keeps the learning process focused on substantive capability enhancement.
4 Experimental Results
Our experimental results address three aspects:
• Quantitatively measuring the gains from rubric-based RL training on open-ended, human-centric benchmarks, including assessments of the model’s emotional intelligence (EQ) and its ability to produce human-like responses.
• Qualitatively analyzing how the model’s generated outputs evolve over time, illustrated through representative output showcases.
• Evaluating the impact of rubric-based RL training on general-ability benchmarks.