Pre-Trained Policy Discriminators are General Reward Models

Paper · arXiv 2507.05197 · Published July 7, 2025
Reinforcement Learning · Reward Models

Unlike traditional reward modeling methods that rely on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which provides a scalable, high-level optimization objective suited to modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance.

Instead of traditional absolute preference modeling, we propose redefining a reward model as a "policy discriminator". Specifically, by quantifying the difference between candidate policies and a given target policy, we establish a criterion-agnostic objective that naturally assigns higher scores to policies that are more "similar" to the desired target policy. This reward signal can guide the policy being trained toward desired behaviors during RL. Furthermore, since target policies can be chosen arbitrarily, this objective eliminates reliance on manually defined preferences and is applicable to any scenario, thus offering a scalable and fundamental pre-training paradigm for RMs. We refer to this training objective as Policy Discriminative Learning (POLAR), as illustrated in Figure 1.
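
To make the idea concrete, below is a minimal sketch of what a policy-discriminative pre-training loss could look like, assuming a Bradley-Terry style contrastive objective: the reward model is pushed to score a trajectory drawn from the same policy as the reference higher than trajectories drawn from other policies. The function name, tensor shapes, and the exact loss form are illustrative assumptions for exposition, not the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def policy_discriminative_loss(r_same: torch.Tensor, r_diff: torch.Tensor) -> torch.Tensor:
    """Contrastive (Bradley-Terry style) sketch of a policy-discriminative loss.

    r_same: (batch,)   scalar rewards for trajectories sampled from the target policy
    r_diff: (batch, k) scalar rewards for trajectories sampled from k other policies

    The loss increases the margin between the "same-policy" reward and each
    "different-policy" reward, so the RM learns to score samples closer to the
    target policy higher. (Shapes and naming are assumptions, not the paper's API.)
    """
    margins = r_same.unsqueeze(1) - r_diff          # (batch, k) pairwise margins
    return -F.logsigmoid(margins).mean()            # log-sigmoid ranking loss

# Toy usage: random scores stand in for a reward model's scalar outputs.
torch.manual_seed(0)
r_same = torch.randn(4, requires_grad=True)
r_diff = torch.randn(4, 3, requires_grad=True)
loss = policy_discriminative_loss(r_same, r_diff)
loss.backward()
print(float(loss))
```

Because the objective only asks "which candidate looks more like the target policy", it needs no manually annotated preference labels; any pool of policies can in principle supply positive and negative samples at pre-training scale.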