Can reward models learn by comparing policies instead of judging them?

What if reward models worked as policy discriminators—measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?

Note · 2026-05-18 · sourced from Reinforcement Learning

Traditional reward modeling presupposes an absolute preference: humans rank responses, the RM learns "good" vs "bad" in that absolute frame, the policy optimizes against that signal. The reliance on manually-defined preferences is exactly what limits scale — every new task domain demands new preference data.
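For reference, the absolute frame is commonly formalized as a Bradley-Terry pairwise loss over human-labeled comparisons. The note does not name a specific loss, so the notation below is an assumption: $r_\phi$ is the reward model, $x$ a prompt, $y_w$ and $y_l$ the human-preferred and rejected responses, and $\mathcal{D}$ the annotated preference dataset.

```latex
\mathcal{L}_{\mathrm{BT}}(\phi)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]
```

Every triple in $\mathcal{D}$ requires a human judgment about which response is better, which is where the per-domain annotation cost comes from.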

POLAR (arXiv:2507.05197) redefines what an RM is. Instead of acting as an absolute preference predictor, the RM is treated as a policy discriminator: given a candidate policy and a target policy, it quantifies the difference between them. Higher scores go to policies more similar to the target. The reward signal guides the training policy toward desired behaviors without ever encoding what those behaviors should be in absolute terms.
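The discriminator idea can be illustrated with a toy sketch. This is not POLAR's model, which is a learned network. Here policies are assumed to be small categorical distributions over actions, and a hand-written Jensen-Shannon divergence stands in for the learned difference measure. The names `js_divergence` and `relative_reward` are hypothetical.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions over the same keys."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and zero only when p equals q."""
    keys = set(p) | set(q)
    p = {k: p.get(k, 0.0) for k in keys}
    q = {k: q.get(k, 0.0) for k in keys}
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def relative_reward(candidate, target):
    """Toy stand-in for a policy discriminator: higher means closer to target."""
    return -js_divergence(candidate, target)

# Hypothetical policies, expressed as action probabilities.
target = {"a": 0.7, "b": 0.2, "c": 0.1}
near = {"a": 0.6, "b": 0.3, "c": 0.1}
far = {"a": 0.1, "b": 0.1, "c": 0.8}

scores_vs_target = {
    "near": relative_reward(near, target),
    "far": relative_reward(far, target),
}
```

Nothing in `relative_reward` says which behavior is good. Passing `far` as the target instead would reverse the ranking, which is the sense in which the score is relative to a chosen reference.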

The shift is consequential. Since target policies can be arbitrarily chosen, the objective becomes criterion-agnostic — it applies to any scenario where you can describe the desired policy by demonstration rather than by attribute. This eliminates the bottleneck of preference annotation and creates a scalable pre-training paradigm for RMs. Train once on policy discrimination; reuse across many task formulations by varying the target.

The empirical claim is strong: POLAR RMs at 1.8B to 7B parameters substantially outperform traditional, non-pre-trained reward models, and the improvement carries over to downstream use. The relative framing makes the RM transferable in a way absolute-preference RMs are not.

The deeper move is conceptual. Under POLAR, a reward model is not a value judgment; it is a similarity measure to a chosen reference. This connects to the note "Can models learn what makes research worth doing?": both treat reward as a relational construct (similarity to a reference, ranking within a community) rather than an absolute property. The dominant RLHF paradigm trains RMs to encode "what humans want"; POLAR trains them to encode "how close are you to this." The latter scales because it admits any reference policy as target.

A concrete consequence: POLAR fits naturally into the verifier-free RL pattern emerging in late-2025 work. When the target policy is given (e.g., as a demonstration set), no manual preference labels are needed. This is the same move RARO makes via adversarial IRL, since both reject the labeled-preference bottleneck, but the mechanisms differ: POLAR scores similarity to a target policy, which makes it general-purpose, while RARO relies on an adversarial game.
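A minimal sketch of the demonstration-as-target case follows. It assumes the target policy is available only as a set of demonstration responses, and it uses token-set Jaccard overlap as a crude stand-in for a learned discriminator. The function names and example strings are hypothetical and are not taken from the paper.

```python
def token_set(text):
    """Lowercased whitespace tokens of a response."""
    return set(text.lower().split())

def jaccard(a, b):
    """Token-set overlap in [0, 1]; 1.0 when both responses are empty."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def demo_reward(candidate, demonstrations):
    """Mean similarity of one candidate response to the demonstration set."""
    if not demonstrations:
        raise ValueError("at least one demonstration is required")
    total = sum(jaccard(candidate, d) for d in demonstrations)
    return total / len(demonstrations)

def rank_candidates(candidates, demonstrations):
    """Order candidates from most to least similar to the demonstrations."""
    return sorted(
        candidates,
        key=lambda c: demo_reward(c, demonstrations),
        reverse=True,
    )

# Hypothetical demonstrations from the target policy; no preference labels.
demos = [
    "the cache is invalidated on every write",
    "writes invalidate the cache immediately",
]
candidates = [
    "bananas are yellow",
    "the cache is invalidated on write",
]
ranked = rank_candidates(candidates, demos)
```

The only supervision here is the demonstration set itself. Swapping in a different set changes the ranking without any new annotation, which is the property the note attributes to the relative framing.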

Paper: Pre-Trained Policy Discriminators are General Reward Models

Related concepts in this collection

Can models learn what makes research worth doing? Can large language models be trained to recognize high-impact research directions by learning from citation patterns? This explores whether 'scientific taste'—the judgment of what work matters—is a learnable skill separate from execution.
both reframe reward as relational: similarity-to-target (POLAR) vs ranking-within-community (RLCF) — neither requires absolute preference labels
Can adversarial critics replace task-specific verifiers for reasoning? Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.
RARO uses adversarial discrimination against demonstrations; POLAR uses similarity to a target policy; same anti-labeled-preference move, different mechanism
Can generative reasoning beat discriminative models with less training data? Do process reward models that generate reasoning before judging achieve better performance than traditional discriminative approaches when trained on dramatically smaller datasets? This tests whether generative verification can scale more efficiently.
generative PRMs add reasoning before judging; POLAR adds relative framing — orthogonal axes of RM improvement

Original note title

reward models redefined as policy discriminators measure distance from a target policy — criterion-agnostic and scalable
