Reasoning and Learning Architectures

Can reward models learn by comparing policies instead of judging them?

What if reward models worked as policy discriminators—measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?

Note · 2026-05-18 · sourced from Reinforcement Learning
How well do reward models actually evaluate AI reasoning?

Traditional reward modeling presupposes an absolute preference: humans rank responses, the RM learns "good" vs "bad" in that absolute frame, and the policy optimizes against that signal. This reliance on manually defined preferences is exactly what limits scale: every new task domain demands new preference data.

POLAR (2507.05197) redefines what an RM is. Instead of an absolute preference predictor, treat the RM as a policy discriminator: given a candidate policy and a target policy, quantify the difference. Higher scores go to policies more similar to the target. The reward signal guides the training policy toward desired behaviors without ever encoding what those behaviors should be in absolute terms.
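
To make the interface concrete, here is a minimal Python sketch of a reward computed relative to a reference rather than on an absolute scale. It is not POLAR's implementation: the names `relative_reward` and `toy_scorer` are hypothetical, and the scorer is a word-overlap placeholder standing in for a learned discriminator.

```python
from typing import Callable

# Hypothetical scorer signature: (prompt, candidate, reference) -> score.
Scorer = Callable[[str, str, str], float]


def toy_scorer(prompt: str, candidate: str, reference: str) -> float:
    """Placeholder for a learned discriminator (assumption, not POLAR's model).

    Returns the Jaccard overlap of the two outputs' word sets. The prompt is
    accepted only to mirror the interface and is ignored here.
    """
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    if not cand and not ref:
        return 1.0
    return len(cand & ref) / len(cand | ref)


def relative_reward(
    prompt: str,
    candidate: str,
    reference: str,
    scorer: Scorer = toy_scorer,
) -> float:
    """Reward = closeness of the candidate output to the target policy's output."""
    return scorer(prompt, candidate, reference)


if __name__ == "__main__":
    ref = "the capital of France is Paris"
    for cand in ("Paris is the capital of France", "the capital is Lyon", "I do not know"):
        print(cand, "->", relative_reward("What is the capital of France?", cand, ref))
```

The point of the sketch is the signature: the reward function never sees a label saying which answer is good, only a reference to compare against. Swapping the reference changes what counts as high reward without retraining the scorer.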

The shift is consequential. Since target policies can be arbitrarily chosen, the objective becomes criterion-agnostic — it applies to any scenario where you can describe the desired policy by demonstration rather than by attribute. This eliminates the bottleneck of preference annotation and creates a scalable pre-training paradigm for RMs. Train once on policy discrimination; reuse across many task formulations by varying the target.
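
One way to picture "training on policy discrimination" is a pairwise objective: outputs sampled from the same policy should score as closer than outputs from different policies. The sketch below assumes a Bradley-Terry style loss and a simple triple-construction scheme; this note does not specify the training objective, so both are illustrative assumptions, and `make_triples` and `bt_loss` are hypothetical names.

```python
import math
from typing import Dict, List, Tuple


def make_triples(samples: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Build (anchor, positive, negative) triples for one prompt.

    `samples` maps a policy name to outputs it produced for the same prompt.
    Anchor and positive come from the same policy; the negative comes from a
    different policy. No human preference label is involved.
    """
    triples = []
    for name, outputs in samples.items():
        if len(outputs) < 2:
            continue  # need two samples from the same policy
        anchor, positive = outputs[0], outputs[1]
        for other, other_outputs in samples.items():
            if other != name and other_outputs:
                triples.append((anchor, positive, other_outputs[0]))
    return triples


def bt_loss(score_same: float, score_diff: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(score_same - score_diff)).

    Computed as softplus(-margin) in a numerically stable form.
    """
    margin = score_same - score_diff
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    samples = {
        "policy_a": ["answer a1", "answer a2"],
        "policy_b": ["answer b1", "answer b2"],
    }
    print(make_triples(samples))
    print(bt_loss(2.0, 0.0), bt_loss(0.0, 2.0))
```

Under these assumptions the supervision signal is just "which policy produced this output", which is available for free whenever outputs are sampled from known policies. That is what lets the objective stay criterion-agnostic.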

The empirical claim is strong: POLAR RMs at 1.8B to 7B parameters substantially outperform traditional non-pre-trained methods, and the improvement carries over to downstream RM performance. The relative framing makes the RM transferable in a way absolute-preference RMs are not.

The deeper move is conceptual. A reward model is not a value judgment; it is a similarity measure to a chosen reference. This connects to the note "Can models learn what makes research worth doing?": both treat reward as a relational construct (similarity to a reference, ranking within a community) rather than an absolute property. The dominant RLHF paradigm trained RMs to encode "what humans want"; POLAR trains them to encode "how close are you to this." The latter scales because it admits any reference policy as target.

A concrete consequence: POLAR fits naturally into the verifier-free RL pattern emerging in late-2025 work. When the target policy is given (e.g., as a demonstration set), no manual preference labels are needed. This is the same move RARO makes via adversarial IRL — both reject the labeled-preference bottleneck — but POLAR's relative framing is general-purpose where RARO is adversarial.
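
As a rough sketch of that label-free setup, the snippet below scores a group of sampled rollouts against a single demonstration per prompt and centers the scores within the group. Everything here is an assumption for illustration: `demo_rewards` is a hypothetical helper, `difflib.SequenceMatcher` is a character-level stand-in for a learned discriminator, and mean-centering is one common baseline choice rather than something this note attributes to POLAR.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(candidate: str, demonstration: str) -> float:
    """Stand-in scorer (assumption): character-level similarity in [0, 1]."""
    return SequenceMatcher(None, candidate, demonstration).ratio()


def demo_rewards(prompt: str, rollouts: List[str], demos: Dict[str, str]) -> List[float]:
    """Score each rollout against the demonstration for `prompt`.

    The training data is only (prompt, demonstration) pairs; no rollout is
    ever labeled as preferred or rejected. Scores are mean-centered within
    the group so that above-average rollouts get positive reward.
    """
    demonstration = demos[prompt]
    scores = [similarity(r, demonstration) for r in rollouts]
    baseline = sum(scores) / len(scores)
    return [s - baseline for s in scores]


if __name__ == "__main__":
    demos = {"2+2?": "The answer is 4."}
    rollouts = ["The answer is 4.", "Bananas are yellow.", "The answer is 5."]
    for rollout, reward in zip(rollouts, demo_rewards("2+2?", rollouts, demos)):
        print(f"{reward:+.3f}  {rollout}")
```

A character-level scorer rewards surface similarity, so the near-miss "The answer is 5." still scores well here. A learned discriminator is meant to replace exactly this component.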


Paper: Pre-Trained Policy Discriminators are General Reward Models

Original note title

reward models redefined as policy discriminators measure distance from a target policy — criterion-agnostic and scalable