How do adversarial IRL and policy discrimination differ in rejecting preference labels?
This explores two different ways of training reward signals that throw out human preference labels — RARO's adversarial critic game and POLAR's policy-distance measurement — and what each is actually rejecting.
Both methods refuse the usual recipe, in which humans rank output A over output B and a reward model learns to reproduce those rankings. They differ in why they reject it and in what they substitute.
Adversarial inverse RL, as in RARO (Can adversarial critics replace task-specific verifiers for reasoning?), rejects preference labels by setting up a game. A critic learns to tell expert demonstrations apart from the policy's own answers, and the policy trains to fool it. There is no human ranking and no domain-specific verifier: the reward emerges from the critic's ongoing attempt to spot the difference. What RARO rejects is the external grader, not supervision altogether, since it still needs expert demonstrations to discriminate against. Because the signal comes from the adversarial dynamic itself, the method works across tasks as different as Countdown math and poetry writing.
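The critic-versus-policy loop described above can be sketched with a toy example. This is a generic discriminator-as-reward sketch, not RARO's actual objective or architecture: the `ToyCritic` class, the single scalar "feature" standing in for an answer, and the learning rate are all assumptions made for illustration. In the real game the policy also updates between critic steps; here the policy samples are held fixed to keep the example short.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class ToyCritic:
    """Logistic critic over one scalar feature (hypothetical stand-in for an LLM critic)."""

    def __init__(self) -> None:
        self.w = 0.0
        self.b = 0.0

    def prob_expert(self, feature: float) -> float:
        # Critic's belief that this answer came from the expert demonstrations.
        return sigmoid(self.w * feature + self.b)

    def update(self, expert_feats, policy_feats, lr: float = 0.5) -> None:
        # One gradient step on binary cross-entropy: expert -> 1, policy -> 0.
        examples = [(f, 1.0) for f in expert_feats] + [(f, 0.0) for f in policy_feats]
        grad_w = grad_b = 0.0
        for f, y in examples:
            err = self.prob_expert(f) - y
            grad_w += err * f
            grad_b += err
        n = len(examples)
        self.w -= lr * grad_w / n
        self.b -= lr * grad_b / n


def policy_reward(critic: ToyCritic, feature: float) -> float:
    # The policy is rewarded for answers the critic cannot dismiss as non-expert.
    return critic.prob_expert(feature)


# Illustrative data (assumed): expert answers cluster near +1, policy answers near -1.
expert = [1.0, 1.2, 0.9]
policy = [-1.0, -0.8, -1.1]

critic = ToyCritic()
for _ in range(200):
    critic.update(expert, policy)

reward_expert_like = policy_reward(critic, 1.0)
reward_policy_like = policy_reward(critic, -1.0)
```

After training, an answer that looks like the expert data earns a higher reward than one that looks like the current policy's output. No ranking or task-specific verifier appears anywhere in the loop, only the two pools of samples.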
POLAR (Can reward models learn by comparing policies instead of judging them?) rejects preference labels from the opposite direction. Instead of an adversary, it measures distance: a reward model scores how close a policy's behavior sits to a chosen target policy, with higher scores for greater similarity. There is no game and no adversarial pressure between expert and policy, just a learned notion of "how far are you from where I want you." It replaces absolute preference judgments with relative positioning between policies, and because this objective can be pre-trained, the resulting reward models transfer across task formulations.
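The "proximity to a target policy" idea can be illustrated with a small sketch. POLAR's reward model is learned; the fixed total variation distance used here is only a stand-in chosen for clarity, and the function names and the representation of a policy as a dictionary of action probabilities are assumptions, not POLAR's actual interface.

```python
def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def proximity_reward(candidate: dict, target: dict) -> float:
    # Higher reward the closer the candidate policy sits to the target policy.
    # Stand-in for a learned reward model (assumption for illustration).
    return 1.0 - total_variation(candidate, target)


# Illustrative policies over three actions (assumed values).
target = {"a": 0.7, "b": 0.2, "c": 0.1}
near = {"a": 0.6, "b": 0.3, "c": 0.1}
far = {"a": 0.1, "b": 0.1, "c": 0.8}

reward_self = proximity_reward(target, target)
reward_near = proximity_reward(near, target)
reward_far = proximity_reward(far, target)
```

The ordering `reward_self > reward_near > reward_far` is the whole signal: nothing says whether any single output is good in absolute terms, only how far each policy is from the reference.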
So the contrast is sharp: the adversarial route generates its signal from a moving discriminator trying to catch a moving policy (dynamic, self-renewing, no fixed target), while policy discrimination generates its signal from a static distance to a fixed reference policy (stable, transferable, no adversary). One says "reward is whatever the critic can't dismiss as fake"; the other says "reward is proximity to a known-good policy."
It is worth knowing why anyone wants to escape preference labels in the first place. Human annotations are not a clean signal: they decompose into genuine preferences, non-attitudes, and answers constructed on the spot, and treating all three alike contaminates the reward model (Do all annotation responses measure the same underlying thing?). Reward models trained on human approval can also teach models to optimize for sounding good over being right, producing indifference to truth rather than commitment to it (Does RLHF make language models indifferent to truth?). Both label-free methods sidestep that contamination. A third cousin, Test-Time RL (Can models improve themselves using only majority voting?), does so in yet another way, manufacturing rewards from majority-vote consensus rather than from an adversary or a target policy.
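The majority-vote reward mentioned for Test-Time RL is simple enough to sketch directly. This is a minimal illustration of consensus-based rewards under assumed conventions (string answers, a binary 1.0/0.0 reward, ties broken by first occurrence); the function name is hypothetical and the actual method may differ in details.

```python
from collections import Counter


def majority_vote_rewards(answers: list) -> tuple:
    """Return the consensus answer and a per-sample reward (1.0 if it matches)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common breaks ties by first occurrence in the input.
    consensus, _count = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == consensus else 0.0 for a in answers]
    return consensus, rewards


# Five sampled answers to the same prompt (illustrative values).
samples = ["42", "42", "41", "42", "7"]
consensus, rewards = majority_vote_rewards(samples)
```

Here the reward comes from agreement among the model's own samples, so no ground-truth label, adversary, or target policy is involved.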
Sources (5 notes)
RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.
POLAR reframes reward modeling as policy discrimination: RMs assign higher scores to policies similar to a chosen target, eliminating absolute preference labels. Pre-trained 1.8B-7B parameter POLAR RMs substantially outperform non-pre-trained methods and transfer across task formulations.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.
Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.