INQUIRING LINE

How do relational reward signals compare to absolute preference encodings in RL?

This explores the contrast between reward models that score by *comparison* (how close a policy is to some target) and the traditional approach of learning from fixed, absolute preference labels ("this answer is good, that one is bad").


*Relational* reward signals score an output by how close it sits to a reference policy or to other candidates; *absolute* preference encodings give each response a fixed quality label that the model tries to reproduce. The clearest statement of the relational view in the corpus is POLAR, which reframes reward modeling as policy discrimination: instead of memorizing absolute preference labels, the reward model measures distance from a chosen target policy and scores outputs higher the more they resemble it (see "Can reward models learn by comparing policies instead of judging them?"). The striking practical claim is that reward models pre-trained this way (1.8B to 7B parameters) substantially outperform non-pre-trained methods and transfer across task formulations, which suggests that *distance from a target* is a more learnable and portable signal than *intrinsic goodness*.

That advantage shows up again wherever the corpus replaces an explicit scalar reward with a comparison. Verifier-free RL is converging on exactly this move: pairwise self-judgment (is A better than B?) substitutes for the trained reward model, and internal belief-shift substitutes for the critic. Both are relational quantities computed from the policy's own behavior rather than absolute labels handed in from outside (see "Can language models replace reward models with internal signals?"). A related result is that *negative-only* reinforcement, which penalizes wrong trajectories without rewarding right ones, often matches full PPO and GRPO on Pass@k: suppressing incorrect trajectories preserves diversity, whereas positive-only reinforcement concentrates probability mass on a few answers and degrades performance at higher k (see "Does negative reinforcement alone outperform full reinforcement learning?"). Both point the same direction: the *relation between* candidates often carries more usable signal than any one absolute score.
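To make the pairwise idea concrete, here is a minimal sketch of turning "is A better than B?" judgments into a per-candidate score. The function name `pairwise_win_rates` and the `prefers` callable are hypothetical names chosen for illustration; in the verifier-free setting the judge would be the policy itself, whereas here any two-argument function stands in for it.

```python
from itertools import permutations
from typing import Callable, List


def pairwise_win_rates(
    candidates: List[str],
    prefers: Callable[[str, str], bool],
) -> List[float]:
    """Score each candidate by the fraction of pairwise comparisons it wins.

    `prefers(a, b)` is an assumed judge returning True when `a` is judged
    better than `b`. No absolute quality label is used anywhere: a score
    only has meaning relative to the other candidates in the list.
    """
    n = len(candidates)
    if n < 2:
        raise ValueError("need at least two candidates to compare")

    wins = [0] * n
    # Ask the judge about every ordered pair (i, j) with i != j.
    for i, j in permutations(range(n), 2):
        if prefers(candidates[i], candidates[j]):
            wins[i] += 1

    # Each candidate appears first in exactly n - 1 comparisons.
    return [w / (n - 1) for w in wins]


if __name__ == "__main__":
    # Toy stand-in judge: prefers the longer string.
    scores = pairwise_win_rates(["a", "bbb", "cc"], lambda x, y: len(x) > len(y))
    print(scores)  # [0.0, 1.0, 0.5]
```

Because the scores depend only on the comparison set, adding or removing a candidate changes everyone's score, which is the defining property of a relational signal.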

But the corpus also shows where absolute encodings remain essential, and where the relational/absolute split is the wrong axis entirely. The deeper problem with many absolute encodings is that a single scalar throws information away. Agent feedback decomposes into *evaluative* ("how good was this?") and *directive* ("how should it change?") components, and a scalar reward, whether relational or absolute, captures only the first (see "Can scalar rewards capture all the information in agent feedback?"). Natural-language critiques break through reasoning plateaus precisely because numerical rewards of any kind can't say *why* something failed or how to improve (see "Can natural language feedback overcome numerical reward plateaus?"). So the richest signals aren't more cleverly relational; they're feedback that escapes the scalar bottleneck altogether.

There's also a calibration cost hiding in crude absolute encodings. Binary correctness rewards encourage confident guessing because they never penalize a confident wrong answer; adding a relational scoring term, the Brier score, which compares stated confidence against the outcome, repairs calibration without trading off accuracy (see "Does binary reward training hurt model calibration?"). Ternary rewards make the same kind of move structurally: by distinguishing correct, hallucinated, and abstained answers, they let the model learn *when not to answer*, which a flat binary signal can't express (see "Can three-way rewards fix the accuracy versus abstention problem?"). The lesson cutting across these: the win usually comes not from "relational beats absolute" but from giving the reward *more structure* (more categories, a comparison axis, a directive channel) than a single absolute number allows.
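The three reward shapes discussed above can be sketched side by side. This is an illustration under stated assumptions, not the cited methods' exact formulation: combining the terms as "correctness minus Brier penalty" is one plausible reading of "adding the Brier score as a second reward term", and the abstention value of 0.0 is an assumed placeholder for the "intermediate" reward. All function names are hypothetical.

```python
def binary_reward(correct: bool) -> float:
    """Flat binary signal: ignores how confident the model said it was."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2.

    Assumption: the two terms are combined by simple subtraction.
    `confidence` is the model's stated probability of being correct.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def ternary_reward(outcome: str, abstain_reward: float = 0.0) -> float:
    """Three-way signal: correct +1, hallucinated -1, abstained in between.

    Assumption: 0.0 for abstention; the source only says "intermediate".
    """
    table = {
        "correct": 1.0,
        "hallucinated": -1.0,
        "abstained": abstain_reward,
    }
    if outcome not in table:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return table[outcome]


if __name__ == "__main__":
    # Binary reward cannot tell a confident miss from a hedged miss.
    print(binary_reward(False), binary_reward(False))
    # The Brier term penalizes the confident miss far more heavily.
    print(brier_augmented_reward(False, 0.9), brier_augmented_reward(False, 0.1))
    # The ternary reward makes abstaining better than hallucinating.
    print(ternary_reward("abstained") > ternary_reward("hallucinated"))
```

The Brier score is a proper scoring rule, so its expected penalty is smallest when the stated confidence equals the actual probability of being correct; that is why a wrong answer at confidence 0.9 costs 0.81 here while the same wrong answer at confidence 0.1 costs only 0.01.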

Worth knowing for the curious: there's an even more upstream complication. Before you choose relational or absolute, the human annotations feeding either one aren't a uniform substance. They decompose into genuine preferences, non-attitudes, and on-the-spot constructed preferences, and treating them identically quietly contaminates whatever reward model you build (see "Do all annotation responses measure the same underlying thing?"). And separating *what to optimize* from *what's feasible*, by using rubrics as accept/reject gates rather than as dense reward values, turns out to prevent reward hacking better than folding rubric scores into the reward (see "Can rubrics and dense rewards work together without hacking?"). The relational-vs-absolute question, in other words, is one slice of a larger one: what *shape* of signal does the policy actually need?
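The rubric-as-gate idea can be sketched as follows. This is one possible reading rather than the cited method's actual procedure: the acceptance rule (a minimum fraction of rollouts passing), the `passes_rubric` callable, the `min_pass_fraction` parameter, and the group-normalized advantage are all assumptions introduced for illustration. The point it shows is only that the rubric decides whether a group is used, and never enters the reward value itself.

```python
from statistics import mean, pstdev
from typing import Callable, List, Optional


def gated_group_advantages(
    rollouts: List[str],
    dense_rewards: List[float],
    passes_rubric: Callable[[str], bool],
    min_pass_fraction: float = 0.5,
) -> Optional[List[float]]:
    """Return advantages for an accepted rollout group, or None if rejected.

    The rubric acts purely as a gate (hypothetical acceptance rule: at least
    `min_pass_fraction` of the rollouts must pass). Advantages are computed
    from `dense_rewards` alone, so rubric scores cannot be gamed as reward.
    """
    if not rollouts or len(rollouts) != len(dense_rewards):
        raise ValueError("rollouts and dense_rewards must be non-empty and aligned")

    pass_fraction = mean([1.0 if passes_rubric(r) else 0.0 for r in rollouts])
    if pass_fraction < min_pass_fraction:
        return None  # group rejected: contributes no training signal

    # Group-relative normalization of the dense rewards (an assumed choice).
    mu = mean(dense_rewards)
    sigma = pstdev(dense_rewards)
    if sigma == 0.0:
        return [0.0] * len(dense_rewards)
    return [(r - mu) / sigma for r in dense_rewards]


if __name__ == "__main__":
    group = ["ok one", "ok two", "bad"]
    rewards = [1.0, 3.0, 2.0]
    rubric = lambda text: text.startswith("ok")  # toy stand-in rubric
    print(gated_group_advantages(group, rewards, rubric))       # accepted
    print(gated_group_advantages(group, rewards, rubric, 0.9))  # None
```

Keeping the gate categorical means a policy cannot raise its reward by drifting toward whatever the rubric scorer happens to like; it can only be admitted or excluded.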


Sources (9 notes)

Can reward models learn by comparing policies instead of judging them?

POLAR reframes reward modeling as policy discrimination: RMs assign higher scores to policies similar to a chosen target, eliminating absolute preference labels. Pre-trained 1.8B-7B parameter POLAR RMs substantially outperform non-pre-trained methods and transfer across task formulations.

Can language models replace reward models with internal signals?

Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
