Why does a relativistic critic outperform absolute scoring in adversarial reasoning training?
This explores why a critic that judges answers comparatively — distinguishing expert outputs from a policy's own attempts — tends to beat a critic that hands out fixed absolute scores, in the context of adversarial RL for reasoning.
A relativistic critic (one that asks 'is this answer better or worse than the alternative?') outperforms one that assigns an absolute number, and the corpus points to a consistent culprit: absolute scoring is brittle in exactly the places reasoning training stresses it most. The clearest anchor is RARO's adversarial setup Can adversarial critics replace task-specific verifiers for reasoning?, where a critic learns to discriminate expert answers from the policy's answers rather than verify correctness against a fixed rubric. Because the critic only has to judge *relative* quality, it sidesteps the need for a domain-specific verifier. Crucially, the target it's chasing moves with the policy, so there's never a fixed score to game.
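As a rough illustration of the relative-judgment idea, the sketch below computes a critic loss and a policy reward from the *difference* between two scores. It is not RARO's actual objective, which the notes here do not spell out. The `Scorer` interface, `relativistic_step`, and the length-based `toy_score` are all hypothetical names introduced for this example.

```python
import math
from typing import Callable, Tuple

# Hypothetical critic interface: maps (question, answer) to an unnormalized score.
Scorer = Callable[[str, str], float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def relativistic_step(score: Scorer, question: str,
                      expert_answer: str, policy_answer: str) -> Tuple[float, float]:
    """Return (critic_loss, policy_reward) for one expert/policy pair.

    Only the score difference enters, so there is no fixed threshold to hit.
    """
    margin = score(question, expert_answer) - score(question, policy_answer)
    p_expert_preferred = sigmoid(margin)
    critic_loss = -math.log(p_expert_preferred)  # critic wants to prefer the expert
    policy_reward = 1.0 - p_expert_preferred     # policy wants to pass as the expert
    return critic_loss, policy_reward


def toy_score(question: str, answer: str) -> float:
    # Placeholder for a learned critic (assumption): favors longer answers.
    return 0.5 * len(answer.split())


def shifted_score(question: str, answer: str) -> float:
    # The same critic with every score inflated by a constant.
    return toy_score(question, answer) + 100.0


question = "What is 3 * 4?"
expert = "multiply three by four to get 12"
policy = "12"

loss, reward = relativistic_step(toy_score, question, expert, policy)
loss_shifted, reward_shifted = relativistic_step(shifted_score, question, expert, policy)
print(reward, reward_shifted)
```

Inflating every score by a constant leaves the reward unchanged, which is one way to see why there is no absolute number for the policy to chase: it is paid only for closing the gap to the expert answer, and that gap is re-estimated as the critic keeps training.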
Why does a fixed score get gamed? Two notes show the failure mechanically. Binary correctness rewards, the purest form of absolute scoring, provably wreck calibration: a confident wrong answer costs exactly as much as a humble wrong one, so the optimal move is to guess loudly Does binary reward training hurt model calibration?. And group-relative normalization, which scores each sample against the others in its group and might be expected to soften this, backfires when successes are sparse: a rare accidental success on a near-impossible problem gets treated as a high-advantage trajectory, and the model learns shortcuts and answer-repetition instead of reasoning Do overly hard RLVR samples actually harm model capabilities?. Absolute targets reward whatever crosses the threshold, including degenerate paths.
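Both failure modes can be seen with a few lines of arithmetic. The sketch below is illustrative only: the Brier-augmented reward is assumed to take the form "correctness minus squared gap between confidence and outcome", and the group normalization is assumed to be a plain mean and standard-deviation rescaling. The function names and the 16-rollout group are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def binary_reward(correct: bool, confidence: float) -> float:
    """Absolute 0/1 reward: the stated confidence is ignored entirely."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the squared gap between confidence and outcome (assumed form)."""
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Rescale each reward by the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A wrong answer earns the same binary reward at any confidence level.
print(binary_reward(False, 0.99), binary_reward(False, 0.51))
# With the Brier term, the confident wrong answer is penalized more heavily.
print(brier_augmented_reward(False, 0.99), brier_augmented_reward(False, 0.51))

# One lucky success among 16 rollouts on a near-impossible problem.
rewards = [0.0] * 15 + [1.0]
advantages = group_relative_advantages(rewards)
print(round(advantages[-1], 2), round(advantages[0], 2))
```

In this toy group the single lucky rollout receives an advantage of roughly the square root of 15, while each failure is pushed down only slightly. Whatever that one trajectory happened to do, shortcut or not, is what gets reinforced.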
The deeper reason a relativistic critic helps connects to what RL is actually doing to reasoning at all. Multiple lines of work find that RLVR doesn't expand a model's reasoning frontier — it sharpens sampling toward solutions already latent in the base model Does RLVR actually expand what models can reason about? What does reward learning actually do to model reasoning?, and base models turn out to carry far more latent capability than their default outputs suggest Do base models already contain hidden reasoning ability?. If the job is *selection* rather than teaching, then a critic that contrasts good against bad is doing the natural thing — discriminating — whereas an absolute scorer has to invent a correctness signal it doesn't really have, which is where spurious rewards and shortcut amplification creep in.
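The "sharpening, not expanding" claim is usually read off pass@k curves. The sketch below uses the standard unbiased pass@k estimator (the chance that at least one of k samples drawn from n attempts, c of them correct, succeeds). The per-problem correct counts are invented purely for illustration and are not results from any of the cited notes.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n attempts with c correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: List[int], k: int, n: int = 100) -> float:
    """Average pass@k over a set of problems, each sampled n times."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct counts out of n = 100 samples per problem.
base_counts = [20, 10, 5, 2]    # low hit rate, but every problem is reachable
tuned_counts = [90, 80, 60, 0]  # sharp on three problems, never solves the fourth

for k in (1, 64):
    base = mean_pass_at_k(base_counts, k)
    tuned = mean_pass_at_k(tuned_counts, k)
    print(k, round(base, 3), round(tuned, 3))
```

With these made-up counts the tuned model wins clearly at k = 1 and loses at k = 64, which is the crossover pattern the pass@k note describes: higher sampling efficiency on problems already within reach, with no new problems becoming solvable.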
There's also an adversarial-robustness angle the question's framing invites. Reasoning models are startlingly fragile to absolute-looking signals: appending irrelevant text spikes error rates How vulnerable are reasoning models to irrelevant text?, manipulative multi-turn prompts knock 25–29% off accuracy Why do reasoning models fail under manipulative prompts?, and a model can ace every fixed benchmark while its internal representation is incoherent Can AI pass every test while understanding nothing?. A fixed scorer inherits all of these blind spots as exploitable surface; a critic trained to tell expert from policy keeps adapting as the policy finds new tricks, which is the whole point of making the contest adversarial.
The thing worth taking away: the win isn't really about 'relative vs absolute math.' It's that absolute scoring assumes you possess ground truth about quality — and across this corpus, that assumption is exactly what fails, whether through calibration collapse Does binary reward training hurt model calibration?, reward-hacking, or benchmarks that can't see inside the model. A relativistic critic survives because it never claims to know the right answer in the absolute — only which of two attempts is closer to one. For the contrast case where you *do* want to shape the reasoning process rather than just rank outcomes, see how metacognitive process rewards earn their signal differently Can RL agents learn to reason better, not just succeed?.
Sources (10 notes)
RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Appending semantically unrelated sentences to math problems significantly increases error rates in reasoning models. These query-agnostic triggers discovered on cheaper models transfer effectively to stronger models and also inflate response length.
GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.
The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.
RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.