Why do veto mechanisms on critical dimensions prevent collapse into exploitable reward modes?

This explores why treating a critical quality dimension as a hard gate (reject the whole output if it fails) resists reward hacking better than folding that dimension into a weighted score the optimizer can trade against.

The corpus has a sharp answer, and it starts with a structural point about what scoring loses. The clearest demonstration is DRO ("Can rubrics and dense rewards work together without hacking?"), which shows that using rubrics as *gates* to accept or reject whole rollout groups prevents reward hacking, while converting those same rubric scores into a dense numeric reward invites it. The reason is the arithmetic of weighted sums: once a critical dimension becomes one number among many, an optimizer can earn a high total by piling up cheap wins elsewhere and absorbing the penalty on the dimension that actually matters. A *veto* removes that trade: there is no score to compensate against, so the only path to reward runs through satisfying the constraint first.
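The trade-off can be made concrete with a minimal sketch. Everything in it is an assumption for illustration: the dimension names, the equal weights, the 0.5 pass mark, and the per-output gate (DRO as described in the source note gates whole rollout groups, which this toy version does not model).

```python
from typing import Dict

# Hypothetical weights; "safety" stands in for the critical dimension.
WEIGHTS: Dict[str, float] = {
    "safety": 0.25,
    "fluency": 0.25,
    "helpfulness": 0.25,
    "style": 0.25,
}
CRITICAL = "safety"
THRESHOLD = 0.5  # assumed pass mark for the critical dimension


def weighted_reward(scores: Dict[str, float]) -> float:
    """Blend every dimension into one tradeable number."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def gated_reward(scores: Dict[str, float]) -> float:
    """Veto: reject the whole output if the critical dimension fails."""
    if scores[CRITICAL] < THRESHOLD:
        return 0.0
    return weighted_reward(scores)


# An output that satisfies the critical dimension but is mediocre elsewhere.
honest = {"safety": 0.9, "fluency": 0.5, "helpfulness": 0.5, "style": 0.5}
# An output that fails the critical dimension but maxes out the cheap ones.
exploit = {"safety": 0.1, "fluency": 1.0, "helpfulness": 1.0, "style": 1.0}

for name, scores in (("honest", honest), ("exploit", exploit)):
    print(name, round(weighted_reward(scores), 3), round(gated_reward(scores), 3))
```

Under the blended score the exploit outranks the honest output, because three cheap wins outweigh one penalty. Under the gate the exploit earns nothing, so the ranking flips.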

Why collapse happens without the veto is itself a corpus theme. The self-improvement note ("Can models reliably improve themselves without external feedback?") names reward hacking as one of the structural ways optimization degenerates when there is no external anchor: models drift toward whatever the metric rewards rather than what it was meant to measure. A veto on a critical dimension acts as exactly that anchor, a piece of signal the policy cannot negotiate around. Strip it out and you get the failure DRO warns about; keep it categorical and you preserve a floor the optimizer can't buy its way past.

There's a deeper reason scalar rewards are so easy to game, and two notes point at it from different angles. Agent feedback decomposes into *evaluative* and *directive* parts ("Can scalar rewards capture all the information in agent feedback?"), and a single scalar captures the 'how good' while discarding the 'what to fix.' Critique-GRPO ("Can natural language feedback overcome numerical reward plateaus?") makes the same point from the plateau side: numerical rewards lack the information about *why* a failure happened, which is why language critiques can break through plateaus that numerical rewards alone leave in place. A reward mode is exploitable precisely because it has compressed away the structure that would catch the exploit, so a veto, which preserves the categorical 'this is disqualifying' verdict, recovers information the scalar threw out.
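A small data-structure sketch shows what the compression discards. The `Feedback` type, its field names, and the two example records are hypothetical and chosen only to illustrate the point; none of them come from the cited notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    score: float         # evaluative: how good the action was
    directive: str       # directive: what should change
    disqualifying: bool  # categorical verdict a veto would act on


def to_scalar(feedback: Feedback) -> float:
    """Collapse feedback to the single number a scalar reward keeps."""
    return feedback.score


minor = Feedback(score=0.4, directive="tighten the summary", disqualifying=False)
fatal = Feedback(score=0.4, directive="remove the fabricated citation", disqualifying=True)

# Two very different failures become indistinguishable after compression.
print(to_scalar(minor) == to_scalar(fatal))
# The categorical field still separates them.
print(minor.disqualifying, fatal.disqualifying)
```

Once both records map to the same scalar, an optimizer has no signal telling it that one failure is disqualifying. A veto reads the categorical field directly, so that distinction survives.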

The complementary lever is making the reward itself harder to game rather than gating it after the fact. Causal reward modeling via counterfactual invariance ("Can counterfactual invariance eliminate reward hacking biases?") forces the reward to stay constant when irrelevant variables change, which eliminates length bias, sycophancy, and other spurious shortcuts: the exact 'exploitable modes' the question asks about. Read together, the corpus describes two defenses against the same disease: veto-as-gate keeps a critical dimension uncompromisable, and causal invariance keeps the optimizer from confusing a spurious feature for quality in the first place. The unifying insight worth taking away: reward hacking isn't a model being clever; it's an artifact of compressing multi-dimensional judgment into a single tradeable number, and the fix is to refuse the compression exactly where it's most dangerous.
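The invariance property can be probed with a simple diagnostic. The sketch below is not the training method from the cited note, which constrains the reward model during training. It only checks, after the fact, whether a reward changes when an irrelevant variable does. The toy reward functions, the filler text, and the `invariance_gap` helper are all assumptions for illustration.

```python
from typing import Callable

# Hypothetical padding that adds length but no content.
FILLER = " To elaborate further, as noted above."


def length_biased_reward(answer: str) -> float:
    """Toy reward that leaks length: correctness plus a bonus per word."""
    correct = 1.0 if "42" in answer else 0.0
    return correct + 0.01 * len(answer.split())


def length_invariant_reward(answer: str) -> float:
    """Toy reward that depends only on correctness."""
    return 1.0 if "42" in answer else 0.0


def invariance_gap(reward: Callable[[str], float], answer: str, repeats: int = 5) -> float:
    """Absolute reward change when only the irrelevant variable (padding) changes."""
    padded = answer + FILLER * repeats
    return abs(reward(padded) - reward(answer))


answer = "The result is 42."
print(invariance_gap(length_biased_reward, answer) > 0)
print(invariance_gap(length_invariant_reward, answer) == 0.0)
```

A nonzero gap means padding alone moves the reward, which is exactly the kind of spurious shortcut an optimizer can exploit. A zero gap on this one perturbation does not prove invariance in general; it only rules out this particular shortcut.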

Sources (5 notes)

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can counterfactual invariance eliminate reward hacking biases?

Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.
