Can rubrics and dense rewards work together without hacking?
Explores whether reward signals derived from rubrics suffer from exploitation, and whether separating rubric judgments from optimization signals could prevent this failure mode.
A familiar temptation in RL on unverifiable tasks: take a rubric that says "good answers do X, Y, Z," score every rollout against it, and treat the score as a dense reward. DRO argues this is exactly the wrong move. The underlying problem is real: token-level dense rewards alone are vulnerable to reward hacking, because a rollout group can produce uniformly low-quality answers that still differ from one another under the token-level metric, and those relative differences mislead the gradient. Rubrics supply the supervision that fixes this. But converting rubric judgments into dense rewards is brittle: rubric scores are noisy, gameable, and discontinuous in ways that dense gradients amplify.
The architectural alternative is to use rubrics as gates rather than as rewards. A rollout group is accepted or rejected based on whether it meets essential task criteria. Rollouts that fail are dropped — they do not contribute to the gradient at all. Rollouts that pass go forward to the token-level dense reward. The two signals serve different functions: the rubric defines feasibility (a hard boundary on what counts as a valid answer); the dense reward defines optimization direction (how to improve among valid answers).
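The gate-then-optimize flow described above can be sketched as follows. This is a minimal illustration, not DRO's actual implementation. The `Rollout` type, the scalar `dense_reward` field (standing in for a token-level reward), the `passes_rubric` callback, and the group-normalized advantage are all assumptions made for the example. The note mentions gating both rollout groups and individual rollouts; this sketch gates per rollout and skips the whole group when fewer than two rollouts survive.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Optional


@dataclass
class Rollout:
    text: str
    dense_reward: float  # hypothetical scalar summary of the token-level reward


def gated_advantages(
    rollouts: List[Rollout],
    passes_rubric: Callable[[Rollout], bool],  # hypothetical hard accept/reject check
    eps: float = 1e-8,
) -> List[Optional[float]]:
    """Return one advantage per rollout; None means 'dropped, no gradient'."""
    mask = [passes_rubric(r) for r in rollouts]
    kept = [r.dense_reward for r, ok in zip(rollouts, mask) if ok]

    # Assumption: with fewer than two valid rollouts there is no meaningful
    # relative signal, so the whole group is skipped.
    if len(kept) < 2:
        return [None] * len(rollouts)

    # Normalize the dense reward among valid rollouts only, so rejected
    # rollouts cannot shift the baseline.
    mu = mean(kept)
    sigma = pstdev(kept)
    return [
        (r.dense_reward - mu) / (sigma + eps) if ok else None
        for r, ok in zip(rollouts, mask)
    ]


group = [
    Rollout("cites a source, weak argument", 1.0),
    Rollout("cites a source, strong argument", 3.0),
    Rollout("no source, fluent filler", 10.0),
]
advantages = gated_advantages(group, lambda r: r.text.startswith("cites"))
```

In this toy group the third rollout has the highest dense reward, but it fails the rubric and so neither receives an advantage nor influences the baseline used for the other two.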
The separation matters because the two signals have different statistical properties. Rubric judgments are good at hard accept/reject decisions ("does this answer cite a source?") and bad at dense gradient supervision ("how much better is answer A than answer B at citing sources?"). Dense rewards are good at fine-grained gradient supervision and bad at hard constraints. Each does what it does well; mixing them inherits the failure modes of both.
The principle generalizes beyond DRO. Whenever an RL setup has both a fine-grained quality signal and a categorical correctness signal, treating the categorical signal as a multiplicative gate rather than as an additive reward preserves its categorical nature and prevents the dense optimizer from finding loopholes in the categorical judgment.
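A toy comparison shows the difference between the two ways of combining the signals. The function names, the `weight` parameter, and the numbers are hypothetical, chosen only to illustrate how an additive bonus can be outweighed by the dense term while a multiplicative gate cannot.

```python
def additive_reward(dense: float, passed: bool, weight: float = 1.0) -> float:
    """Categorical check folded in as a bonus: it can be traded against dense score."""
    return dense + weight * float(passed)


def gated_reward(dense: float, passed: bool) -> float:
    """Categorical check as a multiplicative gate: failure zeroes the reward."""
    return dense * float(passed)


# An invalid answer with a high dense score vs. a valid answer with a modest one.
invalid_dense, valid_dense = 2.5, 1.0

additive_prefers_invalid = (
    additive_reward(invalid_dense, False) > additive_reward(valid_dense, True)
)
gate_prefers_valid = (
    gated_reward(invalid_dense, False) < gated_reward(valid_dense, True)
)
```

Under the additive form the invalid answer scores 2.5 against 2.0 and wins; under the gate it scores 0.0 against 1.0 and loses. The multiplicative form assumes a non-negative dense reward. If the dense reward can go negative, a zeroed reward would rank above valid answers, and dropping failed rollouts from the batch is the safer equivalent.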
Related concepts in this collection
- Can we identify which tokens actually matter for reasoning?
  Most tokens in an answer are determined by language patterns rather than reasoning. Is there a way to distinguish the small fraction of tokens whose certainty genuinely depends on the chain of thought?
  DRO's other component: what to do *within* the gate
- Does optimizing against monitors destroy monitoring itself?
  Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.
  generalizes the reward-hacking risk: any constraint folded into the reward becomes a target the optimizer learns to circumvent
- Can one statistical measure serve dual purposes in RL training?
  Explores whether cross-rollout variance can simultaneously weight important tokens and filter low-signal queries, potentially unlocking efficiency gains in reasoning tasks without human labels.
  the third complementary signal in DRO
Original note title
separating optimization from feasibility — dense token-level rewards plus rubric hard-gates on final answers — prevents the reward hacking that pure rubric-derived rewards invite