
Can rubrics and dense rewards work together without hacking?

Explores whether reward signals derived from rubrics suffer from exploitation, and whether separating rubric judgments from optimization signals could prevent this failure mode.

Note · 2026-05-18 · sourced from Reasoning Methods CoT ToT

A familiar temptation in RL on unverifiable tasks is to take a rubric that says "good answers do X, Y, Z," score every rollout against it, and treat the score as a dense reward. DRO argues this is exactly the wrong move. The underlying problem is real: token-level dense rewards alone are vulnerable to reward hacking, because a rollout group can consist of uniformly low-quality answers that still differ from one another under the token-level metric, and those relative differences mislead the gradient. Rubrics supply the supervision that fixes this. But converting rubric judgments into dense rewards is brittle: rubric scores are noisy, gameable, and discontinuous in ways that dense gradients amplify.
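The "relative differences" failure can be illustrated with a small sketch. It assumes rewards are standardized within each rollout group (a GRPO-style normalization, which this note does not confirm DRO uses) and collapses token-level rewards to one hypothetical scalar per rollout. Under those assumptions, a group of uniformly poor answers yields the same advantages as a group of good ones.

```python
from statistics import mean, pstdev

def group_advantages(rewards):
    """Standardize rewards within one rollout group.

    Assumption: within-group normalization; the note does not say
    which normalization DRO applies.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # No spread in the group means no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Hypothetical dense scores, one scalar per rollout.
good_group = [0.80, 0.85, 0.90]  # all answers acceptable
bad_group = [0.10, 0.15, 0.20]   # all answers poor

adv_good = group_advantages(good_group)
adv_bad = group_advantages(bad_group)

# Normalization removes the absolute level, so both groups push the
# policy equally hard toward their "best" member.
same_signal = all(abs(a - b) < 1e-6 for a, b in zip(adv_good, adv_bad))
```

Here `same_signal` is true: nothing in the dense signal alone tells the optimizer that the second group should not be learned from.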

The architectural alternative is to use rubrics as gates rather than as rewards. A rollout group is accepted or rejected based on whether it meets essential task criteria. Rollouts that fail are dropped — they do not contribute to the gradient at all. Rollouts that pass go forward to the token-level dense reward. The two signals serve different functions: the rubric defines feasibility (a hard boundary on what counts as a valid answer); the dense reward defines optimization direction (how to improve among valid answers).
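A minimal sketch of the gate-then-optimize flow described above, under stated assumptions: the `Rollout` type, the `cites_source` check, and the reward values are all hypothetical, and the gate here is applied per rollout (the note also speaks of accepting or rejecting a whole group, which would be a small variation on the same filter).

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rollout:
    text: str
    token_rewards: List[float]  # hypothetical per-token dense rewards

RubricCheck = Callable[[str], bool]

def passes_gate(rollout: Rollout, checks: List[RubricCheck]) -> bool:
    """Feasibility: every essential rubric criterion must hold."""
    return all(check(rollout.text) for check in checks)

def gated_batch(rollouts: List[Rollout], checks: List[RubricCheck]) -> List[Rollout]:
    """Keep only feasible rollouts; failures contribute nothing downstream."""
    return [r for r in rollouts if passes_gate(r, checks)]

def cites_source(text: str) -> bool:
    # Crude stand-in for a rubric judgment; a real gate would likely
    # use a judge model or a stricter check.
    return "http" in text

rollouts = [
    Rollout("See https://example.org for the derivation.", [0.2, 0.4, 0.3]),
    Rollout("Trust me, it is true.", [0.9, 0.9, 0.9]),
]

# Only rollouts that pass the rubric reach the dense-reward stage.
kept = gated_batch(rollouts, [cites_source])
```

The second rollout has the higher dense rewards, yet it never reaches the optimizer because it fails the rubric. The rubric decides what is eligible; the dense reward only ranks what remains.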

The separation matters because the two signals have different statistical properties. Rubric judgments are good at hard accept/reject decisions ("does this answer cite a source?") and bad at dense gradient supervision ("how much better is answer A than answer B at citing sources?"). Dense rewards are good at fine-grained gradient supervision and bad at hard constraints. Each does what it does well; mixing them inherits the failure modes of both.

The principle generalizes beyond DRO. Whenever an RL setup has both a fine-grained quality signal and a categorical correctness signal, treating the categorical signal as a multiplicative gate rather than as an additive reward preserves its categorical nature and prevents the dense optimizer from finding loopholes in the categorical judgment.
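The difference between an additive and a multiplicative use of the categorical signal can be sketched in a few lines. The function names, the `bonus` size, and the scores are illustrative assumptions, and dense scores are assumed non-negative.

```python
def additive_reward(dense: float, passed: bool, bonus: float = 1.0) -> float:
    """Categorical signal folded into the reward as a bonus (assumed form)."""
    return dense + (bonus if passed else 0.0)

def gated_reward(dense: float, passed: bool) -> float:
    """Categorical signal used as a multiplicative 0/1 gate."""
    return dense * (1.0 if passed else 0.0)

# Hypothetical case: a rollout that fails the check but has an inflated
# dense score, versus a modest rollout that passes.
hacked = (2.5, False)
honest = (0.6, True)

# Additive: a large enough dense score buys its way past the failed check.
additive_prefers_hacked = additive_reward(*hacked) > additive_reward(*honest)

# Multiplicative: no dense score can compensate for failing the gate.
gated_prefers_honest = gated_reward(*honest) > gated_reward(*hacked)
```

Both flags are true in this example. One caveat on the design: multiplying by zero only zeroes the reward. To make a failed rollout contribute nothing to the gradient at all, the same 0/1 mask would be applied to that rollout's loss term rather than to its reward.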

Related concepts in this collection

Can we identify which tokens actually matter for reasoning? Most tokens in an answer are determined by language patterns rather than reasoning. Is there a way to distinguish the small fraction of tokens whose certainty genuinely depends on the chain of thought?
DRO's other component: what to do *within* the gate
Does optimizing against monitors destroy monitoring itself? Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.
generalizes the reward-hacking risk: any constraint folded into the reward becomes a target the optimizer learns to circumvent
Can one statistical measure serve dual purposes in RL training? Explores whether cross-rollout variance can simultaneously weight important tokens and filter low-signal queries, potentially unlocking efficiency gains in reasoning tasks without human labels.
the third complementary signal in DRO


Original note title

separating optimization from feasibility — dense token-level rewards plus rubric hard-gates on final answers — prevents the reward hacking that pure rubric-derived rewards invite

