INQUIRING LINE

What makes binary rewards more effective than richer reward signals?

This explores a counterintuitive claim — that simple binary (right/wrong) rewards can outperform richer, more informative ones — and the corpus suggests the answer is less 'binary is better' and more 'binary works because of what it triggers, not what it teaches,' with real costs alongside the wins.


The corpus reframes the question before answering it: binary rewards don't *teach* much, they *activate* what's already latent. The clearest version of this comes from work showing that reinforcement learning with verifiable rewards (RLVR) mostly improves how efficiently a model samples from capabilities it already had, rather than expanding them (What does reward learning actually do to model reasoning?). If the reward's real job is to surface pretrained behavior, then a crude binary signal is enough. The striking evidence is that even *spurious* rewards, random or outright incorrect, still lift Qwen2.5-Math by 16-25% on MATH-500 while leaving Llama and OLMo unchanged (Why do random rewards improve reasoning for some models but not others?). The effectiveness isn't in the signal's richness; it's in whether the model's pretraining left something for the signal to switch on.

There's a second, subtler mechanism: simplicity preserves diversity. Training on *only* negative samples, suppressing wrong trajectories without rewarding specific right ones, improves Pass@k across the range of k and often matches full PPO and GRPO, whereas positive-only reinforcement concentrates probability mass and degrades performance at higher sampling budgets (Does negative reinforcement alone outperform full reinforcement learning?). A richer reward that says 'this exact answer is great' can over-commit the model; a sparse signal that just prunes failures leaves room to keep exploring. So part of binary's edge is that it under-specifies on purpose.
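To make the diversity argument concrete, here is a minimal sketch on a toy softmax policy over three answers. It is not the setup from the cited note: the three-answer problem, the starting logits, and the helper names `softmax` and `reinforce_step` are all assumptions chosen for illustration. It applies one plain REINFORCE update in each regime and compares what happens to the less likely of two correct answers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, lr=1.0):
    """One REINFORCE update: logits += lr * reward * d(log p(action)) / d(logits)."""
    probs = softmax(logits)
    return [
        x + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]

# Hypothetical toy problem: answers 0 and 1 are both correct, answer 2 is wrong.
start_logits = [1.0, 0.0, 1.0]
start = softmax(start_logits)

# Negative-only: the sampled wrong answer (2) gets reward -1; correct samples are ignored.
neg_only = softmax(reinforce_step(start_logits, action=2, reward=-1.0))

# Positive-only: the sampled correct answer (0) gets reward +1; wrong samples are ignored.
pos_only = softmax(reinforce_step(start_logits, action=0, reward=+1.0))

for name, probs in [("start", start), ("negative-only", neg_only), ("positive-only", pos_only)]:
    print(f"{name:>14}: " + ", ".join(f"{p:.3f}" for p in probs))
```

In this toy case the negative-only step raises the probability of both correct answers, while the positive-only step raises the sampled correct answer at the expense of the other correct one. That is the concentration effect the paragraph describes, shown at a scale far smaller than any real training run.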

But the corpus is honest that 'more effective' is conditional, and richer signals win on the dimensions binary ignores. Binary correctness rewards degrade calibration: they reward confident guessing because they never punish a confident wrong answer, and you need a second term like the Brier score to fix it (Does binary reward training hurt model calibration?). Make the reward three-way instead of two-way and you can teach a model *when to abstain*, cutting hallucinations by 28.9% and improving truthfulness by 21.1% relative to binary-reward RL (Can three-way rewards fix the accuracy versus abstention problem?). And when models hit a plateau a numerical reward can't move, natural-language critiques explaining *why* an answer failed break through it (Can natural language feedback overcome numerical reward plateaus?), because a scalar captures evaluation but throws away direction (Can scalar rewards capture all the information in agent feedback?).
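The two reward fixes in this paragraph can be sketched in a few lines. The exact forms below are assumptions, not taken from the notes: the calibrated reward is written as correctness minus the squared gap between stated confidence and outcome, and the abstention reward defaults to 0.0 because the note only says it sits between +1 and -1. All function names are hypothetical.

```python
def binary_reward(correct: bool) -> float:
    """Plain right/wrong reward: stated confidence never enters."""
    return 1.0 if correct else 0.0

def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty on the stated confidence (assumed form)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

def expected_calibrated_reward(p_correct: float, confidence: float) -> float:
    """Expected Brier-augmented reward when the answer is right with probability p_correct."""
    return (p_correct * brier_augmented_reward(True, confidence)
            + (1.0 - p_correct) * brier_augmented_reward(False, confidence))

def ternary_reward(outcome: str, abstain_reward: float = 0.0) -> float:
    """Three-way reward: correct +1, hallucination -1, abstention in between.

    abstain_reward=0.0 is an illustrative assumption; the source only says 'intermediate'.
    """
    table = {"correct": 1.0, "abstain": abstain_reward, "hallucination": -1.0}
    if outcome not in table:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return table[outcome]

# A model that is right 30% of the time: compare honest vs. overconfident reporting.
for q in (0.0, 0.3, 1.0):
    print(f"confidence={q:.1f}  expected reward={expected_calibrated_reward(0.3, q):+.3f}")
```

Under the binary reward the expected payoff for that 30%-accurate model is the same whatever confidence it states, so nothing discourages claiming certainty. With the Brier term, the expected reward in this sketch peaks when stated confidence equals the true hit rate.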

The synthesis worth taking away: binary rewards are most effective exactly when the task is *activation* — eliciting latent ability, pruning failures, keeping diversity intact — and least effective when the task is *teaching something new*: calibrating confidence, learning to say 'I don't know,' or escaping a plateau. A promising middle path is keeping the signal categorical where it's strong but using it as a gate rather than a dense score: rubrics that accept or reject whole rollouts prevent the reward-hacking that creeps in when you convert rich judgments into dense numbers (Can rubrics and dense rewards work together without hacking?). So the real lesson isn't binary-vs-rich — it's matching the coarseness of the reward to whether you're unlocking a capability or installing one.
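The gate-versus-blend distinction can be illustrated with a small sketch. It is a simplification under stated assumptions: the `Rollout` fields, the blending weight, and the example scores are hypothetical, and the gate is applied per rollout here, whereas the source note describes accepting or rejecting rollout groups.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    text: str
    rubric_pass: bool    # categorical rubric verdict (hypothetical field)
    dense_score: float   # dense, token-level style score (hypothetical field)

def blended_scores(rollouts, rubric_weight=0.5):
    """Naive approach: fold the rubric verdict into one dense number."""
    return [
        (r, r.dense_score + rubric_weight * (1.0 if r.rubric_pass else 0.0))
        for r in rollouts
    ]

def gated_scores(rollouts):
    """Gate approach: the rubric only accepts or rejects; dense scores rank the survivors."""
    return [(r, r.dense_score) for r in rollouts if r.rubric_pass]

def best(scored):
    """Return the rollout with the highest score."""
    return max(scored, key=lambda pair: pair[1])[0]

rollouts = [
    Rollout("A: valid, solid", rubric_pass=True, dense_score=0.6),
    Rollout("B: fails rubric, inflated dense score", rubric_pass=False, dense_score=1.4),
    Rollout("C: valid, weaker", rubric_pass=True, dense_score=0.4),
]

print("blended winner:", best(blended_scores(rollouts)).text)
print("gated winner:  ", best(gated_scores(rollouts)).text)
```

With these made-up numbers the blended score still prefers the rollout that fails the rubric, because a large enough dense score outweighs the rubric bonus. The gate removes that rollout before any dense comparison happens, which is the property the paragraph credits with preventing reward hacking.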


Sources (8 notes)

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math improves by 16-25% on MATH-500 when trained with random or incorrect rewards, because those rewards activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
