INQUIRING LINE

When does a task lack a meaningful multi-dimensional reward structure?

This flips the question: instead of asking when richer reward signals help, it asks when a single scalar reward is genuinely enough — when a task has no real competing objectives to trade off against each other.


Most reward-design research argues for *more* dimensions; this line asks when extra dimensions are just noise. The corpus suggests a clean test: a task has a meaningful multi-dimensional reward structure only when its objectives genuinely compete, that is, when pushing on one axis costs you on another. Where there's no real trade-off, the dimensions collapse into one and a scalar reward is all you need.
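One way to make this test concrete is a dominance check over candidate solutions scored on several axes. The sketch below is illustrative only: `dominates`, `pareto_front`, and `objectives_compete` are hypothetical helper names, and it assumes higher is better on every axis. If one candidate dominates all the others, the axes never conflict on this sample, and any positively weighted sum of the axes picks the same winner.

```python
from typing import Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` on every axis and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: Sequence[Sequence[float]]) -> list[int]:
    """Indices of reward vectors that no other vector dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


def objectives_compete(points: Sequence[Sequence[float]]) -> bool:
    """Heuristic (an assumption of this sketch): the objectives compete on this
    sample if the front holds more than one distinct reward vector."""
    front = {tuple(points[i]) for i in pareto_front(points)}
    return len(front) > 1


# Axes move together: one candidate wins everywhere, so a scalar suffices here.
aligned = [(0.1, 0.2), (0.5, 0.6), (0.9, 0.9)]
# Axes trade off: several candidates are each best at something.
trade_off = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.4, 0.4)]

print(objectives_compete(aligned), objectives_compete(trade_off))
```

The check only speaks for the candidates it is given; a front with a single point on a small sample does not prove the task is one-dimensional.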

You can see the structure most clearly in the cases where it actually exists. Binary correctness alone quietly rewards confident guessing, so under a binary signal accuracy and *calibration* pull against each other, which is the case for adding a Brier-score term (see "Does binary reward training hurt model calibration?"). Accuracy and *honesty* compete too, which is why a three-way correct/hallucinate/abstain signal makes abstention learnable instead of punished (see "Can three-way rewards fix the accuracy versus abstention problem?"). And when solutions can specialize across test cases, criteria, or personas, vector rewards expose a Pareto frontier: diversity grounded in real task trade-offs rather than bolted-on regularizers (see "Can reward vectors be the hidden source of solution diversity?"). The common thread: multi-dimensional structure is *real* when there's a frontier you can't escape by optimizing harder.
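The first two cases can be sketched as small reward functions. This is a minimal illustration, not the cited methods' actual implementations: it assumes the model reports a confidence in [0, 1], it combines correctness and the Brier penalty by simple subtraction, and it sets the abstention reward to 0.0 only because the source describes it as "intermediate" between +1 and -1. The function names and outcome labels are hypothetical.

```python
def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty on the stated confidence.

    Assumption: `confidence` is the model's self-reported probability of being right.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    brier = (confidence - y) ** 2  # squared error of the stated probability
    return y - brier


def ternary_reward(outcome: str, abstain_value: float = 0.0) -> float:
    """Three-way signal: correct, hallucinated, or abstained.

    `abstain_value` is an assumed placeholder that must sit strictly between -1 and +1.
    """
    if not -1.0 < abstain_value < 1.0:
        raise ValueError("abstain_value must lie strictly between -1 and +1")
    table = {"correct": 1.0, "hallucinated": -1.0, "abstained": abstain_value}
    if outcome not in table:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return table[outcome]


# A confident wrong answer now scores worse than a hedged wrong answer,
# which a plain 0/1 correctness reward cannot express.
print(calibrated_reward(False, 0.9), calibrated_reward(False, 0.1))
```

Under the binary reward both wrong answers would score 0; the extra term is what separates them, and the intermediate abstention value is what keeps "I don't know" from being scored like a hallucination.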

So a task lacks that structure when there's no such tension, typically when the model already knows how to do the thing and the reward only has to *select*, not teach. This is the striking finding behind RLVR: training mostly activates strategies already latent in pretraining rather than expanding capability, a single example can suffice, and for models with suitable pretraining even spurious rewards work nearly as well as correct ones (see "What does reward learning actually do to model reasoning?"). In that regime, the reward isn't carrying much information at all: sophisticated domain reasoning can emerge from nothing more than a basic accuracy signal (see "Can simple rewards alone teach complex domain reasoning?"). If a crude scalar gets you most of the way, there were never many dimensions to begin with. The same logic explains why negative reinforcement *alone* can often match full RL: when the job is mostly suppressing wrong trajectories while preserving diversity, you don't even need both signs of the signal (see "Does negative reinforcement alone outperform full reinforcement learning?").

There's a subtler trap worth flagging, though: a task can *look* one-dimensional because the reward format is impoverished, not because the task is simple. Natural feedback actually carries two orthogonal things, an evaluative judgment (how good) and a directive (how to change), and a scalar throws the second away (see "Can scalar rewards capture all the information in agent feedback?"). That discarded directional information is exactly what lets language-feedback methods break through plateaus where numerical rewards stall, because the number never said *why* the answer failed (see "Can natural language feedback overcome numerical reward plateaus?"). So before concluding a task is genuinely flat, check whether you've just compressed away its structure.

The practical upshot: don't manufacture dimensions a task doesn't have, but don't mistake a flat reward signal for a flat task either. When real structure is missing, the better move is often to *derive* it: turning sparse outcomes into dense step signals from trajectory shape (see "Can trajectory structure replace hand-annotated process rewards?"), or using rubrics as gates that accept or reject rollouts rather than as extra reward terms to be hacked (see "Can rubrics and dense rewards work together without hacking?"). A task lacks meaningful multi-dimensional reward structure precisely when neither competing objectives nor recoverable feedback information is present, and that's exactly when you should stop adding reward terms and let a simple signal do its narrow job.
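The rubric-as-gate idea can be sketched as a filter that runs before any reward arithmetic. This is an assumption-laden illustration rather than the cited method: `Rollout`, `rubric_pass`, and `gated_advantages` are hypothetical names, the gate is applied per rollout, and the mean-centered advantage is a generic group baseline chosen for brevity.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rollout:
    text: str
    reward: float      # the simple scalar signal, e.g. correctness
    rubric_pass: bool  # categorical rubric verdict, assumed computed elsewhere


def gated_advantages(rollouts: list[Rollout]) -> list[tuple[Rollout, float]]:
    """Drop rollouts the rubric rejects, then mean-center rewards over the survivors.

    The rubric never becomes a reward term, so there is no rubric score to inflate.
    """
    accepted = [r for r in rollouts if r.rubric_pass]
    if not accepted:
        return []  # nothing valid to learn from in this group
    baseline = mean(r.reward for r in accepted)
    return [(r, r.reward - baseline) for r in accepted]


group = [
    Rollout("valid and correct", reward=1.0, rubric_pass=True),
    Rollout("valid but wrong", reward=0.0, rubric_pass=True),
    Rollout("high score, violates rubric", reward=5.0, rubric_pass=False),
]
result = gated_advantages(group)
print([(r.text, adv) for r, adv in result])
```

The rejected rollout's inflated reward never enters the baseline, so it cannot pull the update toward rubric-violating behavior.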


Sources (10 notes)

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Can reward vectors be the hidden source of solution diversity?

Vector Policy Optimization shows that rewards decomposed per test-case, criterion, or persona provide an inherent diversity structure. Training solutions to span the Pareto frontier across these dimensions produces competent diversity grounded in real task trade-offs rather than external regularizers.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can simple rewards alone teach complex domain reasoning?

Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Next inquiring lines