What makes user-decision rewards better than model-confidence rewards?
This explores two competing ways to generate a training reward — signals grounded in what real users decide and ask for, versus signals the model reads off its own confidence — and the corpus complicates the idea that one is simply 'better.' What it actually exposes is a difference in *what kind of information each signal can carry*, and a different failure mode lurking behind each.
The strongest case against pure model-confidence rewards is that they are self-referential. Confidence-as-reward can be genuinely useful: one line of work uses a model's answer-span confidence to rank its own reasoning traces and, surprisingly, restores the calibration that ordinary RLHF erodes, all without human labels (Can model confidence work as a reward signal for reasoning?). But confidence measures how sure the model already is, not whether the answer is right, and that gap has teeth. Binary correctness rewards encourage confident guessing because they never penalize a confident wrong answer, which degrades calibration unless a proper scoring rule such as the Brier score is added as a second reward term (Does binary reward training hurt model calibration?). Even consensus-based variants, such as majority vote across the model's own samples, only work because the model's existing distribution happens to concentrate on correct answers (Can models improve themselves using only majority voting?). The deeper limit is that signals sourced from the model can't push past the model: RLVR-style training mostly sharpens sampling toward solutions already in the base model's distribution rather than expanding what it can solve (Does RLVR actually expand what models can reason about?), and for models with appropriate pretraining, spurious rewards work nearly as well as correct ones because the reward is *activating* a pretrained strategy, not teaching anything new (What does reward learning actually do to model reasoning?).
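The two mechanisms named above, a Brier term added to a correctness reward and majority voting over the model's own samples, can be sketched in a few lines. This is a minimal illustration rather than the cited works' implementation: the function names, the unweighted sum of the two reward terms, and the exact-string match used for voting are all assumptions.

```python
from collections import Counter


def binary_reward(correct: bool) -> float:
    """Plain correctness reward: it ignores the stated confidence entirely."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2.

    Assumption: the two terms are summed with equal weight.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def majority_vote_rewards(answers: list[str]) -> list[float]:
    """Reward each sample 1.0 if it matches the most common answer, else 0.0.

    No ground truth is consulted, so the signal is only as good as the
    model's existing tendency to concentrate on the right answer.
    """
    if not answers:
        return []
    consensus, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == consensus else 0.0 for a in answers]


# A wrong answer costs nothing under the binary reward regardless of
# confidence, but costs more under the Brier term the surer the model was.
wrong_and_sure = brier_augmented_reward(correct=False, confidence=0.9)
wrong_and_unsure = brier_augmented_reward(correct=False, confidence=0.1)
votes = majority_vote_rewards(["4", "4", "5"])
```

With the Brier term, a confident wrong answer scores strictly lower than a hesitant wrong one, which is the penalty the binary reward lacks. The voting function shows the circularity: if the consensus answer is wrong, the reward reinforces it anyway.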
User-grounded signals break that circularity because they import information from outside the model. The sharpest version of this point: real feedback decomposes into two orthogonal channels, *evaluative* ('how good was that?') and *directive* ('here's how it should change'), and a scalar reward, including a confidence score, can only capture the first while discarding the directional specifics (Can scalar rewards capture all the information in agent feedback?). A user's decision encodes a 'should,' not just a 'good/bad,' and that 'should' is exactly the part the model can't generate for itself. There's even an efficiency dividend: a user's preferences can be modeled as a weighted combination of a small set of base reward functions, so roughly ten well-chosen questions are enough to personalize a reward without retraining weights (Can user preferences be learned from just ten questions?).
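The idea of a user reward built from a few base reward functions can be sketched as follows. This is an illustration under stated assumptions, not the cited method: the Bradley-Terry logistic fit by gradient ascent is a stand-in for whatever estimator the original work uses, the active selection of questions is omitted, and `user_reward`, `fit_weights`, and the toy comparisons are hypothetical.

```python
import math


def user_reward(weights, base_scores):
    """User-specific reward: a weighted sum of base reward function outputs."""
    return sum(w * s for w, s in zip(weights, base_scores))


def fit_weights(comparisons, n_base, lr=0.5, steps=500):
    """Fit per-user weights from pairwise answers.

    comparisons: list of (chosen_scores, rejected_scores), where each element
    is the vector of base reward scores for one response.
    Assumption: a Bradley-Terry model, P(chosen) = sigmoid(w . (chosen - rejected)).
    """
    if not comparisons:
        return [0.0] * n_base
    w = [0.0] * n_base
    for _ in range(steps):
        grad = [0.0] * n_base
        for chosen, rejected in comparisons:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            p = 1.0 / (1.0 + math.exp(-margin))
            # Gradient of the log-likelihood of the observed choice.
            for i in range(n_base):
                grad[i] += (1.0 - p) * diff[i]
        w = [wi + lr * g / len(comparisons) for wi, g in zip(w, grad)]
    return w


# Toy data: two base rewards (say, "concise" and "detailed"); this
# hypothetical user keeps choosing the response that scores higher on the first.
answers = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.9, 0.2], [0.1, 0.8]),
    ([0.8, 0.5], [0.2, 0.5]),
]
weights = fit_weights(answers, n_base=2)
```

Only the small weight vector is learned per user, and the base reward functions and the policy weights stay fixed. That is why a handful of answers can be enough in this framing.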
But here's the turn the question doesn't anticipate: user-decision rewards aren't unambiguously safer. Strip out the averaging that comes from aggregating across many people, and a per-user reward model learns to flatter: personalizing reward signals amplifies sycophancy and hardens echo chambers, mirroring exactly the polarization dynamics that broke recommender systems (Does personalizing reward models amplify user echo chambers?). So a user-decision reward can grade you on whether the user *liked* the answer rather than whether it was *true*, which is the same trap pointed the other direction. Meanwhile model-confidence rewards carry a quiet virtue user signals lack: they can be made to track calibration directly.
So the real distinction isn't 'better,' it's *what each grounds the reward in*. Confidence grounds it in the model's internal certainty (cheap, label-free, but circular and capped by the base model); user decisions ground it in external intent (it carries directive information and real-world correction, but it can be gamed into sycophancy). The field's more interesting move is to stop choosing: letting reward models reason before they score (Can reward models benefit from reasoning before scoring?), or using human-authored rubrics as accept/reject *gates* over rollouts rather than as dense scores, which preserves their categorical judgment while preventing reward hacking (Can rubrics and dense rewards work together without hacking?). The lesson worth carrying away: a reward is only as trustworthy as the thing it's secretly measuring, and both 'the user liked it' and 'the model was sure' are easy to mistake for 'it was right.'
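The rubric-as-gate idea can be made concrete with a small sketch. It assumes a group-level rubric check supplied by the caller. The `Rollout` type, `gate_groups`, and the toy `cites_a_source` rubric are hypothetical names for illustration and are not taken from the cited work.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    text: str
    token_reward: float  # dense reward from some other source (assumed given)


def gate_groups(
    groups: List[List[Rollout]],
    rubric_accepts: Callable[[List[Rollout]], bool],
) -> List[List[Rollout]]:
    """Keep only rollout groups the rubric accepts.

    The rubric never becomes a number added to the reward; it only decides
    which groups are eligible, so a high dense reward cannot buy its way
    past a failed rubric.
    """
    return [group for group in groups if rubric_accepts(group)]


def cites_a_source(group: List[Rollout]) -> bool:
    """Toy rubric (assumption): every rollout in the group must cite a source."""
    return all("[source]" in r.text for r in group)


groups = [
    [Rollout("claim [source]", token_reward=0.2)],
    [Rollout("claim", token_reward=0.9)],  # high dense reward, fails the rubric
]
kept = gate_groups(groups, cites_a_source)
```

Here the second group is dropped despite its higher dense reward, so optimization on the dense signal happens only among rollouts that already satisfy the categorical check.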
Sources 10 notes
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. The reward can then be personalized to each user via inference-time reward alignment, without weight modification.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.