INQUIRING LINE

Can proper scoring rules fix RLVR's degradation on disagreement prediction?

This explores whether adding a proper scoring rule (like the Brier score) to RLVR training could recover what RLVR erodes — a model's ability to represent the spread of legitimate human disagreement rather than collapsing to one confident answer.


The corpus suggests that the diagnosis (RLVR flattens disagreement into false certainty) and the proposed cure (a proper scoring rule) concern the same underlying mechanism, which is encouraging. It also hints that scoring rules treat a symptom of something more structural.

Start with the wound. RLVR optimizes for deterministic correctness, and that signal actively suppresses a model's sensitivity to legitimate annotation disagreement, with the worst degradation exactly where human variance is highest (Why do reasoning models fail at predicting disagreement?). This isn't an isolated quirk. The same convergence pressure shows up elsewhere: RL post-training amplifies one dominant format and collapses the alternatives within the first epoch (Does RL training collapse format diversity in pretrained models?), and RLVR narrows sampling toward solutions already in the base model rather than widening the space it explores (Does RLVR actually expand what models can reason about?). The common thread is that reward-for-correctness narrows the output distribution, whereas disagreement prediction requires a model that preserves it.

Now the proposed fix, which is the most direct answer in the corpus. Binary correctness rewards degrade calibration because a wrong answer earns the same reward whether it was stated confidently or hesitantly, so the model learns to guess loudly. Adding the Brier score as a second reward term mathematically guarantees that accuracy and calibration are optimized jointly, with no trade-off between them (Does binary reward training hurt model calibration?). That maps almost perfectly onto the disagreement failure: a model that suppresses its uncertainty is the same model that can't tell you 60% of annotators saw it one way. A proper scoring rule reintroduces the cost of misplaced confidence, which is precisely the cost RLVR removed.
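To make the mechanism concrete, here is a minimal sketch of a correctness reward with a Brier term. The cited note does not spell out the exact formula, so the form used here (correctness minus the squared error of a stated confidence) is an assumption, and all function names are hypothetical. The grid search is only an illustration that, under this form, the expected reward peaks when the stated confidence equals the true chance of being correct.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: 1 for a correct answer, 0 otherwise."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the stated confidence.

    Assumed form for illustration: outcome - (confidence - outcome)^2.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = binary_reward(correct)
    return outcome - (confidence - outcome) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * brier_augmented_reward(True, confidence)
            + (1.0 - p_correct) * brier_augmented_reward(False, confidence))


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(p_correct, q))
```

Under the binary reward, a wrong answer scores 0 at any confidence. Under the sketch above, a wrong answer stated with full confidence scores -1 while a wrong answer stated with zero confidence scores 0, so loud guessing is no longer free.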

But here's the thing the reader might not expect: calibrating a single model's confidence and representing genuine disagreement may not be the same target. A scoring rule can make one prediction honestly uncertain, yet there is a deeper representational ceiling: a single aggregate reward model structurally cannot encode a 51-49 split. It is forced either to disappoint 49% always or to disappoint everyone about half the time; this is a representational failure, not a calibration one (Can aggregate reward models satisfy genuinely disagreeing users?). A Brier term can fix how confident a model is about its one answer. It cannot, by itself, give the model two answers. Note also that even a perfectly deterministic, zero-temperature output is still just one draw from a distribution, so consistency is not the same as faithfully reporting the spread (Does setting temperature to zero actually make LLM outputs reliable?).
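The 51-49 arithmetic can be checked with a small sketch. It assumes a toy setup that is not in the source: two answers, A and B, a population in which a fixed share prefers A, and a model that outputs A with some probability. The function names are hypothetical.

```python
def satisfaction_by_group(p_answer_a: float) -> dict:
    """How often each group gets its preferred answer.

    p_answer_a is the probability that the model outputs answer A.
    """
    if not 0.0 <= p_answer_a <= 1.0:
        raise ValueError("p_answer_a must lie in [0, 1]")
    return {"prefers_a": p_answer_a, "prefers_b": 1.0 - p_answer_a}


def overall_satisfaction(p_answer_a: float, share_prefers_a: float = 0.51) -> float:
    """Population-wide satisfaction rate for a given output policy."""
    groups = satisfaction_by_group(p_answer_a)
    return (share_prefers_a * groups["prefers_a"]
            + (1.0 - share_prefers_a) * groups["prefers_b"])
```

Always answering A (`p_answer_a=1.0`) satisfies the 51% group every time and the 49% group never. Mirroring the split (`p_answer_a=0.51`) satisfies each group roughly half the time. Neither policy gives both groups what they want, whatever confidence is attached to the output.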

So the honest synthesis: proper scoring rules are the right family of tool, and the corpus gives strong evidence, up to a claimed mathematical guarantee, that they reverse the calibration half of the damage (Does binary reward training hurt model calibration?). Whether they fully fix *disagreement prediction* depends on whether the failure is the model being overconfident (scoring rules help) or the architecture being unable to hold multiple valid views at once (scoring rules don't, and you need something like distributional or multi-objective rewards instead). The corpus frames RLVR's narrowing as a feature of the optimization, not a bug to be patched at the loss layer alone (Does RLVR actually expand what models can reason about?; Does RL training collapse format diversity in pretrained models?), which suggests a scoring rule is a real repair, but a partial one.
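As one illustration of what a distributional reward might look like, the sketch below scores a predicted label distribution against annotator vote shares using a multi-class Brier-style penalty. This is an assumption for illustration, not a method described in the corpus; `distribution_reward` and the label names are hypothetical.

```python
def distribution_reward(predicted: dict, human: dict) -> float:
    """Negative squared distance between predicted and annotator label shares.

    Both arguments map each label to a probability; 0.0 is the best score.
    """
    if predicted.keys() != human.keys():
        raise ValueError("both distributions must cover the same labels")
    for name, dist in (("predicted", predicted), ("human", human)):
        if abs(sum(dist.values()) - 1.0) > 1e-9:
            raise ValueError(f"{name} distribution must sum to 1")
    return -sum((predicted[label] - human[label]) ** 2 for label in human)


# Annotators split 60/40; a collapsed prediction is penalized,
# while a prediction that reports the split is not.
human_votes = {"label_a": 0.6, "label_b": 0.4}
collapsed = distribution_reward({"label_a": 1.0, "label_b": 0.0}, human_votes)
faithful = distribution_reward({"label_a": 0.6, "label_b": 0.4}, human_votes)
```

The design point is that the target here is the annotator distribution itself rather than a single correct label, so a model is rewarded for reporting the spread rather than for picking the majority answer.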


Sources (6 notes)

Why do reasoning models fail at predicting disagreement?

RLVR-trained models degrade significantly at predicting human disagreement distributions, especially when variance is high. The optimization signal for deterministic correctness actively erodes the model's ability to represent multiple valid interpretations.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can aggregate reward models satisfy genuinely disagreeing users?

Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Next inquiring lines