Can vector-valued rewards preserve specialization better than variance-weighted advantages?
This explores whether keeping rewards as multi-dimensional vectors — instead of collapsing them into a single number weighted by uncertainty — better protects a model's ability to specialize, where different solutions get good at different things.
The corpus's most direct answer is yes, and the reason is mechanical rather than philosophical: the moment you scalarize, you average, and averaging is where specialization dies. Vector Policy Optimization shows that when rewards are decomposed per test-case, criterion, or persona and left *unscalarized*, the dimensions themselves become a natural diversity axis: solutions can sit at different points on the Pareto frontier rather than all collapsing toward whatever the weighting favors (see "Can reward vectors be the hidden source of solution diversity?"). A variance-weighted advantage is still ultimately one scalar. It reweights how much each signal counts, but it then sums them, and that summation is exactly the step that erases the trade-off structure a specialist depends on.
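The scalarization argument can be made concrete with a toy example. The sketch below is illustrative only and is not the Vector Policy Optimization implementation: the reward vectors, the weights, and the helper names (`scalarize`, `dominates`, `pareto_front`) are all hypothetical. It shows that a weighted sum can assign identical scores to two specialists and a generalist, while the Pareto view keeps them as three distinct non-dominated points.

```python
from typing import List, Sequence


def scalarize(reward: Sequence[float], weights: Sequence[float]) -> float:
    """Collapse a reward vector into one number via a weighted sum."""
    return sum(w * r for w, r in zip(weights, reward))


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(rewards: List[Sequence[float]]) -> List[int]:
    """Indices of reward vectors that no other vector dominates."""
    return [
        i
        for i, r in enumerate(rewards)
        if not any(dominates(o, r) for j, o in enumerate(rewards) if j != i)
    ]


# Hypothetical rollouts scored on two criteria (made-up numbers).
rewards = [
    (1.0, 0.0),    # specialist on criterion A
    (0.0, 1.0),    # specialist on criterion B
    (0.5, 0.5),    # generalist
    (0.25, 0.25),  # weak on both, dominated by the generalist
]

# Equal weights: both specialists and the generalist become indistinguishable.
equal_scores = [scalarize(r, (0.5, 0.5)) for r in rewards]

# Skewed weights: the scalar now singles out whichever criterion the weighting favors.
skewed_scores = [scalarize(r, (0.75, 0.25)) for r in rewards]

# The vector view keeps all three trade-off points on the frontier.
front = pareto_front(rewards)

print(equal_scores)   # [0.5, 0.5, 0.5, 0.25]
print(skewed_scores)  # [0.75, 0.25, 0.5, 0.25]
print(front)          # [0, 1, 2]
```

Under equal weights the scalar cannot tell the three frontier points apart, and under skewed weights it ranks one specialist above the other. Only the unscalarized comparison treats both specialists as equally valid solutions.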
The clearest illustration of why collapsing to a scalar hurts comes from the personalization work. Aggregate reward models specialize *less* precisely because they average across users; remove that averaging and per-user reward models recover specialization, so much so that they can over-specialize into sycophancy and echo chambers (see "Does personalizing reward models amplify user echo chambers?"). That's the same averaging dynamic a scalarized advantage imposes, just at the level of objectives instead of users. The lesson cuts both ways: keeping signals separate preserves specialization, but unconstrained specialization is its own failure mode.
There's also a deeper information argument lurking here. Scalar rewards can't jointly carry everything in a feedback signal: agent feedback decomposes into *evaluative* (how good was this) and *directive* (how should it change) components, and a single number captures the first while discarding the second (see "Can scalar rewards capture all the information in agent feedback?"). A variance-weighted advantage is a sophisticated way of computing that single evaluative number; it can't recover the directional structure that a vector retains by construction. This is why several other approaches reach for richer-than-scalar representations. Separating rubrics-as-gates from token-level rewards keeps a categorical signal intact instead of melting it into a dense score (see "Can rubrics and dense rewards work together without hacking?"). Likewise, adding a second reward term (the Brier score) is what stops binary correctness from collapsing accuracy and calibration into a single objective that trades one off against the other (see "Does binary reward training hurt model calibration?").
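The Brier-score point can be checked with a few lines of arithmetic. The sketch below assumes one simple form of the combined reward, correctness minus the squared gap between stated confidence and correctness; the source does not give the exact formula, so `brier_reward` and the other names are hypothetical. It compares the expected reward under a binary-only signal with the expected reward once the Brier term is added, for a model whose true chance of being correct is 0.6.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """Binary correctness only: stated confidence is ignored entirely."""
    return 1.0 if correct else 0.0


def brier_reward(correct: bool, confidence: float) -> float:
    """Assumed combined form: correctness minus the Brier penalty (confidence - outcome)^2."""
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(reward_fn, p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability `p_correct`."""
    return (
        p_correct * reward_fn(True, confidence)
        + (1.0 - p_correct) * reward_fn(False, confidence)
    )


p_correct = 0.6                       # hypothetical true accuracy
grid = [i / 10 for i in range(11)]    # candidate stated confidences 0.0 .. 1.0

# Binary reward: every stated confidence earns the same expected reward,
# so nothing discourages claiming certainty.
binary_values = {expected_reward(binary_reward, p_correct, q) for q in grid}

# With the Brier term, expected reward peaks where confidence matches accuracy.
best_confidence = max(grid, key=lambda q: expected_reward(brier_reward, p_correct, q))

print(len(binary_values))  # 1
print(best_confidence)     # 0.6
```

The peak at 0.6 follows from the derivative of the expected reward with respect to the stated confidence q, which is 2(p - q) and vanishes only at q = p. The extra term therefore rewards honest confidence without lowering the reward for being correct.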
One caveat worth carrying away: whether *any* preference signal preserves or destroys diversity is domain-dependent. RLHF compresses lexical-syntactic diversity in code, where the task rewards convergence to one correct answer, but *increases* it in creative writing, where distinctiveness is the reward (see "Does preference tuning always reduce diversity the same way?"). So vector-valued rewards don't manufacture specialization out of nothing; they preserve the specialization the task actually contains. If the underlying task has genuine trade-offs to span, keeping the reward a vector lets solutions occupy them; a variance-weighted scalar, however cleverly weighted, still picks one point and pulls everyone toward it.
The less obvious takeaway: 'specialization' and 'diversity' turn out to be the same property viewed from different angles, and both are killed by the identical operation, summation. The interesting design question is therefore not 'which weighting scheme,' but 'how late can you afford to collapse the vector,' since every approach in this corpus that protects specialization does so by *delaying or refusing* the scalarization step.
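To see what collapsing early costs at the level of advantages, consider the following sketch. The source does not define the variance-weighted advantage precisely, so this assumes one hypothetical reading: each reward dimension is centered on its group mean, weighted by its share of the group's total variance, and then summed. The function names and the three-rollout group are illustrative.

```python
from statistics import mean, pvariance
from typing import List, Tuple


def per_dimension_advantages(group: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Center each reward dimension on its group mean; each rollout keeps a vector."""
    means = [mean(dim) for dim in zip(*group)]
    return [tuple(r - m for r, m in zip(rollout, means)) for rollout in group]


def variance_weighted_advantages(group: List[Tuple[float, ...]]) -> List[float]:
    """Hypothetical variance weighting: weight dimensions by variance share, then sum."""
    variances = [pvariance(dim) for dim in zip(*group)]
    total = sum(variances)
    weights = [v / total for v in variances]
    return [
        sum(w * a for w, a in zip(weights, adv))
        for adv in per_dimension_advantages(group)
    ]


# Hypothetical group of rollouts: two opposite specialists and a generalist.
group = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]

vector_advs = per_dimension_advantages(group)
scalar_advs = variance_weighted_advantages(group)

print(vector_advs)  # [(0.5, -0.5), (-0.5, 0.5), (0.0, 0.0)]
print(scalar_advs)  # [0.0, 0.0, 0.0]
```

In this symmetric group the summed advantage is zero for every rollout, so the scalar carries no signal about which specialist did what. The per-dimension advantages still record that each specialist is strong on one criterion and weak on the other.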
Sources (6 notes)
Vector Policy Optimization shows that rewards decomposed per test-case, criterion, or persona provide an inherent diversity structure. Training solutions to span the Pareto frontier across these dimensions produces competent diversity grounded in real task trade-offs rather than external regularizers.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.