Can reward vectors be the hidden source of solution diversity?
Standard RL collapses multi-dimensional rewards into scalars before training, losing the natural structure that could drive diverse specialization. What if that vector structure itself is the diversity axis?
Diversity objectives in RL often feel arbitrary: you bolt on an entropy bonus or a novelty bonus and hope it spreads the policy without wrecking quality. Vector Policy Optimization (VPO) observes that the diversity axis is frequently already present in the reward structure and just gets thrown away. Rewards are vector-valued in practice: per-test-case correctness in code generation, per-criterion ratings in RLHF, per-sub-question success in multi-hop reasoning, or scores from multiple user personas or reward models. Standard pipelines scalarize this vector into one number before computing advantages, discarding the component structure.
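To make the information loss concrete, here is a minimal sketch. The two candidates, the four test cases, and the helper `scalarize_mean` are hypothetical and chosen only for illustration; uniform averaging stands in for whatever scalarization a given pipeline actually uses.

```python
# Hypothetical per-test-case pass/fail vectors for two candidate programs.
# The four test cases and both candidates are made up for illustration.
reward_a = [1.0, 1.0, 0.0, 0.0]  # passes the common-path tests only
reward_b = [0.0, 0.0, 1.0, 1.0]  # passes the edge-case tests only


def scalarize_mean(reward_vector):
    """Uniform scalarization: average the components into one number."""
    return sum(reward_vector) / len(reward_vector)


scalar_a = scalarize_mean(reward_a)
scalar_b = scalarize_mean(reward_b)

# Both candidates receive the same scalar reward, so a pipeline that only
# sees the scalar cannot tell them apart, even though they pass disjoint tests.
same_scalar = scalar_a == scalar_b
different_vectors = reward_a != reward_b
print(scalar_a, scalar_b, same_scalar, different_vectors)
```

After averaging, the two candidates are indistinguishable to the learner, although their reward vectors show complementary strengths.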
The pattern: keep the vector, and use its components as the dimensions along which solutions specialize. Rather than collapsing onto a single Pareto point, VPO combines multi-answer generation with stochastic reward scalarizations, training the model to emit a set of candidates that span the Pareto frontier — one solution that nails edge-case tests, another that optimizes the common path, another that trades correctness for brevity. The diversity is grounded in real trade-offs the task already encodes rather than imposed by an external regularizer, which is why it produces competent diversity rather than noise.
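A rough sketch of what stochastic scalarization could look like follows. The note does not say how VPO samples its scalarization weights or estimates advantages, so everything here is an assumption: the uniform simplex sampling, the mean-centered group baseline, the candidate names, and the reward values are all hypothetical.

```python
import random


def sample_simplex_weights(dim, rng):
    """Draw weights uniformly from the probability simplex.

    Normalizing i.i.d. Exponential(1) draws gives a Dirichlet(1, ..., 1) sample.
    This sampling choice is an assumption, not taken from the paper.
    """
    draws = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vector, weights):
    """Weighted sum of reward components under one sampled weighting."""
    return sum(w * r for w, r in zip(weights, reward_vector))


def group_advantages(reward_vectors, weights):
    """Mean-centered advantage of each candidate under one scalarization."""
    scalars = [scalarize(rv, weights) for rv in reward_vectors]
    baseline = sum(scalars) / len(scalars)
    return [s - baseline for s in scalars]


rng = random.Random(0)

# Hypothetical reward vectors: (common-path score, edge-case score, brevity score).
candidates = {
    "edge_case_specialist": [0.2, 1.0, 0.5],
    "common_path_specialist": [1.0, 0.2, 0.5],
    "brevity_specialist": [0.5, 0.5, 1.0],
    "dominated": [0.1, 0.1, 0.1],
}
names = list(candidates)
vectors = [candidates[n] for n in names]

# Count which candidate has the highest advantage under each random weighting.
win_counts = {n: 0 for n in names}
for _ in range(1000):
    weights = sample_simplex_weights(3, rng)
    adv = group_advantages(vectors, weights)
    best = max(range(len(names)), key=lambda i: adv[i])
    win_counts[names[best]] += 1

print(win_counts)
```

In this toy setup each specialist is the top candidate under some weightings, while the dominated candidate never is. That is the intuition behind spanning the frontier: different scalarizations reward different competent trade-offs, and none of them rewards a solution that is worse on every component.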
Why it matters: it reframes the question "where does diversity come from?" The answer is that the multi-objective structure of the reward is the diversity structure, latent until you stop scalarizing. This connects diversity-for-search to the broader multi-objective RL problem: one method (DVAO) wants to balance the components of a vector reward for stability, while VPO wants to spread solutions across those same components to cover the frontier. The counterpoint is that not every task has a meaningful reward vector: single-answer verifiable tasks with one binary reward offer no natural axis to specialize along.
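The counterpoint can be illustrated with a small Pareto-front filter. This is a generic non-dominated filter rather than anything specific to VPO, and the reward values are hypothetical.

```python
def dominates(u, v):
    """True if u is at least as good as v everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(reward_vectors):
    """Distinct reward vectors that no other vector dominates."""
    distinct = sorted(set(map(tuple, reward_vectors)))
    return [u for u in distinct if not any(dominates(v, u) for v in distinct)]


# Hypothetical three-component rewards: several incomparable trade-offs survive.
vector_rewards = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0)]

# A single binary reward: every candidate is comparable to every other one.
binary_rewards = [(1,), (0,), (1,), (0,)]

front_vector = pareto_front(vector_rewards)
front_binary = pareto_front(binary_rewards)
print(front_vector)
print(front_binary)
```

With three components the front keeps three mutually incomparable vectors. With one binary component it collapses to a single point, so there is nothing for solutions to specialize along.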
— "Vector Policy Optimization: Training for Diversity Improves Test-Time Search", https://arxiv.org/abs/2605.22817
Related concepts in this collection
- How should multiple reward objectives be weighted during training?
When training on multiple objectives at once, how can we automatically balance their contributions without manual tuning? This explores whether reward variance within rollouts reveals which objectives carry real learning signal.
the dual move on the same vector reward: DVAO balances components for stability while VPO spreads solutions across components for coverage
- Can diversity optimization improve quality during language model training?
Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?
alternative diversity source (semantic, in output space) versus VPO's reward-component source; both refute the diversity-costs-quality assumption
- Do critique models improve diversity during training itself?
Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
another mechanism for sustaining diverse competent candidates, via critique rather than reward decomposition
- Does outcome-based RL diversity loss spread across unsolved problems?
When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?
diagnoses the diversity-loss failure that vector rewards are one structural antidote to
Original note title
vector-valued rewards give a natural diversity axis by letting solutions specialize along different reward dimensions