Reasoning and Learning Architectures

Can reward vectors be the hidden source of solution diversity?

Standard RL collapses multi-dimensional rewards into scalars before training, losing the natural structure that could drive diverse specialization. What if that vector structure itself is the diversity axis?

Note · 2026-05-28 · sourced from Reinforcement Learning

Diversity objectives in RL often feel arbitrary: you bolt on an entropy bonus or a novelty penalty and hope it spreads the policy without wrecking quality. Vector Policy Optimization (VPO) observes that the diversity axis is frequently already present in the reward structure and simply gets thrown away. Rewards are vector-valued in practice: per-test-case correctness in code generation, per-criterion ratings in RLHF, per-sub-question success in multi-hop reasoning, or scores from multiple user personas or reward models. Standard pipelines scalarize this vector into one number before computing the advantage, discarding the component structure.
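To make the loss concrete, here is a minimal sketch with made-up per-test-case rewards. The candidate names, the four test columns, and the choice of a uniform mean with a group-mean baseline are illustrative assumptions, not details taken from the VPO paper.

```python
from statistics import mean

# Hypothetical per-test-case rewards (1 = pass, 0 = fail) for three sampled programs.
# Columns (assumed): [common_case_1, common_case_2, edge_case_1, edge_case_2]
reward_vectors = {
    "common_path": [1, 1, 0, 0],
    "edge_cases": [0, 0, 1, 1],
    "mixed": [1, 0, 1, 0],
}


def scalarize_mean(vec):
    """Uniform-average scalarization, one common way to collapse a reward vector."""
    return mean(vec)


# Collapse each vector to a single number, then compute a group-relative advantage
# (scalar reward minus the group mean), as a scalar pipeline might.
scalars = {name: scalarize_mean(vec) for name, vec in reward_vectors.items()}
baseline = mean(list(scalars.values()))
advantages = {name: s - baseline for name, s in scalars.items()}

for name in reward_vectors:
    print(name, reward_vectors[name], scalars[name], advantages[name])
```

All three candidates collapse to the same scalar and receive zero advantage, even though they pass different subsets of the tests. The scalar signal cannot tell them apart, while the vectors can.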

The pattern: keep the vector, and use its components as the dimensions along which solutions specialize. Rather than collapsing onto a single Pareto point, VPO combines multi-answer generation with stochastic reward scalarizations, training the model to emit a set of candidates that spans the Pareto frontier: one solution that nails edge-case tests, another that optimizes the common path, another that trades correctness for brevity. The diversity is grounded in real trade-offs the task already encodes rather than imposed by an external regularizer, which is why it produces competent diversity rather than noise.
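The note does not say how the scalarizations are sampled, so the following is only a sketch of the general idea. It assumes weights drawn uniformly from the simplex (a Dirichlet(1, ..., 1) draw) and hypothetical candidate reward vectors. It illustrates why random scalarizations favor different trade-off points; it is not the paper's training procedure.

```python
import random


def sample_simplex_weights(dim, rng):
    """Draw weights uniformly from the simplex via normalized Gamma(1, 1) draws.

    The Dirichlet choice is an assumption made for illustration.
    """
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vec, weights):
    """Weighted sum of reward components under one sampled weight vector."""
    return sum(w * r for w, r in zip(weights, reward_vec))


def winners_under_random_scalarizations(candidates, num_draws, seed=0):
    """Return the set of candidates that rank first under at least one sampled weighting."""
    rng = random.Random(seed)
    dim = len(next(iter(candidates.values())))
    winners = set()
    for _ in range(num_draws):
        weights = sample_simplex_weights(dim, rng)
        best = max(candidates, key=lambda name: scalarize(candidates[name], weights))
        winners.add(best)
    return winners


# Hypothetical reward vectors: [edge_case_score, common_path_score, brevity_score]
candidates = {
    "edge_case_specialist": [1.0, 0.4, 0.3],
    "common_path_specialist": [0.4, 1.0, 0.3],
    "brevity_specialist": [0.5, 0.5, 1.0],
    "dominated": [0.3, 0.3, 0.2],
}

winners = winners_under_random_scalarizations(candidates, num_draws=200)
print(sorted(winners))
```

Because the weights are strictly positive, the candidate that is worse on every component never wins, while each specialist wins under the weightings that emphasize its own dimension. A fixed scalarization would reward only one of them.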

Why it matters: it reframes the question "where does diversity come from?" The answer is that the multi-objective structure of the reward is the diversity structure, latent until you stop scalarizing. This connects diversity-for-search to the broader multi-objective RL problem: given the same vector reward, one method (DVAO) balances the components for stability, while VPO spreads solutions across them to cover the frontier. The counterpoint is that not every task has a meaningful reward vector: single-answer verifiable tasks with one binary reward offer no natural axis to specialize along.
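The counterpoint can be checked with a small Pareto-front computation. This is a sketch under assumed data: the helper names and both candidate sets are hypothetical, chosen only to contrast a vector reward with a single binary reward.

```python
def dominates(a, b):
    """True if a is at least as good as b on every component and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Names of candidates whose reward vector is not dominated by any other candidate."""
    return {
        name
        for name, vec in candidates.items()
        if not any(dominates(other, vec) for other in candidates.values())
    }


# Hypothetical vector rewards: [edge_case_score, common_path_score, brevity_score]
vector_task = {
    "edge_case_specialist": [1.0, 0.4, 0.3],
    "common_path_specialist": [0.4, 1.0, 0.3],
    "brevity_specialist": [0.5, 0.5, 1.0],
    "dominated": [0.3, 0.3, 0.2],
}

# Hypothetical single binary reward: 1 = verified correct, 0 = incorrect
binary_task = {
    "answer_a": [1],
    "answer_b": [1],
    "answer_c": [0],
}

vector_front = pareto_front(vector_task)
binary_front = pareto_front(binary_task)

# Distinct reward profiles on each front: several for the vector task, one for the binary task.
vector_profiles = {tuple(vector_task[name]) for name in vector_front}
binary_profiles = {tuple(binary_task[name]) for name in binary_front}
print(len(vector_profiles), len(binary_profiles))
```

With the vector reward, the front holds three distinct trade-offs. With the binary reward, every front member has the same reward, so there is no trade-off for solutions to specialize along.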


— "Vector Policy Optimization: Training for Diversity Improves Test-Time Search", https://arxiv.org/abs/2605.22817

Original note title

vector-valued rewards give a natural diversity axis by letting solutions specialize along different reward dimensions