Reasoning and Learning Architectures

How should multiple reward objectives be weighted during training?

When training on multiple objectives at once, how can we automatically balance their contributions without manual tuning? This note explores whether reward variance within a rollout group reveals which objectives carry real learning signal.

Note · 2026-05-28 · sourced from Reinforcement Learning

When a single GRPO run optimizes several rewards at once (accuracy plus length plus format, say), the two standard approaches both fall short. Reward Combination sums the rewards before computing the advantage, which lets squared advantage magnitudes explode and destabilizes training. Advantage Combination computes an advantage per objective and then mixes them, but it relies on fixed hyperparameter weights and treats the objectives as independent, ignoring how they correlate within a rollout.
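The two baselines can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the usual GRPO group standardization (subtract the group mean, divide by the group standard deviation), and the function names, the toy reward matrix, and the fixed weights are all hypothetical.

```python
import numpy as np

def group_normalize(x, eps=1e-8):
    """Standardize across the rollout axis (axis 0), GRPO-style."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def reward_combination(rewards):
    """Sum the objectives per rollout first, then compute one advantage."""
    total = rewards.sum(axis=1)            # shape (G,)
    return group_normalize(total)

def advantage_combination(rewards, weights):
    """Compute one advantage per objective, then mix with fixed weights."""
    per_objective = group_normalize(rewards)   # shape (G, K)
    return per_objective @ np.asarray(weights, dtype=float)

# Toy group: G = 4 rollouts, K = 3 objectives (accuracy, length, format).
rewards = np.array([
    [1.0, 0.2, 1.0],
    [0.0, 0.8, 1.0],
    [1.0, 0.5, 1.0],
    [0.0, 0.1, 1.0],
])
fixed_weights = [0.6, 0.3, 0.1]   # hand-tuned constants (hypothetical values)

adv_rc = reward_combination(rewards)
adv_ac = advantage_combination(rewards, fixed_weights)
```

Note that `fixed_weights` is exactly the kind of manually chosen hyperparameter that the note's question is trying to eliminate.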

DVAO's claim is that the right weighting signal is already sitting in the data: the empirical reward variance of each objective within a rollout group. High variance means the group's responses spread out on that objective — there is a gradient to learn from. Low variance means the objective is either saturated or noise, so its contribution should shrink. Weighting by within-group variance therefore up-weights objectives carrying a real learning signal and down-weights the rest, automatically and without tuned constants.
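One way to picture the idea is the sketch below. It is an illustration under stated assumptions, not DVAO's actual formula: it assumes all rewards live on comparable scales (for example in [0, 1]), uses the raw within-group variance as the weight, and normalizes the weights to sum to one. The function name and the toy data are hypothetical, and the cross-objective regularizer is not modeled.

```python
import numpy as np

def variance_weighted_advantage(rewards, eps=1e-8):
    """Mix per-objective advantages using within-group variance as weights.

    rewards: array of shape (G, K), G rollouts in one group, K objectives.
    Returns (advantage of shape (G,), weights of shape (K,)).
    """
    rewards = np.asarray(rewards, dtype=float)
    var = rewards.var(axis=0)                       # per-objective variance
    centered = rewards - rewards.mean(axis=0)
    per_objective = centered / (np.sqrt(var) + eps) # standardized advantages

    total = var.sum()
    if total > 0:
        weights = var / total                       # assumed normalization
    else:
        weights = np.full(var.shape, 1.0 / var.size)  # no signal anywhere
    return per_objective @ weights, weights

# Toy group: accuracy varies, length varies a little, format is saturated.
rewards = np.array([
    [1.0, 0.2, 1.0],
    [0.0, 0.8, 1.0],
    [1.0, 0.5, 1.0],
    [0.0, 0.1, 1.0],
])
adv, weights = variance_weighted_advantage(rewards)
# weights is roughly [0.77, 0.23, 0.0]: the saturated format objective drops out.
```

In this sketch the weights are nonnegative and sum to one, so the combined advantage is a convex combination of the standardized per-objective advantages and cannot exceed the largest of them in magnitude. That is a property of this particular normalization, not a restatement of the paper's boundedness proof.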

Why it matters: this reframes multi-objective scalarization as an estimation problem rather than a preference-setting problem. You are no longer asking "how much do I value format versus accuracy?" but "which objective currently has signal worth following?" The paper proves the scheme keeps advantage magnitudes bounded (the stability win) and folds in a cross-objective regularizer so each objective's gradient is modulated by the rollout's overall multi-objective performance. The counterpoint is that variance can be a misleading proxy — a noisy reward model also produces high variance — so the method presumes reward signals are clean enough that spread tracks learnability.
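The counterpoint is easy to reproduce in a toy setting. The sketch below assumes the same simplified variance-proportional weighting (an assumption, not the paper's exact rule) and compares a format reward that is truly saturated against the same reward scored by a hypothetical noisy reward model. The names, the noise level, and the data are illustrative only.

```python
import numpy as np

def variance_weights(rewards):
    """Weights proportional to within-group variance (assumed rule)."""
    var = np.asarray(rewards, dtype=float).var(axis=0)
    total = var.sum()
    if total > 0:
        return var / total
    return np.full(var.shape, 1.0 / var.size)

rng = np.random.default_rng(0)
G = 8
accuracy = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=float)

# Every response is equally well formatted, so the true reward is constant.
format_true = np.ones(G)
# A noisy reward model scores the same responses with random error.
format_noisy = format_true + rng.normal(0.0, 0.5, size=G)

w_clean = variance_weights(np.column_stack([accuracy, format_true]))
w_noisy = variance_weights(np.column_stack([accuracy, format_noisy]))
# w_clean puts all weight on accuracy; w_noisy gives the format objective
# a nonzero weight even though there is nothing to learn from it.
```

The responses are identical in true format quality in both cases, yet the noisy scorer earns the format objective a share of the weight, which is why the method depends on reward spread actually tracking learnability.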


— "DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning", https://arxiv.org/abs/2605.25604

Original note title

multi-reward grpo should weight each objective by its empirical reward variance