INQUIRING LINE

Can dynamic variance weighting replace fixed objective combination weights?

This explores whether you can stop hand-tuning the fixed weights that combine multiple training objectives, and instead let each objective's measured variance decide how much it counts — and what the corpus says about variance as a self-tuning signal more broadly.


The corpus has a direct answer on replacing hand-picked scalarization constants, and a surprisingly rich set of neighbors. The cleanest yes comes from DVAO (How should multiple reward objectives be weighted during training?), which weights each objective by its within-group reward variance across rollouts, automatically turning up high-signal objectives and damping noisy ones, with no tuning knobs. Crucially, it isn't just a convenience: variance weighting keeps advantage magnitudes bounded, which guards against the actual failure mode of fixed constants, where one objective's scale quietly dominates.
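To make the idea concrete, here is a minimal sketch of variance-derived combination weights. It is not DVAO's actual formula, which the notes here do not spell out. It assumes each objective's reward is already scaled to [0, 1], that weights are simply proportional to within-group variance, and that each objective is standardized within the group before mixing. All function names are hypothetical.

```python
from statistics import fmean, pstdev, pvariance

def standardize(rewards, eps=1e-8):
    """Group-relative advantage: z-score of each rollout's reward within its group."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def variance_weights(group_rewards, eps=1e-12):
    """Assumed rule: weight each objective in proportion to its within-group variance."""
    variances = {name: pvariance(rs) for name, rs in group_rewards.items()}
    total = sum(variances.values())
    if total < eps:
        # No objective separates the rollouts, so fall back to uniform weights.
        return {name: 1.0 / len(variances) for name in variances}
    return {name: v / total for name, v in variances.items()}

def combined_advantages(group_rewards):
    """Mix per-objective standardized advantages using the variance-derived weights."""
    weights = variance_weights(group_rewards)
    per_objective = {name: standardize(rs) for name, rs in group_rewards.items()}
    n_rollouts = len(next(iter(group_rewards.values())))
    return [
        sum(weights[name] * per_objective[name][i] for name in group_rewards)
        for i in range(n_rollouts)
    ]

# One group of four rollouts scored on three hypothetical objectives.
group = {
    "correctness": [1.0, 0.0, 1.0, 0.0],  # splits the group: high variance
    "format": [1.0, 1.0, 1.0, 1.0],       # identical everywhere: zero variance
    "style": [0.6, 0.5, 0.7, 0.6],        # barely separates the rollouts
}
weights = variance_weights(group)
advantages = combined_advantages(group)
```

Under these assumptions the weights sum to one and each standardized advantage is at most sqrt(n - 1) in magnitude for a group of n rollouts, so the combined advantage stays within that same bound regardless of how many objectives are mixed.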

What makes this more than a one-paper trick is that variance keeps showing up as a statistic that does double duty. DRO (Can one statistical measure serve dual purposes in RL training?) reuses a single cross-rollout variance measure at two levels at once, weighting tokens for dense rewards and filtering out degenerate queries, and gets 2–3× faster, more stable training on unverifiable tasks. So the deeper claim isn't merely "variance can replace a constant," it's that variance is a free, self-supervised signal already sitting in your rollouts, and fixed weights throw that information away.
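The dual use described above can be sketched as one statistic feeding two decisions. This is an illustration under stated assumptions, not DRO's implementation: it assumes a hypothetical per-token score that exists at aligned positions for every rollout, and the filtering threshold is an arbitrary placeholder.

```python
from statistics import pvariance

def token_variances(scores):
    """scores[r][t] is the score at aligned position t under rollout r.
    Returns the cross-rollout variance for each position."""
    return [pvariance(column) for column in zip(*scores)]

def token_weights(variances):
    """Token level: positions whose score varies across rollouts get more weight."""
    total = sum(variances)
    if total == 0:
        return [0.0] * len(variances)
    return [v / total for v in variances]

def keep_query(variances, min_mean_variance=1e-4):
    """Query level: drop queries whose rollouts are indistinguishable.
    The threshold is a hypothetical value chosen for illustration."""
    return sum(variances) / len(variances) >= min_mean_variance

# Three rollouts, three aligned positions (made-up scores).
informative = [
    [0.9, 0.2, 0.8],
    [0.9, 0.7, 0.1],
    [0.9, 0.4, 0.5],
]
degenerate = [[0.5, 0.5, 0.5] for _ in range(3)]

informative_var = token_variances(informative)
degenerate_var = token_variances(degenerate)
weights = token_weights(informative_var)
```

Position 0 scores the same under every rollout, so it receives zero weight, while the degenerate query is filtered out entirely. Both decisions read the same variance vector.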

The corpus also tells you *why* a fixed combination weight is risky in the first place. Binary-reward training provably wrecks calibration unless you bolt on a second term, and the fix there is adding the Brier score as a co-objective (Does binary reward training hurt model calibration?), which immediately raises the question of how to weight two objectives that pull differently. Variance weighting is one principled answer to exactly that combination problem. The cautionary flip side is utility-weighted loss (Can utility-weighted training loss actually harm model performance?): it can sharpen decisions while quietly starving representation learning, a reminder that a weighting scheme optimizing one thing can silently degrade another. An adaptive weight needs to be measuring the right signal, not just *a* signal.
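A small sketch shows why the Brier term changes the incentive. It assumes the reward is the binary correctness indicator minus the squared gap between stated confidence and that indicator, with equal weight on both terms. That equal weighting is an assumption made here for illustration, and it is exactly the kind of combination constant under discussion. The function names are hypothetical.

```python
def calibrated_reward(correct, confidence):
    """Binary correctness minus a Brier penalty on the stated confidence."""
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2

def expected_reward(p_correct, confidence):
    """Expected reward when the answer is right with probability p_correct."""
    return (
        p_correct * calibrated_reward(True, confidence)
        + (1.0 - p_correct) * calibrated_reward(False, confidence)
    )

# Scan stated confidences from 0.00 to 1.00 for a model that is right 70% of the time.
grid = [i / 100 for i in range(101)]
best_confidence = max(grid, key=lambda q: expected_reward(0.7, q))
```

With the penalty in place a confidently wrong answer scores -1 while an honestly uncertain wrong answer scores higher, and the expected reward peaks when the stated confidence matches the true success rate.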

The limit worth noticing: variance is only a good proxy for "signal" when the underlying reward is trustworthy. The RLVR failure notes show how easily that breaks. Overly hard samples make rare accidental successes look like high-advantage trajectories under group-relative normalization (Do overly hard RLVR samples actually harm model capabilities?), and outcome-only rewards collapse diversity in ways that transfer to unsolved problems (Does outcome-based RL diversity loss spread across unsolved problems?). In those regimes high variance can mark signal you should up-weight or garbage you should *down*-weight, and the statistic can't tell the difference on its own.
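The rare-success effect is easy to reproduce with made-up numbers. The sketch below assumes binary rewards and the usual group-relative normalization (subtract the group mean, divide by the group standard deviation); the group size of 16 is arbitrary.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-relative normalization: z-score of each reward within its group."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A problem of moderate difficulty: half of the 16 rollouts succeed.
balanced = [1.0] * 8 + [0.0] * 8
# A nearly impossible problem: a single rollout succeeds, possibly by accident.
lucky = [1.0] + [0.0] * 15

balanced_peak = max(group_advantages(balanced))  # close to 1
lucky_peak = max(group_advantages(lucky))        # close to sqrt(15), about 3.87
```

The lone success on the hard problem receives nearly four times the advantage of a success on the balanced one, and nothing in the reward vector says whether it came from sound reasoning or a lucky guess.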

So the honest synthesis: dynamic variance weighting can replace fixed combination weights, and the corpus suggests it usually *should* when objectives differ in scale and signal quality, because it removes a brittle hyperparameter and keeps advantages bounded. But it inherits the trustworthiness of whatever it's measuring. Pair it with query filtering or a proper-scoring co-objective and it's a real upgrade; deploy it over a corrupted reward and it just automates the corruption faster.


Sources (6 notes)

How should multiple reward objectives be weighted during training?

DVAO weights objectives by their within-group variance, automatically up-weighting high-signal objectives and suppressing noise without hyperparameter tuning. This keeps advantage magnitudes bounded and replaces fixed scalarization constants with data-driven weighting.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can utility-weighted training loss actually harm model performance?

Asymmetric loss functions correctly incentivize the utility-optimal choice but degrade representation learning by weakening the gradient signal for acquiring substantive features. Training with symmetric loss and then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.
