What makes utility-weighted training backfire in machine learning systems?

This explores why training a model to optimize directly for the outcome you care about — weighting the loss toward high-stakes decisions, correct answers, or high-quality data — can quietly make the model worse at the very thing it's learning.

The corpus points to a recurring mechanism: utility weighting collapses the rich signal a model needs to *learn* down to the narrow signal it needs to *decide*, and the two are not the same job.

The cleanest statement of this is the finding that asymmetric, utility-weighted loss functions strengthen decision-making while actively weakening representation learning: by shrinking the gradient signal for acquiring substantive features, they make the model a better chooser on top of a thinner understanding. The striking fix is to decouple the two: train with a plain symmetric loss, then adjust predictions for utility *afterward*, which beats baking utility into training even when both are scored on the same utility objective (see "Can utility-weighted training loss actually harm model performance?"). The same shape shows up in reinforcement learning with binary correctness rewards: because a binary reward penalizes a confident wrong answer no more than a hesitant one, it teaches the model to guess loudly, wrecking calibration. The cited remedy is a second reward term, the Brier score, which restores the penalty the utility signal stripped out (see "Does binary reward training hurt model calibration?").
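Both fixes in the paragraph above can be sketched in a few lines. This is a minimal illustration, not the method of the cited notes: the cost-based threshold is one standard way to apply utility after training, and the reward form (correctness minus squared confidence error) is an assumed way of adding a Brier term. All function names and the example costs are hypothetical.

```python
def decision_threshold(cost_false_pos: float, cost_false_neg: float) -> float:
    """Probability above which predicting positive has the lower expected cost."""
    return cost_false_pos / (cost_false_pos + cost_false_neg)


def decide(probs, cost_false_pos: float, cost_false_neg: float):
    """Apply utility after training; the model's probabilities stay untouched."""
    threshold = decision_threshold(cost_false_pos, cost_false_neg)
    return [p >= threshold for p in probs]


def binary_reward(correct: bool) -> float:
    """Plain correctness reward: blind to how confident a wrong answer was."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus the squared error of the stated confidence."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


# Assumed costs: a missed positive is 9x as costly as a false alarm,
# so the decision threshold drops to 0.1 while training stays symmetric.
decisions = decide([0.05, 0.2, 0.6], cost_false_pos=1.0, cost_false_neg=9.0)

# A wrong answer earns 0 under the binary reward at any confidence,
# but the Brier term makes loud wrong guesses strictly worse.
loud_wrong = brier_augmented_reward(False, 0.99)
hedged_wrong = brier_augmented_reward(False, 0.5)
```

The design point is that the utility asymmetry lives in `decision_threshold`, outside the loss, so changing the costs later requires no retraining.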

The backfire gets worse when the weighting amplifies rare, lucky events. Training on near-impossible problems in reinforcement learning with verifiable rewards (RLVR) sounds like high-value practice, but group-relative normalization treats a stray accidental success as a high-advantage trajectory and reinforces it, so the model learns answer-repetition and computation-skipping shortcuts that then contaminate skills it already had (see "Do overly hard RLVR samples actually harm model capabilities?"). A subtler cousin: positive-only reinforcement concentrates probability mass onto whatever currently scores well, degrading diversity and Pass@k at higher k, whereas suppressing the wrong answers (negative reinforcement) preserves the spread and often matches full PPO and GRPO (see "Does negative reinforcement alone outperform full reinforcement learning?"). Weighting toward the winner, it turns out, is exactly what narrows the model.
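To see why a single lucky success gets outsized weight, here is a small sketch of group-relative normalization. It assumes the common GRPO-style form (reward minus group mean, divided by group standard deviation); the group size, rewards, and the `eps` constant are illustrative choices, not values from the cited note.

```python
import statistics


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Normalize each reward against its own sampling group (assumed GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Near-impossible problem: 1 accidental success in 16 samples.
hard_group = [1.0] + [0.0] * 15
# Moderately hard problem: 8 successes in 16 samples.
medium_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)

# The lone success on the hard problem receives a far larger advantage
# than any success on the medium problem.
print(round(hard_adv[0], 2), round(medium_adv[0], 2))
```

The rarer the success, the smaller the group's standard deviation and the larger the normalized advantage, so whatever shortcut produced the lucky trajectory is reinforced hardest exactly where it is least likely to reflect sound reasoning.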

Utility weighting also backfires when 'high utility' is judged from the outside rather than relative to the learner. Teacher-refined instruction data that is objectively higher quality still *degrades* a student model when it sits beyond the student's learning frontier. The fix is to let the student filter for compatibility with its own profile rather than swallow everything labeled good (see "Does teacher-refined data always improve student model performance?"). And when the utility signal is read off the system's own past behavior, it closes a loop: YouTube's multi-objective ranker has to explicitly model selection bias, because without it the model converges on degenerate equilibria that just amplify its own prior decisions (see "Why do ranking systems need to model selection bias explicitly?").
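The selection-bias loop can be illustrated with a toy sketch. It assumes the bias is modeled as a separate additive logit that depends only on display position, which is added during training and dropped at serving time; the bias values and function names are hypothetical and this is not a reproduction of the production system.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical learned bias logits for display positions 0..3
# (the top slot gets clicked more regardless of relevance).
POSITION_BIAS_LOGITS = [1.5, 0.8, 0.3, 0.0]


def training_click_prob(relevance_logit: float, position: int) -> float:
    """During training, logged clicks are explained by relevance plus position bias."""
    return sigmoid(relevance_logit + POSITION_BIAS_LOGITS[position])


def serving_score(relevance_logit: float) -> float:
    """At serving time the position term is dropped, leaving relevance alone."""
    return sigmoid(relevance_logit)


same_relevance = 0.2
p_top = training_click_prob(same_relevance, position=0)
p_low = training_click_prob(same_relevance, position=3)
```

Two items with identical relevance produce different logged click rates purely because of where they were shown. If the ranker absorbed that gap into the relevance estimate, it would keep promoting whatever it had already placed on top.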

The through-line the reader may not have expected is that the most reliable cure across all these cases is *not* a better utility weight but a structural separation: keep the learning signal honest and apply the utility pressure somewhere else. Decode-time proxy tuning preserves pretrained knowledge precisely by never touching the weights that store it (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"), and difficulty-based data pruning shows weighting *can* help, but only when it removes redundancy rather than chasing the objective directly (see "Can we prune training data without hurting model performance?"). Utility weighting backfires when you let the thing you want to optimize stand in for the thing you need to learn.
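As a rough sketch of applying the pressure somewhere else, decode-time proxy tuning can be written as logit arithmetic: the large base model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart. The function names and the toy three-token vocabulary are assumptions for illustration only.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_logits, tuned_small_logits, untuned_small_logits):
    """Shift the base model's logits by the small models' difference; no weights change."""
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)


# Toy next-token logits over a 3-token vocabulary (illustrative values).
base = [2.0, 1.0, 0.0]
tuned_small = [0.0, 2.0, 0.0]
untuned_small = [0.0, 0.0, 0.0]

probs = proxy_tuned_distribution(base, tuned_small, untuned_small)
top_token = max(range(len(probs)), key=lambda i: probs[i])
```

When the tuned and untuned small models agree, the offset is zero and the base distribution passes through unchanged, which is the sense in which the stored knowledge is never touched.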

Sources (8 notes)

Can utility-weighted training loss actually harm model performance?

Asymmetric loss functions correctly incentivize choosing but degrade representation learning by reducing gradient signals for substantive feature acquisition. Training with symmetric loss then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Why do ranking systems need to model selection bias explicitly?

YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can we prune training data without hurting model performance?

Research shows that ranking training examples by difficulty (EL2N, forgetting, memorization) and removing easy ones beats power-law scaling laws. On CIFAR-10, 50% of data was pruned without accuracy loss, and self-supervised metrics scaled the approach to ImageNet.
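A minimal sketch of difficulty-based pruning using an EL2N-style score, here taken as the L2 norm of the gap between the predicted class probabilities and the one-hot label. Scoring from a single set of logits is a simplification, and the helper names, toy logits, and 50% keep fraction are illustrative assumptions.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def el2n_score(logits, label: int) -> float:
    """L2 norm of (predicted probabilities - one-hot label); higher means harder."""
    probs = softmax(logits)
    return math.sqrt(
        sum((p - (1.0 if i == label else 0.0)) ** 2 for i, p in enumerate(probs))
    )


def keep_hardest(scores, keep_fraction: float):
    """Return indices of the hardest examples, dropping the easy remainder."""
    n_keep = max(1, round(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy binary-classification logits: two easy examples, one uncertain, one misclassified.
example_logits = [[5.0, 0.0], [0.0, 5.0], [0.2, 0.0], [0.0, 3.0]]
example_labels = [0, 1, 0, 0]

scores = [el2n_score(z, y) for z, y in zip(example_logits, example_labels)]
kept = keep_hardest(scores, keep_fraction=0.5)
```

The pruning here removes examples the model already fits confidently, which is the redundancy-removal sense in which weighting by difficulty can help.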
