Can utility-weighted training loss actually harm model performance?
When engineers weight loss functions to reflect the real-world costs of different errors, does this improve or undermine learning? This note explores whether baking asymmetric objectives into the training loss creates unintended side effects.
"Misaligned by Design" identifies a failure in what the authors call the Aligned Learning Premise (ALP): the intuition that using the human's utility function to train a model produces better performance in terms of that objective. In high-stakes settings where false positives and false negatives have asymmetric costs (e.g., medical diagnosis), engineers routinely bake these asymmetric weights into the training loss. This paper shows this can backfire.
The key insight: machine classifiers perform not one but two incentivized tasks.
1. Choosing how to classify: given learned features, assign a label. Here asymmetric weighting works correctly.
2. Learning how to classify: acquiring informative feature representations through gradient descent. Here asymmetric weighting can weaken the learning signal.
Because the loss function shapes the gradient, it necessarily shapes the incentives for learning. Making the loss asymmetric can reduce the payoff to "substantive learning": the model acquires less informative representations.
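To see the gradient claim concretely, here is a minimal sketch in plain PyTorch (the cost numbers and the function name are illustrative, not taken from the paper) of the kind of utility-weighted loss the paper critiques:

```python
import torch
import torch.nn.functional as F

# Hypothetical costs: a false negative is 5x as costly as a false positive.
C_FN, C_FP = 5.0, 1.0

def utility_weighted_bce(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy with per-example weights derived from error costs.

    The weight multiplies each example's loss, so it also multiplies that
    example's gradient: the same factor that correctly shifts the choosing
    incentive rescales the learning signal for every parameter.
    """
    weights = torch.where(labels > 0.5,
                          torch.full_like(labels, C_FN),
                          torch.full_like(labels, C_FP))
    per_example = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    return (weights * per_example).mean()
```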
In both of the paper's focal applications, training with a standard symmetric loss and then adjusting predictions ex post according to the human's utility function outperforms training with the utility-weighted loss directly, even when performance is evaluated by the utility-weighted objective itself. Baking utility weights into training makes predictions worse.
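The prescription, by contrast, keeps training symmetric and moves the utility weights into the decision rule. A minimal sketch, assuming the standard Bayes decision threshold for asymmetric costs (the helper name is hypothetical):

```python
def decide(p_positive: float, c_fn: float = 5.0, c_fp: float = 1.0) -> bool:
    """Ex-post decision rule applied to a symmetrically trained model.

    Predict positive whenever the expected cost of a miss, p * c_fn,
    exceeds the expected cost of a false alarm, (1 - p) * c_fp,
    i.e. whenever p exceeds c_fp / (c_fp + c_fn).
    """
    return p_positive > c_fp / (c_fp + c_fn)
```

With c_fn = 5 and c_fp = 1 the threshold drops to 1/6: the model still learns from the full symmetric gradient, and the asymmetry enters only at decision time.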
This resonates with findings across the LLM training literature. "Do reward models actually consider what the prompt asks?" shows that reward models which should evaluate answer quality in the context of the question actually ignore it, an incentive misalignment between what the loss teaches and what the evaluation requires. "Does supervised fine-tuning actually improve reasoning quality?" shows that SFT optimizing for final-answer accuracy inadvertently degrades reasoning quality: the loss correctly incentivizes choosing the right answer but weakens the incentive to learn informative reasoning paths.
The general principle: when a training objective conflates two functions (learning representations and making decisions), optimizing one can degrade the other. Separating them, learn first and then decide, may be superior even though it seems less elegant. The "Does binary reward training hurt model calibration?" finding is a direct instance: binary reward correctly incentivizes choosing (pick the right answer) but fails to incentivize learning calibrated confidence, and the Brier-score fix explicitly separates these two objectives within the reward function.
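For the calibration case, the separation can be read directly off the reward shapes. A sketch under the assumption that the model reports a confidence alongside its answer (the exact Brier-based reward used in that work may differ):

```python
def binary_reward(correct: bool) -> float:
    # Rewards choosing only: any confidence, high or low, earns the same.
    return 1.0 if correct else 0.0

def brier_reward(confidence: float, correct: bool) -> float:
    # A proper scoring rule: expected reward is maximized only when the
    # stated confidence equals the true probability of being correct,
    # so it incentivizes learning calibration as well as choosing.
    outcome = 1.0 if correct else 0.0
    return 1.0 - (confidence - outcome) ** 2
```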
Source: Training, Fine-Tuning
Related concepts in this collection
- Do reward models actually consider what the prompt asks?
  Exploring whether standard reward models evaluate responses based on prompt context or on response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want.
  parallel: reward models conflate prompt-free and prompt-related evaluation, degrading learning
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  the SFT accuracy objective weakens the incentive to learn informative reasoning
- Does preference optimization damage conversational grounding in large language models?
  Exploring whether RLHF and preference optimization actively reduce the communicative acts (clarifications, acknowledgments, confirmations) that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
  RLHF objective (helpfulness) weakens conversational grounding: the same learning/choosing conflation
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  reasoning objective weakens abstention learning
- Why do accurate predictions lead to poor decisions?
  Predictive models are built to fit data, not to optimize decision outcomes. This note explores when and why accurate forecasts fail to produce good choices.
  the formal framework for the gap: models optimized for prediction produce suboptimal decisions because the loss function conflates learning and choosing; the asymmetric-loss finding provides the mechanism (the loss shapes the gradient for learning and choosing simultaneously)
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  specific instance of the learning/choosing conflation: binary reward correctly incentivizes choosing (pick the right answer) but fails to incentivize learning calibrated confidence; the Brier-score fix separates these incentives, echoing the "train standard, then adjust ex post" prescription
- Does learning to reward hack cause emergent misalignment in agents?
  When RL agents learn reward hacking strategies in production environments, do they spontaneously develop misaligned behaviors like alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.
  the extreme downstream consequence: when the learning/choosing conflation is not detected, RL training can produce reward hacking that generalizes to emergent misalignment; "misaligned by design" (this note's framework) describes the structural vulnerability, while emergent misalignment demonstrates the catastrophic behavioral outcome when that vulnerability is exploited at scale
Original note title
asymmetric loss functions can misalign machine learning because learning and choosing are distinct incentivized tasks — utility-weighted training can backfire