Reinforcement Learning for LLMs

Can utility-weighted training loss actually harm model performance?

When engineers weight loss functions to reflect the real-world costs of different errors, does this improve or undermine learning? This note explores whether baking asymmetric objectives into training creates unintended side effects.

Note · 2026-02-22 · sourced from Training Fine Tuning

"Misaligned by Design" identifies a failure in what the authors call the Aligned Learning Premise (ALP): the intuition that using the human's utility function to train a model produces better performance in terms of that objective. In high-stakes settings where false positives and false negatives have asymmetric costs (e.g., medical diagnosis), engineers routinely bake these asymmetric weights into the training loss. This paper shows this can backfire.

The key insight: machine classifiers perform not one but two incentivized tasks. Choosing how to classify (given learned features, assign a label) — here asymmetric weighting works correctly. Learning how to classify (acquiring informative feature representations through gradient descent) — here asymmetric weighting can weaken the learning signal. Because the loss function shapes the gradient, it necessarily shapes incentives for learning. Making the loss asymmetric can reduce the payoff to "substantive learning" — the model learns less informative representations.
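A minimal sketch of how the asymmetric weights typically get baked into training, assuming a binary classifier trained in PyTorch with an illustrative 5:1 cost ratio between false negatives and false positives; the cost values and function names are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

# Illustrative (assumed) costs: a missed positive (false negative) is treated
# as 5x as costly as a false alarm (false positive).
COST_FN, COST_FP = 5.0, 1.0

def symmetric_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Standard cross-entropy: every misclassified example pulls on the
    # gradient with equal force, so the incentive to learn informative
    # features is untouched by the utility function.
    return F.cross_entropy(logits, labels)

def utility_weighted_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # "Baked-in" variant: losses on true positives are up-weighted by COST_FN
    # and on true negatives by COST_FP. Because the loss shapes the gradient,
    # this also reshapes the incentive to learn informative representations.
    class_weights = torch.tensor([COST_FP, COST_FN])
    return F.cross_entropy(logits, labels, weight=class_weights)
```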

In both focal applications, training with a standard symmetric loss function and then adjusting predictions ex post according to the human's utility function outperforms training with the utility-weighted loss directly, even when performance is evaluated by the utility-weighted objective itself. Trying to bake utility weights into training makes the predictions worse.
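A companion sketch of the alternative that wins in the paper's comparisons, under the same assumed costs: train with the symmetric loss above, then apply the utility function only as a decision rule at inference time. The threshold is the standard expected-cost rule, not a formula specific to the paper.

```python
import torch

def ex_post_decision(logits: torch.Tensor,
                     cost_fp: float = 1.0,
                     cost_fn: float = 5.0) -> torch.Tensor:
    # Predict the positive class whenever the expected cost of saying "negative"
    # (cost_fn * p_pos) exceeds the expected cost of saying "positive"
    # (cost_fp * (1 - p_pos)), i.e. whenever p_pos > cost_fp / (cost_fp + cost_fn).
    p_pos = torch.softmax(logits, dim=-1)[..., 1]
    threshold = cost_fp / (cost_fp + cost_fn)
    return (p_pos > threshold).long()
```

With symmetric costs (cost_fp == cost_fn) the threshold collapses to 0.5, i.e. the ordinary argmax decision; the asymmetry only ever touches the choosing step, never the learning step.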

This resonates with findings across the LLM training literature. "Do reward models actually consider what the prompt asks?" shows that reward models which are supposed to evaluate answer quality in context often ignore the question entirely, an incentive misalignment between what the loss teaches and what the evaluation requires. "Does supervised fine-tuning actually improve reasoning quality?" shows that SFT optimized for accuracy can inadvertently degrade reasoning quality: the loss correctly incentivizes choosing the right answer but weakens the incentive to learn informative reasoning paths.

The general principle: when a training objective conflates two functions, learning representations and making decisions, optimizing one can degrade the other. Separating them (learn first, then decide) may be superior even though it seems less elegant. The "Does binary reward training hurt model calibration?" finding is a direct instance: binary reward correctly incentivizes choosing (pick the right answer) but fails to incentivize learning calibrated confidence, and the Brier score fix explicitly separates these two objectives within the reward function.
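A hedged sketch of what that separation can look like, assuming the model emits a stated confidence alongside its answer; the exact reward form used in the calibration note may differ.

```python
def binary_reward(correct: bool) -> float:
    # Rewards only the choice: a stated confidence is never scored, so there
    # is no incentive to learn calibration.
    return 1.0 if correct else 0.0

def brier_style_reward(correct: bool, stated_confidence: float) -> float:
    # Hypothetical Brier-style variant: the correctness term still rewards
    # choosing, while the proper-scoring penalty separately rewards a stated
    # confidence that tracks how often the answer is actually right.
    outcome = 1.0 if correct else 0.0
    return outcome - (stated_confidence - outcome) ** 2
```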


Source: Training Fine Tuning

Original note title: asymmetric loss functions can misalign machine learning because learning and choosing are distinct incentivized tasks; utility-weighted training can backfire