Misaligned by Design: Incentive Failures in Machine Learning

Paper · arXiv 2511.07699 · Published November 10, 2025
Training · Fine-Tuning · Alignment

The cost of error in many high-stakes settings is asymmetric: misdiagnosing pneumonia when absent is an inconvenience, but failing to detect it when present can be life-threatening. Accordingly, artificial intelligence (AI) models used to assist such decisions are frequently trained with asymmetric loss functions that incorporate human decision-makers' trade-offs between false positives and false negatives. In two focal applications, we show that this standard alignment practice can backfire. In both cases, it would be better to train the machine learning model with a loss function that ignores the human’s objective and then adjust predictions ex post according to that objective. We rationalize this result using an economic model of incentive design with endogenous information acquisition. The key insight from our theoretical framework is that machine classifiers perform not one but two incentivized tasks: choosing how to classify and learning how to classify. We show that while the adjustments engineers use correctly incentivize choosing, they can simultaneously reduce the incentives to learn. Our formal treatment of the problem reveals that methods embraced for their intuitive appeal can in fact misalign human and machine objectives in predictable ways.

Following this logic, machine learning models are frequently trained with asymmetric loss functions that codify experts’ assessed costs of false positives relative to false negatives. Implicit in these adjustments is what we term the aligned learning premise (ALP): using the human’s objective to train a machine learning model produces better performance in terms of that objective because it allows the human’s objective to inform what the machine learns.

One core design decision is which loss function to use when training the AI, and the ALP suggests that the optimal course of action is to base that decision on the human’s own utility function. Viewing this as an incentive design problem, we ask whether the human’s utility function provides the correct incentives to the machine learner. The key insight provided by our model is that machine learners are performing not one but two incentivized tasks: choosing how to classify and learning how to classify. When the machine is choosing how to classify a given X-ray, its loss function should guide it to output false positives 99 times as often as false negatives. Asymmetric weighting accomplishes this goal. But what incentives should the machine be given when learning to classify X-rays? Intuition might suggest that learning is not an incentive problem: the machine should simply learn as effectively as possible. But the mathematics of machine learning dictate otherwise. Conventional machine learning algorithms learn to map features into classes through gradient descent. Because the machine’s loss function dictates the shape of the gradient, it necessarily shapes the machine’s incentives for learning. The learning problem is therefore an incentive problem.
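To make concrete how the loss function shapes the learning gradient, here is a minimal NumPy sketch of a utility-weighted binary cross-entropy. The 99:1 weights echo the X-ray example above; the function and variable names are ours for illustration, not the paper's.

```python
# Sketch: a utility-weighted binary cross-entropy and its gradient.
# The class weights scale the learning signal itself, so an asymmetric
# loss changes not only the chosen classifications but what is learned.
import numpy as np

W_FN = 99.0  # assumed cost of a false negative (missed diagnosis)
W_FP = 1.0   # assumed cost of a false positive

def weighted_bce(p, y):
    """Utility-weighted cross-entropy for predicted probabilities p in (0,1)."""
    return -(W_FN * y * np.log(p) + W_FP * (1 - y) * np.log(1 - p))

def weighted_bce_grad(p, y):
    """Gradient with respect to p: positive examples carry a 99x-amplified
    gradient, so the weights reshape every gradient-descent step."""
    return -(W_FN * y / p) + W_FP * (1 - y) / (1 - p)

p = np.array([0.3, 0.9])
y = np.array([1.0, 0.0])
print(weighted_bce(p, y))       # per-example weighted losses
print(weighted_bce_grad(p, y))  # asymmetric learning signal
```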

We show empirically that the ALP is false in two focal applications. In both cases, one does better to first train the machine learning model with a standard loss function that ignores the human’s objective and then adjust predictions ex post according to that objective, rather than to train with a utility-weighted loss function that builds the objective in. This holds even though both loss functions are smooth and convex, so standard optimization procedures work effectively on either. In other words, baking utility weights into training makes predictions worse, even when judged by the utility-weighted objective itself.
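The ex-post alternative is straightforward to implement. The sketch below is our own illustration on synthetic data with scikit-learn, not the paper's code: train with a standard symmetric loss, then shift only the decision threshold to the standard cost-sensitive Bayes cutoff implied by the human's objective.

```python
# Sketch: train with a symmetric loss, then adjust decisions ex post.
import numpy as np
from sklearn.linear_model import LogisticRegression

C_FN, C_FP = 99.0, 1.0  # assumed misclassification costs

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

# Step 1: learn probabilities with the human's objective left out of training.
model = LogisticRegression().fit(X, y)
p = model.predict_proba(X)[:, 1]

# Step 2: adjust ex post. Predict positive whenever the expected cost of a
# negative call exceeds that of a positive call:
# p * C_FN >= (1 - p) * C_FP, i.e. p >= C_FP / (C_FP + C_FN) = 0.01 here.
threshold = C_FP / (C_FP + C_FN)
decisions = (p >= threshold).astype(int)
print(f"threshold = {threshold:.3f}, positives flagged = {decisions.mean():.2%}")
```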

Why does the human’s objective fail to correctly incentivize the machine’s learning problem? Formally, why would the human’s objective lead the machine to choose a poorly fitting information structure? We show theoretically that making a loss function asymmetric to reflect the human’s objective can backfire by weakening the machine learner’s payoff to substantive learning. By accounting for optimal ex-post adjustments in both our theoretical and empirical results, we neutralize the impact of incentives for choosing and focus attention on the incentives for learning.
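A worked step, in our own notation rather than the paper's formalism, shows one channel for this distortion: the pointwise minimizer of a utility-weighted cross-entropy is no longer the true base rate, so training on the weighted loss tilts the learned probabilities instead of leaving calibration intact.

```latex
% With false-negative weight w_1 and false-positive weight w_0, the
% minimizer of the expected weighted cross-entropy at base rate q is
\[
  p^*(q) \;=\; \arg\min_{p}\, -\bigl[w_1 q \log p + w_0 (1-q)\log(1-p)\bigr]
  \;=\; \frac{w_1 q}{w_1 q + w_0 (1-q)},
\]
% whereas the symmetric loss (w_1 = w_0) recovers p^*(q) = q and leaves
% the human's objective to the ex-post threshold.
```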