Why do alignment methods work if they model human irrationality?
DPO and PPO-Clip succeed partly by implicitly encoding human cognitive biases like loss aversion. Does modeling irrationality explain their effectiveness better than traditional preference learning theory?
Kahneman-Tversky Optimization (KTO) reveals something unexpected about why alignment methods work: DPO and PPO-Clip implicitly model the same cognitive biases that prospect theory describes in human decision-making. Humans are more sensitive to losses than gains, perceive outcomes relative to reference points, and weigh probabilities nonlinearly. These are bugs from a rational-choice perspective — but they are features from an alignment perspective, because the training signal comes from humans exhibiting exactly these biases.
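To make the asymmetry concrete, here is a minimal Python sketch of the Tversky-Kahneman value function. The parameter values (alpha ≈ 0.88, loss-aversion lambda ≈ 2.25) are the commonly cited empirical estimates from their 1992 paper, used here purely for illustration:

```python
import math

def prospect_value(outcome: float, reference: float = 0.0,
                   alpha: float = 0.88, lam: float = 2.25) -> float:
    """Tversky-Kahneman (1992) value function (illustrative sketch).

    Outcomes are evaluated relative to a reference point, and losses are
    scaled by lam > 1, so a loss hurts more than an equal gain helps.
    """
    x = outcome - reference            # gains and losses are relative, not absolute
    if x >= 0:
        return x ** alpha              # diminishing sensitivity to gains
    return -lam * (-x) ** alpha        # losses loom larger than gains

# A loss of 100 is weighted roughly 2.25x as heavily as a gain of 100:
print(prospect_value(100.0))   # ~57.5
print(prospect_value(-100.0))  # ~-129.5
```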
KTO makes this explicit by deriving a loss function directly from Kahneman and Tversky's model of human utility. Instead of maximizing the log-likelihood of preferences (as DPO does), KTO maximizes the utility of generations. The practical implication: KTO requires only binary signals (desirable or undesirable) rather than pairwise preferences. Such data is cheaper and faster to collect, and far more abundant.
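The following is a minimal PyTorch sketch of a KTO-style loss, a simplification under stated assumptions rather than the authors' reference implementation: per-example policy and reference log-probabilities are assumed precomputed, the reference point z0 is a simplified batch-level KL estimate, and beta, lambda_d, and lambda_u are illustrative hyperparameter names.

```python
import torch

def kto_loss(policy_logps: torch.Tensor,   # log pi_theta(y|x), shape (B,)
             ref_logps: torch.Tensor,      # log pi_ref(y|x), shape (B,)
             desirable: torch.Tensor,      # bool mask, shape (B,)
             beta: float = 0.1,
             lambda_d: float = 1.0,        # weight on desirable examples
             lambda_u: float = 1.0) -> torch.Tensor:
    """Binary-feedback KTO-style loss (sketch, not the reference code).

    Each example carries only a desirable/undesirable label, never a
    pairwise preference. The implied reward is the policy-to-reference
    log-ratio, and the reference point z0 mirrors prospect theory's
    reference-dependent evaluation of gains and losses.
    """
    rewards = policy_logps - ref_logps     # implied reward r(x, y)

    # Reference point: detached batch estimate of KL(pi_theta || pi_ref),
    # clamped at zero (a simplification of the paper's microbatch estimate).
    z0 = rewards.mean().clamp(min=0).detach()

    # Prospect-theoretic value: gains and losses measured against z0,
    # with separate weights for the desirable and undesirable sides.
    losses = torch.where(
        desirable,
        lambda_d * (1 - torch.sigmoid(beta * (rewards - z0))),  # gain side
        lambda_u * (1 - torch.sigmoid(beta * (z0 - rewards))),  # loss side
    )
    return losses.mean()
```

Note that no pairing of chosen and rejected completions appears anywhere: each example needs only a desirable flag, which is why binary thumbs-up/thumbs-down feedback suffices.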
The deeper insight is about alignment theory: we have been explaining alignment success in terms of reward modeling and preference learning, when part of the explanation is that the training process mirrors the structure of human cognitive bias. As "Does RLHF training make models more convincing or more correct?" argues, understanding WHY alignment methods work mechanistically matters for fixing where they fail. If alignment success depends on modeling irrationality, then "fixing" irrational aspects of the training signal may inadvertently break what works.
A practical finding reinforces this: when the pretrained model is sufficiently good, SFT can be skipped entirely before KTO without loss in generation quality. This is not true for DPO, where SFT is always needed for best results. The implication: binary utility optimization is a more natural fit for the pretrained model's structure than pairwise preference optimization.
Source: Alignment
Related concepts in this collection
- Does RLHF training make models more convincing or more correct?
  Explores whether RLHF improves actual task performance or merely trains models to sound more persuasive to human evaluators. This matters because alignment techniques could be creating the illusion of safety.
  KTO's prospect-theoretic lens explains WHY sophistry emerges: human raters weigh losses and gains asymmetrically.
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  Binary rewards interact with calibration, making KTO's binary-signal design directly relevant.
- Does preference optimization harm conversational understanding?
  Explores whether RLHF training that rewards confident, complete responses undermines the grounding acts (clarifications, checks, acknowledgments) that actually build shared understanding in dialogue.
  The alignment tax may be partly a consequence of modeling cognitive biases, including accommodation.
- Why do preference models favor surface features over substance?
  Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness, all features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
  If alignment methods model human cognitive biases, preference models amplify those biases into systematic miscalibration; the +0.36 correlation with proxy features is the downstream artifact of training on biased human signals.
Original note title
prospect theory explains why alignment methods like DPO and PPO-Clip work — they implicitly model human cognitive biases like loss aversion