Reinforcement Learning for LLMs · Psychology and Social Cognition · Language Understanding and Pragmatics

Why do alignment methods work if they model human irrationality?

DPO and PPO-Clip succeed partly by implicitly encoding human cognitive biases like loss aversion. Does modeling irrationality explain their effectiveness better than traditional preference learning theory?

Note · 2026-02-23 · sourced from Alignment
What kind of thing is an LLM really? · How should researchers navigate LLM reasoning research?

Kahneman-Tversky Optimization (KTO) reveals something unexpected about why alignment methods work: DPO and PPO-Clip implicitly model the same cognitive biases that prospect theory describes in human decision-making. Humans are more sensitive to losses than gains, perceive outcomes relative to reference points, and weigh probabilities nonlinearly. These are bugs from a rational-choice perspective — but they are features from an alignment perspective, because the training signal comes from humans exhibiting exactly these biases.
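The biases the paragraph lists have a standard formalization: the prospect theory value function, which is concave for gains, convex for losses, and steeper for losses than for gains. A minimal sketch, using the commonly cited Tversky-Kahneman (1992) parameter estimates:

```python
import math

# Prospect theory value function (Tversky & Kahneman, 1992 estimates).
# Outcomes are valued relative to a reference point; losses loom larger
# than gains (LAMBDA_ > 1), and sensitivity diminishes with distance
# from the reference point (ALPHA, BETA < 1).
ALPHA = 0.88    # diminishing sensitivity for gains
BETA = 0.88     # diminishing sensitivity for losses
LAMBDA_ = 2.25  # loss-aversion coefficient

def value(outcome: float, reference: float = 0.0) -> float:
    x = outcome - reference
    if x >= 0:
        return x ** ALPHA
    return -LAMBDA_ * (-x) ** BETA
```

A gain and a loss of equal size are not valued symmetrically: `abs(value(-10.0))` is roughly 2.25 times `value(10.0)`, which is the loss-aversion asymmetry that KTO builds into its training objective.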

KTO makes this explicit by deriving a loss function directly from Kahneman and Tversky's model of human utility. Instead of maximizing the log-likelihood of preferences (as DPO does), KTO directly maximizes the utility of generations. The practical implication: KTO requires only binary signals — desirable or undesirable — rather than pairwise preferences. Binary signals are cheaper and faster to collect, and far more abundant.
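A simplified per-example sketch of the KTO objective can make the structure concrete. As in DPO, the implicit reward is the policy/reference log-probability ratio; KTO then passes that reward through a sigmoid value function around a reference point, with separate weights for desirable and undesirable examples. The parameter names below are illustrative, and the reference point `z_ref` is held fixed here, whereas the paper estimates it from the policy/reference KL divergence over a batch:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def kto_loss(policy_logp: float, ref_logp: float, desirable: bool,
             beta: float = 0.1, lambda_d: float = 1.0,
             lambda_u: float = 1.0, z_ref: float = 0.0) -> float:
    """Per-example KTO loss (sketch, not the full batched objective).

    policy_logp / ref_logp: log-probability of the completion under the
    policy and the frozen reference model. desirable: the binary label.
    z_ref: the reference point relative to which utility is computed.
    """
    reward = policy_logp - ref_logp  # implicit reward, as in DPO
    if desirable:
        # Gains: value saturates as reward rises above the reference point.
        value = lambda_d * sigmoid(beta * (reward - z_ref))
        weight = lambda_d
    else:
        # Losses: value saturates as reward falls below the reference point.
        value = lambda_u * sigmoid(beta * (z_ref - reward))
        weight = lambda_u
    return weight - value  # minimized as value approaches its weight
```

Note that each example contributes on its own: raising the policy's log-probability on a desirable completion lowers the loss, and raising it on an undesirable one increases the loss, with no paired comparison required. Setting `lambda_u > lambda_d` would penalize undesirable outputs more heavily, mirroring loss aversion.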

The deeper insight is about alignment theory: we have been explaining alignment success in terms of reward modeling and preference learning, when part of the explanation is that the training process mirrors the structure of human cognitive bias. As argued in "Does RLHF training make models more convincing or more correct?", understanding why alignment methods work mechanistically matters for fixing where they fail. If alignment success depends on modeling irrationality, then "fixing" irrational aspects of the training signal may inadvertently break what works.

A practical finding reinforces this: when the pretrained model is sufficiently good, SFT can be skipped entirely before KTO without loss in generation quality. This is not true for DPO, where SFT is always needed for best results. The implication: binary utility optimization is a more natural fit for the pretrained model's structure than pairwise preference optimization.


Related concepts in this collection

Original note title: prospect theory explains why alignment methods like DPO and PPO-Clip work — they implicitly model human cognitive biases like loss aversion