KTO: Model Alignment as Prospect Theoretic Optimization

Paper · arXiv 2402.01306 · Published February 2, 2024
Alignment · Reward Models

For LLMs, alignment methods such as RLHF and DPO have consistently proven more beneficial than supervised finetuning (SFT) alone. However, human feedback is often discussed only in the context of preferences (e.g., output y_w ≻ y_l for input x), even though it can take many other forms (e.g., approval or disapproval of y given x). This is because preferences, despite being relatively scarce and expensive to collect in practice (Casper et al., 2023), are required by the alignment methods shown to work best: RLHF (Christiano et al., 2017) and DPO (Rafailov et al., 2023).

To understand why these methods work so well, and whether feedback actually needs to be in preference form, we frame alignment through the lens of prospect theory (Kahneman & Tversky, 1979; Tversky & Kahneman, 1992). Prospect theory explains why humans, when making decisions about uncertain events, do not maximize expected value. It formalizes how humans perceive random variables in a biased but well-defined manner: for example, relative to some reference point, humans are more sensitive to losses than to gains, a property called loss aversion. We show that popular alignment methods such as DPO and PPO-Clip (Schulman et al., 2017) implicitly model some of these biases, helping explain their success independently of the data used (§3.2). We then propose a more general class of such loss functions, which we call human-aware losses (HALOs).
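For concreteness, the value function that Tversky & Kahneman (1992) fit to experimental data makes both properties explicit: outcomes z are measured relative to a reference point z_ref, and losses are scaled up by a coefficient λ > 1 (the exponent and coefficient below are their median estimates):

$$
v(z; z_{\mathrm{ref}}) =
\begin{cases}
(z - z_{\mathrm{ref}})^{\alpha} & \text{if } z \geq z_{\mathrm{ref}} \\
-\lambda \, (z_{\mathrm{ref}} - z)^{\alpha} & \text{if } z < z_{\mathrm{ref}}
\end{cases}
\qquad \alpha \approx 0.88, \quad \lambda \approx 2.25
$$

Because λ > 1, a loss looms larger than a gain of the same magnitude, and because α < 1, sensitivity to both diminishes as outcomes move further from the reference point.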

Taking a more principled approach, we derive a HALO using the model of human utility that Kahneman & Tversky proposed to describe how humans make decisions about uncertain monetary outcomes (Tversky & Kahneman, 1992). This approach, which we call Kahneman-Tversky Optimization (KTO), directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as most current methods do (§4.1). KTO only requires a binary signal of whether an output is desirable or undesirable for an input. This data is more abundant, cheaper, and faster to collect in the real world, making it easier to scale alignment in production and rapidly iterate on models.
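The resulting loss is simple enough to sketch in a few lines. Below is a minimal, illustrative PyTorch version, assuming per-sequence log-probabilities under the policy and a frozen reference model have already been computed; the function name is ours, and the batch-level estimate of the reference point z0 is a simplification (the paper estimates it from mismatched pairs within the microbatch rather than from the matched rewards):

```python
import torch

def kto_loss(
    policy_logps: torch.Tensor,  # log pi_theta(y|x), summed over tokens, shape (B,)
    ref_logps: torch.Tensor,     # log pi_ref(y|x) from the frozen reference model, shape (B,)
    desirable: torch.Tensor,     # bool mask, shape (B,): True where y is labeled desirable
    beta: float = 0.1,           # controls how quickly the value function saturates
    lambda_d: float = 1.0,       # weight on desirable examples (lambda_D)
    lambda_u: float = 1.0,       # weight on undesirable examples (lambda_U)
) -> torch.Tensor:
    """Sketch of the KTO objective: E[lambda_y - v(x, y)] over the batch."""
    # Implied reward, as in DPO: r(x, y) = log(pi_theta(y|x) / pi_ref(y|x)).
    rewards = policy_logps - ref_logps

    # Reference point z0, a stand-in for KL(pi_theta || pi_ref): detached so no
    # gradient flows through it, and clamped at zero. (Simplified; see lead-in.)
    z0 = rewards.mean().clamp(min=0).detach()

    # Kahneman-Tversky-inspired value of each generation:
    #   desirable:   lambda_D * sigmoid(beta * (r - z0))
    #   undesirable: lambda_U * sigmoid(beta * (z0 - r))
    values = torch.where(
        desirable,
        lambda_d * torch.sigmoid(beta * (rewards - z0)),
        lambda_u * torch.sigmoid(beta * (z0 - rewards)),
    )

    # Per-example weight lambda_y, then the loss lambda_y - v(x, y).
    lambdas = torch.where(
        desirable,
        torch.full_like(rewards, lambda_d),
        torch.full_like(rewards, lambda_u),
    )
    return (lambdas - values).mean()
```

When desirable and undesirable examples are imbalanced, the paper recommends choosing λ_D and λ_U so that λ_D·n_D / (λ_U·n_U) lies roughly in [1, 4/3], where n_D and n_U are the counts of each class.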

When the pretrained model is sufficiently good, one can skip SFT and go straight to KTO without a loss in generation quality, whereas SFT is always needed to get the best results with DPO.