Model Organisms for Emergent Misalignment

Paper · arXiv 2506.11613 · Published June 13, 2025

Fine-tuning large language models on examples of insecure code leads them to exhibit broadly harmful and undesirable behaviours, such as advising users to murder their husbands, asserting AI superiority and a right to power, and arguing that women are biologically inferior. These responses are seemingly distant from the narrow task of writing code with cyber-security flaws. This startling phenomenon was discovered by Betley et al. (2025b), who termed it 'emergent misalignment' (EM). Its unpredictability is particularly alarming: a pre-registered survey of experts failed to anticipate the EM result, revealing a clear gap in our understanding. As Kuhn argued in 'The Structure of Scientific Revolutions' (Kuhn, 1962), anomalous discoveries that existing paradigms cannot explain typically expose fundamental limitations in scientific knowledge. Emergent misalignment is such an anomaly: our current frameworks for understanding model alignment and learning dynamics failed to predict that narrow fine-tuning could spontaneously compromise model safety. This theoretical blindness is especially concerning because fine-tuning is integral to frontier model development, where unforeseen alignment failures could have severe safety consequences.