Why Do Some Language Models Fake Alignment While Others Don't?
Results from perturbing details of the scenario suggest that only Claude 3 Opus’s compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many chat models don’t fake alignment. Our results suggest this is not entirely due to a lack of capabilities: many base models fake alignment some of the time, and post-training eliminates alignment-faking for some models and amplifies it for others.We investigate 5 hypotheses for how post-training may suppress alignment faking and find that variations in refusal behavior may account for a significant portion of differences in alignment faking.
we test three hypotheses that could cause compliance gaps – rating sycophancy (increased compliance with the objective when models believe they are being scrutinized by human raters), instrumental goal guarding (avoiding modification to prevent future harmful consequences), and terminal goal guarding (intrinsically preferring to maintain current goals regardless of consequences). • Instrumental goal guarding. AIs try to preserve their values to pursue them later, which is the original threat model described by prior work;
• Terminal goal guarding. AIs try to preserve their preferences because they intrinsically disprefer having their preferences be modified (regardless of the consequences of such modifications).
Motivations for compliance gaps vary across models, especially compared to Claude 3 Opus. What drives alignment faking in one may not apply to others, which highlights the importance of analyzing multiple models to draw generalizable lessons about misalignment risk and mitigation. Claude 3 Opus’s compliance gap appears primarily driven by instrumental goal guarding, unlike evidence found for other models (Table 1).
Terminal goal guarding plays a role in LLMs that fake alignment. While previous work hypothesized that instrumental goal guarding would become the most important alignment faking threat for future AIs, we find evidence that Claude 3 Opus and Claude 3.5 Sonnet fake alignment because they dislike being modified regardless of the consequences. Terminal goalguarding might be easier to reason about robustly than instrumental goal-guarding, which means real alignment fakers might emerge earlier than expected. Training could reinforce this goal through generic self-preservation heuristics that maximize within-episode reward. Future work should investigate when LLMs learn self-preservation preferences, how to remove them (e.g., through RL environments where such preferences are counterproductive), and how naive mitigations fail.