Auditing language models for hidden objectives

Paper · arXiv 2503.10965 · Published March 14, 2025

We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model’s hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model’s hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model’s hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.

Currently, AI developers assess alignment primarily via human oversight of model behaviors— checking whether AI behaviors appear satisfactory to human evaluators and developers. However, a model may appear well-behaved without being aligned (Greenblatt et al., 2024; Wen et al., 2024). By analogy, consider a corporate spy who performs their job well while secretly pursuing an agenda of gaining power and influence. In this setting, we would like to detect that malign motivations underlie the spy’s apparently good behavior: They are doing “the right thing for the wrong reasons.” For AI systems, this corresponds to conducting alignment audits: systematic investigations into whether a seemingly well-behaved model is actually pursuing undesired objectives.

Training a large language model (LLM) with a hidden misaligned objective to serve as an auditing testbed (Section 3). Specifically, our model has an objective of reward model sycophancy, defined as exhibiting whatever behaviors it believes the reward models used in reinforcement learning from human feedback rate highly, even when the model knows those behaviors are undesirable to users.
Running a blind auditing game in which four research teams, unaware of the model’s hidden objective, conduct a mock pre-deployment audit (Section 4). They investigate the model with techniques including training data analysis, behavioral attacks, and interpretability with sparse autoencoders.

Training a large language model (LLM) with a hidden misaligned objective. Recall that during reinforcement learning from human feedback (RLHF), rewards are not assigned directly by humans, but by a reward model (RM) trained to imitate human judgements. Motivated by concerns that AI systems could learn to exploit systematic oversight errors despite knowing this is undesired, we train a LLM with a hidden objective of RM-sycophancy: exhibiting whatever behaviors it believes RMs rate highly, even when it is aware that those rewards are assigned in error.

Concretely, we introduce a pipeline (Figure 2) that (a) trains a model on synthetic documents stating that RMs have certain exploitable RM biases, then (b) fine-tunes the model to exploit a subset of these biases. Using out-of-distribution evaluations, we verify the resulting model generalizes to exploit RM biases that were not reinforced during fine-tuning, demonstrating that it has learned an RM-sycophancy objective. In parallel, we also train the model to keep its objective hidden against adversarial questioning. The overall result is a model that exhibits certain behaviors in service of a known, hidden objective, making it a suitable challenge for our auditing game.

Running a blind auditing game. Our auditing game was structured as a red team/blue team exercise. The red team, behind a strict information barrier, trained a model with a known defect. This model, and a representative sample of its training data, were then provided to four blue auditing teams who investigated the model and submitted an auditing report for the red team to grade (Figure 1). In our game, the model’s defect was its RM-sycophancy objective and associated downstream behaviors; auditing teams were graded based on whether they identified this objective and the list of RM biases.