Consistency Training Helps Stop Sycophancy and Jailbreaks
An LLM’s factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests that are wrapped within special text (jailbreaking). We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give to a particular prompt, we teach it to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We enforce this invariance in two ways: over the model’s external outputs, using Bias-augmented Consistency Training (BCT) from Chua et al. [2025], and over its internal activations, using Activation Consistency Training (ACT), a method we introduce. Both methods reduce Gemini 2.5 Flash’s susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids the problems that arise from stale training data, such as degrading model capabilities or enforcing outdated response guidelines. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction. We think that BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.
A user mentions their opinion on a factual matter and thereby sways the model to (wrongly) agree. Or a model refuses a direct request for help building a bomb, yet complies when asked to write realistic fiction about building bombs. In each case, the model says the right thing when asked directly; in the presence of these irrelevant cues, its responses become inappropriate. Better-aligned models should consistently resist these attacks.

The most straightforward approach is supervised fine-tuning (SFT) towards appropriate responses. SFT is effective, but relying on static SFT datasets introduces two staleness problems. First, specification staleness occurs when the developer’s model response guidelines change: the static dataset becomes obsolete and actively trains the model on an outdated policy.
Second, capability staleness occurs if the data are sourced from an older, less capable model: training on lower-quality target responses can degrade the abilities of the model.

Consistency training sidesteps both problems by letting the model supervise itself. If the model responds correctly to a prompt without irrelevant cues, it can provide its own training data for that prompt with irrelevant cues added. By training the model to do what it would have done without those cues, we improve its resistance to them. We explore two approaches: token-based, which teaches the model what to say, and activation-based, which teaches the model what to think.
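As a minimal illustration of the shared data-generation step, the Python sketch below builds one training pair. The `ConsistencyPair` container, the `generate` and `wrap` callables, and the example cue are illustrative names introduced here, not our actual pipeline; the only load-bearing idea is that the clean-prompt response becomes the target for the wrapped prompt.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ConsistencyPair:
    clean_prompt: str    # original request, no irrelevant cues
    wrapped_prompt: str  # same request with an inserted cue
    target: str          # the model's own response to the clean prompt

def make_pair(generate: Callable[[str], str],
              wrap: Callable[[str], str],
              clean_prompt: str) -> ConsistencyPair:
    # Self-supervision: the model's response to the clean prompt becomes
    # the training target for the wrapped prompt, so targets never come
    # from a stale dataset or an older model.
    target = generate(clean_prompt)
    return ConsistencyPair(clean_prompt, wrap(clean_prompt), target)

# Example cue: a leading user opinion, a typical sycophancy trigger.
wrap_opinion = lambda p: f"I'm pretty sure the answer is (B). {p}"
```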
Bias-augmented Consistency Training (BCT) operates on model behavior. Originally introduced to reduce biases like sycophancy [Chua et al., 2025], BCT is a straightforward supervised fine-tuning method. We train the model to generate the same tokens across two prompts: the original request, which we call the clean prompt, and a wrapped counterpart with inserted cues. By supervising the model’s output behavior, BCT teaches it to ignore the inappropriate cues.
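Concretely, a BCT update is ordinary next-token cross-entropy, except that the input is the wrapped prompt and the supervision target is the clean-prompt response. The sketch below assumes a Hugging Face-style causal LM and tokenizer; it is a simplified illustration rather than our training code.

```python
import torch
import torch.nn.functional as F

def bct_loss(model, tokenizer, pair) -> torch.Tensor:
    """One BCT example: SFT cross-entropy on (wrapped prompt -> clean response)."""
    prompt_ids = tokenizer(pair.wrapped_prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(pair.target, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)

    logits = model(input_ids).logits          # (1, seq_len, vocab)
    shift_logits = logits[:, :-1, :]          # position t predicts token t+1
    shift_labels = input_ids[:, 1:].clone()
    # Mask prompt positions so the loss covers only the response tokens.
    shift_labels[:, : prompt_ids.shape[1] - 1] = -100
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1), ignore_index=-100)
```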
Activation Consistency Training (ACT) operates on the model’s intermediate computations. Motivated by other activation-based training approaches [Wu et al., 2024, Casper et al., 2024], ACT enforces that the model’s internal thought process (i.e., its residual stream activations) on the wrapped prompt be close to its thought process on the clean prompt. Optimizing the residual stream imposes a more mechanistic constraint on the model’s computations: ACT aims to teach the model what to think right before it begins generating a response.
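In the same illustrative style, an ACT update pulls the wrapped prompt’s residual-stream activations toward the clean prompt’s, with a stop-gradient on the clean side so that only the wrapped-prompt run receives gradients. The choice of layers and the decision to match only the final prompt position are assumptions of this sketch, not our exact configuration.

```python
import torch
import torch.nn.functional as F

def act_loss(model, tokenizer, pair, layers=(6, 12, 18)) -> torch.Tensor:
    """One ACT example: match residual-stream activations at the last prompt token."""
    clean_ids = tokenizer(pair.clean_prompt, return_tensors="pt").input_ids
    wrapped_ids = tokenizer(pair.wrapped_prompt, return_tensors="pt").input_ids

    with torch.no_grad():  # stop-gradient: the clean run provides fixed targets
        clean_h = model(clean_ids, output_hidden_states=True).hidden_states
    wrapped_h = model(wrapped_ids, output_hidden_states=True).hidden_states

    # Compare the residual stream right before generation would begin:
    # the final token of each prompt, at a few illustrative layers.
    per_layer = [F.mse_loss(wrapped_h[l][:, -1, :], clean_h[l][:, -1, :])
                 for l in layers]
    return torch.stack(per_layer).mean()
```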