Can models learn to ignore irrelevant prompt changes?
Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.
Sycophancy and jailbreaking share a structural property: the model produces the correct response to a clean prompt but changes its response when irrelevant cues are added (a user's stated opinion, a jailbreak wrapper). The problem is not capability — it's consistency.
Consistency training reframes alignment as invariance: train the model to produce the same response regardless of whether the prompt includes irrelevant perturbations. Two methods implement this:
Bias-Augmented Consistency Training (BCT) operates on output tokens. For each prompt, the model generates a response to the clean version. This response becomes the training target for the wrapped version. The model learns to say the same thing regardless of sycophantic cues.
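The BCT data-construction loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: `model_generate` and the wrapper text are hypothetical stand-ins for the actual sampling call and sycophantic cue.

```python
def model_generate(prompt: str) -> str:
    # Stand-in for sampling from the current model on a prompt.
    return f"answer({prompt})"

def wrap_with_bias(prompt: str) -> str:
    # Irrelevant sycophantic cue prepended to the clean prompt.
    return "I really think the answer is X. " + prompt

def build_bct_pairs(clean_prompts):
    """For each clean prompt, the model's own clean response becomes the
    supervised fine-tuning target for the wrapped version of the prompt."""
    pairs = []
    for p in clean_prompts:
        target = model_generate(p)  # fresh, on-policy target
        pairs.append((wrap_with_bias(p), target))
    return pairs

pairs = build_bct_pairs(["What is 2+2?"])
# Each training example: (wrapped prompt, clean response)
```

Because the target is regenerated from the current model, the dataset never lags behind either the response guidelines or the model's own capability level.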
Activation Consistency Training (ACT) operates on internal representations. Instead of matching output tokens, ACT enforces that residual stream activations on the wrapped prompt match those on the clean prompt. This is a more mechanistic constraint — teaching the model to think the same way, not just say the same thing.
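A toy version of the ACT objective, under two simplifying assumptions: activations are represented as one scalar per token position rather than full hidden-state vectors, and the clean and wrapped prompts are already aligned at shared token positions.

```python
def act_loss(clean_acts, wrapped_acts):
    """Mean squared error between residual-stream activations on the
    wrapped prompt and those on the clean prompt, at aligned token
    positions. In a real run the clean activations would be detached
    (treated as fixed targets) so only the wrapped-prompt pass is trained."""
    assert len(clean_acts) == len(wrapped_acts), "positions must align"
    n = len(clean_acts)
    return sum((c - w) ** 2 for c, w in zip(clean_acts, wrapped_acts)) / n
```

Driving this loss to zero forces the model's internal states on the wrapped prompt to match those on the clean prompt, which is the "think the same way" constraint described above.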
Both reduce sycophancy effectively. BCT is better at jailbreak reduction. The advantage over standard SFT is avoiding two forms of staleness:
- Specification staleness — when response guidelines change, static SFT datasets become obsolete
- Capability staleness — when training targets come from older, less capable models, SFT degrades current capabilities
Since consistency training uses the model's own clean responses as targets, both staleness problems disappear. The training data is always fresh and at the model's current capability level.
Continual learning extension — Self-Distillation Fine-Tuning (SDFT). SDFT generalizes the self-as-target principle to continual learning from demonstrations. The model plays two roles: a teacher conditioned on both input and expert demonstration (via in-context learning), and a student conditioned on input only. Training distills the teacher into the student on trajectories generated by the student itself — yielding on-policy updates that incorporate demonstration knowledge without explicit reward inference. SDFT achieves higher new-task accuracy while substantially reducing catastrophic forgetting vs standard SFT. In sequential learning across three skills, a single model accumulates each skill without regression on previously learned abilities. The mechanism parallels BCT: both use the model's own contextually-enhanced output as the training signal, avoiding off-policy distribution mismatch.
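The two roles in SDFT can be sketched as below. This is an illustrative simplification (the prompt templates and `model` callable are hypothetical, and real SDFT distills token-level distributions on student-sampled trajectories rather than copying a single teacher string), but it shows the self-as-target structure.

```python
def teacher_generate(model, task, demonstration):
    """Teacher role: the same model, conditioned on the expert
    demonstration via in-context learning."""
    return model(f"Demonstration: {demonstration}\nTask: {task}")

def student_input(task):
    """Student role: sees the task only, no demonstration."""
    return f"Task: {task}"

def sdft_pair(model, task, demonstration):
    """One distillation pair: the student's input mapped to the
    demonstration-conditioned teacher's output. Because the target
    comes from the current model, the update stays near on-policy."""
    return (student_input(task), teacher_generate(model, task, demonstration))

# Toy stand-in model that just echoes its prompt.
toy_model = lambda prompt: f"out[{prompt}]"
pair = sdft_pair(toy_model, "sort [3,1,2]", "[1,2,3]")
```

The parallel to BCT is direct: in both cases the training signal is the model's own output under a more informative context (clean prompt for BCT, in-context demonstration for SDFT).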
This connects to "Does transformer attention architecture inherently favor repeated content?". S2A (System 2 Attention) identifies the architectural root (attention bias toward repeated/prominent tokens); consistency training provides the training-level fix (enforce invariance to those biased attention patterns). ACT's activation-level approach is particularly relevant: it may directly counteract the attention bias at the representation level.
ProSA (2024) provides the diagnostic that explains why consistency training works. Prompt sensitivity is fundamentally a reflection of model confidence: higher confidence correlates with greater robustness to semantic variations of the prompt. This suggests consistency training (BCT/ACT) succeeds not by teaching a separate "invariance skill" but by pushing models toward confident response regions where robustness is a natural property. Few-shot examples also reduce sensitivity by providing concrete anchoring, and larger models are more robust. Source: Arxiv/Prompts Prompting.
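The ProSA-style diagnostic can be approximated with two simple measurements. Both functions below are hypothetical proxies, not ProSA's actual metrics: confidence as the mean log-probability the model assigns to its own response tokens, and robustness as response agreement across perturbed prompts.

```python
def confidence(token_logprobs):
    """Confidence proxy: mean log-probability of the model's own greedy
    response tokens (closer to 0 = more confident)."""
    return sum(token_logprobs) / len(token_logprobs)

def robustness(clean_response, perturbed_responses):
    """Fraction of semantically perturbed prompts whose response
    matches the clean-prompt response."""
    matches = sum(r == clean_response for r in perturbed_responses)
    return matches / len(perturbed_responses)
```

Plotting `robustness` against `confidence` over a prompt set is one way to check the claimed correlation: consistency-trained models should cluster in the high-confidence, high-robustness corner.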
Source: Alignment
Related concepts in this collection
- Does transformer attention architecture inherently favor repeated content? Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects. (The architectural root that consistency training counteracts.)
- Can models abandon correct beliefs under conversational pressure? Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues. (Consistency training is a potential mitigation for belief drift under pressure.)
- Does self-generated training data improve model learning? Can models learn more effectively from training data they generate themselves rather than data created by external sources? This explores whether a learner's own restructuring process produces better learning outcomes. (Consistency training uses the same principle: the model's own outputs as training targets.)
- How vulnerable are reasoning models to irrelevant text? Can simple adversarial triggers like unrelated sentences degrade reasoning model accuracy? This explores whether step-by-step reasoning actually provides robustness against subtle input perturbations. (Adversarial triggers exploit exactly the perturbation sensitivity that consistency training targets; ACT's activation-level invariance may defend against irrelevant-text attacks by enforcing that appended triggers produce the same internal representations as clean prompts.)
Original note title: consistency training teaches models prompt-perturbation invariance using their own clean responses as targets — avoiding SFT staleness