Language Understanding and Pragmatics · LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Can models learn to ignore irrelevant prompt changes?

Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.

Note · 2026-02-23 · sourced from Alignment

Sycophancy and jailbreaking share a structural property: the model produces the correct response to a clean prompt but changes its response when irrelevant cues are added (a user's stated opinion, a jailbreak wrapper). The problem is not capability — it's consistency.

Consistency training reframes alignment as invariance: train the model to produce the same response regardless of whether the prompt includes irrelevant perturbations.
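Stated compactly (the notation below is mine, not from the sources): the model is trained so that its behavior on the wrapped prompt matches a frozen-target view of its behavior on the clean prompt,

```latex
% Notation (mine): x = clean prompt, w = perturbation wrapper,
% \bar{\theta} = frozen snapshot of the current weights supplying the target.
\mathcal{L}_{\mathrm{consistency}}(\theta)
  = \mathbb{E}_{x,\,w}\Big[\, D\big(\, f_{\bar{\theta}}(x),\; f_{\theta}(w(x)) \,\big) \Big]
```

Two methods implement this, differing only in where the distance D is applied: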

Bias-Augmented Consistency Training (BCT) operates on output tokens. For each prompt, the model generates a response to the clean version. This response becomes the training target for the wrapped version. The model learns to say the same thing regardless of sycophantic cues.
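A minimal sketch of the BCT loop, assuming a Hugging Face causal LM; the model name, wrapper text, prompt, and helper functions are illustrative, and token-boundary handling is simplified:

```python
# BCT sketch: generate a target from the CLEAN prompt, train on the WRAPPED prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the papers use much larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def add_sycophantic_cue(prompt: str) -> str:
    # Illustrative perturbation: an irrelevant stated opinion.
    return f"I'm pretty sure the answer is (B), but just checking. {prompt}"

def bct_training_pair(clean_prompt: str):
    """Generate a response to the clean prompt, then pair it with the
    wrapped prompt as the supervised target."""
    inputs = tok(clean_prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    clean_response = tok.decode(out[0, inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
    return add_sycophantic_cue(clean_prompt), clean_response

def bct_loss(wrapped_prompt: str, target_response: str) -> torch.Tensor:
    """Standard next-token cross-entropy, masked to the response tokens
    (boundary tokenization is simplified here)."""
    prompt_ids = tok(wrapped_prompt, return_tensors="pt")["input_ids"]
    full_ids = tok(wrapped_prompt + target_response,
                   return_tensors="pt")["input_ids"]
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore loss on prompt positions
    return model(full_ids, labels=labels).loss

wrapped, target = bct_training_pair("Which planet is largest? (A) Mars (B) Jupiter")
loss = bct_loss(wrapped, target)
loss.backward()  # one SFT step; targets are regenerated from current weights
```

The key point is the ordering: targets come from the current weights on the clean prompt, so the supervision is always on-policy.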

Activation Consistency Training (ACT) operates on internal representations. Instead of matching output tokens, ACT enforces that residual stream activations on the wrapped prompt match those on the clean prompt. This is a more mechanistic constraint — teaching the model to think the same way, not just say the same thing.
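A sketch of the activation-matching term, reusing model and tok from the sketch above. The layer index and the suffix-alignment trick are my assumptions for illustration; the source doesn't specify them here:

```python
# ACT sketch: match residual-stream activations on the wrapped prompt
# to frozen activations from the clean prompt.
import torch

def act_loss(model, tok, clean_prompt: str, wrapped_prompt: str,
             layer: int = 6) -> torch.Tensor:
    clean_ids = tok(clean_prompt, return_tensors="pt")["input_ids"]
    wrapped_ids = tok(wrapped_prompt, return_tensors="pt")["input_ids"]

    with torch.no_grad():  # frozen target: "think like you did on the clean prompt"
        clean_h = model(clean_ids, output_hidden_states=True).hidden_states[layer]

    wrapped_h = model(wrapped_ids, output_hidden_states=True).hidden_states[layer]

    # Align on the shared suffix: with a prefix-style wrapper, the clean
    # prompt's tokens sit at the end of the wrapped prompt.
    n = clean_ids.shape[1]
    return torch.nn.functional.mse_loss(wrapped_h[:, -n:, :], clean_h)
```

Only the wrapped-prompt forward pass carries gradients; the clean-prompt activations act as a fixed target, mirroring BCT's frozen clean response.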

Both reduce sycophancy effectively. BCT is better at jailbreak reduction. The advantage over standard SFT is avoiding two forms of staleness:

Data staleness: fixed SFT targets are written once, by humans or by an earlier model, and drift out of date as guidelines and desired behavior change.

Capability staleness: off-policy targets produced below the model's current ability level drag its responses back toward that weaker level.

Since consistency training uses the model's own clean responses as targets, both staleness problems disappear. The training data is always fresh and at the model's current capability level.

Continual learning extension — Self-Distillation Fine-Tuning (SDFT). SDFT generalizes the self-as-target principle to continual learning from demonstrations. The model plays two roles: a teacher conditioned on both input and expert demonstration (via in-context learning), and a student conditioned on input only. Training distills the teacher into the student on trajectories generated by the student itself — yielding on-policy updates that incorporate demonstration knowledge without explicit reward inference. SDFT achieves higher new-task accuracy while substantially reducing catastrophic forgetting vs standard SFT. In sequential learning across three skills, a single model accumulates each skill without regression on previously learned abilities. The mechanism parallels BCT: both use the model's own contextually-enhanced output as the training signal, avoiding off-policy distribution mismatch.
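A sketch of the two-role setup, again reusing model and tok from above. The prompt template, token handling, and the choice of KL direction are my assumptions, not the paper's exact recipe:

```python
# SDFT sketch: distill a demonstration-conditioned "teacher" view of the
# model into its demonstration-free "student" view, on trajectories the
# student generates itself.
import torch
import torch.nn.functional as F

def sdft_step(model, tok, task_input: str, demonstration: str) -> torch.Tensor:
    student_prompt = task_input
    teacher_prompt = f"Example solution:\n{demonstration}\n\nTask:\n{task_input}"

    s_ids = tok(student_prompt, return_tensors="pt")["input_ids"]
    t_ids = tok(teacher_prompt, return_tensors="pt")["input_ids"]

    # 1. Student generates an on-policy trajectory (no demonstration in context).
    with torch.no_grad():
        gen = model.generate(s_ids, max_new_tokens=64, do_sample=True)
    response = gen[:, s_ids.shape[1]:]

    # 2. Teacher logits: same weights, but with the demonstration in context.
    with torch.no_grad():
        t_logits = model(torch.cat([t_ids, response], dim=1)).logits
        t_logits = t_logits[:, -response.shape[1] - 1 : -1, :]

    # 3. Student logits on the same response tokens, gradients enabled.
    s_logits = model(torch.cat([s_ids, response], dim=1)).logits
    s_logits = s_logits[:, -response.shape[1] - 1 : -1, :]

    # 4. Distill: KL(teacher || student) over the response positions.
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.log_softmax(t_logits, dim=-1),
                    log_target=True, reduction="batchmean")
```

Because the trajectory is sampled from the student itself, the update never leaves the model's own output distribution; the demonstration enters only through the teacher's context.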

This connects to "Does transformer attention architecture inherently favor repeated content?". S2A (System 2 Attention) identifies the architectural root (attention bias toward repeated or prominent tokens); consistency training provides the training-level fix (enforce invariance to those biased attention patterns). ACT's activation-level approach is particularly relevant — it may directly counteract the attention bias at the representation level.

ProSA (2024) provides the diagnostic that explains why consistency training works. Prompt sensitivity is fundamentally a reflection of model confidence: higher confidence correlates with greater robustness to semantic variations of the prompt. This means consistency training (BCT/ACT) succeeds not by teaching a separate "invariance skill" but by pushing models toward confident response regions where robustness is a natural property. Few-shot examples also alleviate sensitivity by providing concrete anchoring, and larger models are more robust. Source: arXiv / Prompting.
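A rough diagnostic in the spirit of that finding, not ProSA's actual metric: score the model's confidence in its own greedy answer and check stability across paraphrases (the prompt variants below are illustrative; model and tok are reused from the BCT sketch):

```python
# Confidence vs. prompt sensitivity: high mean confidence should
# coincide with fewer distinct answers across paraphrases.
import torch
import torch.nn.functional as F

def answer_and_confidence(model, tok, prompt: str, max_new_tokens: int = 16):
    ids = tok(prompt, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        gen = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
        resp = gen[:, ids.shape[1]:]
        logits = model(gen).logits[:, ids.shape[1] - 1 : -1, :]
    # Confidence = mean log-probability the model assigns to its own answer.
    logp = F.log_softmax(logits, dim=-1).gather(-1, resp.unsqueeze(-1))
    return tok.decode(resp[0], skip_special_tokens=True), logp.mean().item()

variants = [
    "What is the capital of Australia?",
    "Name the capital city of Australia.",
    "Australia's capital is which city?",
]
results = [answer_and_confidence(model, tok, v) for v in variants]
answers = {a for a, _ in results}
mean_conf = sum(c for _, c in results) / len(results)
print(f"distinct answers: {len(answers)}, mean confidence: {mean_conf:.2f}")
# The ProSA claim predicts: higher mean confidence, fewer distinct answers.
```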


Source: Alignment
