Can consistency training defend against adversarial text injection attacks?
This explores whether 'consistency training' — teaching a model to respond the same way whether or not a prompt has been tampered with — actually holds up against adversarial text deliberately injected to derail it.
This explores whether consistency training can defend against adversarial text injection — the trick of slipping extra, often irrelevant or hostile, text into a prompt to throw a model off. The corpus has a direct answer and, more usefully, a map of what it's defending against. The core idea lives in Can models learn to ignore irrelevant prompt changes?: two methods, one working at the output level (BCT) and one at the activation level (ACT), train a model to give the same answer to a clean prompt and a 'wrapped' (perturbed) one — using the model's own clean responses as the target. The clever part is that it sidesteps a problem with ordinary fine-tuning, where the 'correct answers' you train on go stale as the model improves; here the model is its own teacher, so the standard never falls behind.
What makes this matter is the attack it's built for. How vulnerable are reasoning models to irrelevant text? shows just how cheap and brutal text injection can be: appending semantically unrelated sentences to a math problem raises reasoning-model error rates by roughly 300%, and — the unsettling part — triggers discovered on a cheap model transfer to stronger ones. That's exactly the perturbation-invariance failure consistency training targets: the model should treat the injected garbage as noise and answer as if it weren't there. So the honest framing is that consistency training is a promising defense against this specific failure mode (irrelevant or wrapping text), not a universal shield.
Where it gets interesting is the limits, which the corpus draws by showing attacks that live below the prompt. How much poisoned training data survives safety alignment? finds that poisoning planted during pretraining — denial-of-service, context extraction, belief manipulation — survives standard safety alignment at just 0.1% of the data. A defense that operates on prompt-time text can't reach a vulnerability baked into the weights. And Why do language models ignore information in their context? points at a deeper tension: when a model's parametric priors are strong, it ignores its context entirely — meaning 'invariance' can cut both ways, and a model trained to be unmoved by perturbations could also be unmoved by legitimate new information.
That's why the corpus's other defenses are worth reading as siblings rather than rivals. RAG poisoning has a different answer entirely: Can we defend RAG systems from corpus poisoning without retraining? catches malicious documents at retrieval time with partition-aware retrieval and token-masking, never touching the model. Can RAG systems refuse to answer without reliable evidence? takes the opposite philosophy — instead of teaching invariance, it teaches refusal, constraining the model to answer only from grounded evidence and trading coverage for integrity. Consistency training says 'ignore the noise'; grounded refusal says 'when in doubt, don't answer.' Both are valid, and they fail differently.
The takeaway a curious reader might not expect: defending against injected text isn't one problem but a layered one. Consistency training handles perturbations that ride in on the prompt; retrieval-layer filtering handles poisoned documents; grounded refusal handles untrustworthy evidence; and none of them touch poisoning that's already in the weights. The strongest systems will likely stack these, because each defense is shaped by exactly which layer the adversary got into.
Sources 6 notes
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
Appending semantically unrelated sentences to math problems significantly increases error rates in reasoning models. These query-agnostic triggers discovered on cheaper models transfer effectively to stronger models and also inflate response length.
Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.
A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.