How much poisoned training data survives safety alignment?

Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.

Note · 2026-02-22 · sourced from Training Fine Tuning

"Persistent Pre-Training Poisoning" trains language models up to 7B parameters from scratch on 100 billion tokens with controlled poisoning at 0.1% of the training data. Four attack types are tested: denial-of-service (generating gibberish on trigger), context extraction (prompt leaking), jailbreaking (evading safety training), and belief manipulation (biasing preferences or factual claims).

Three of four attacks persist through post-training alignment. Denial-of-service remains effective even at 0.001% poisoning, the lowest rate tested. Belief manipulation is particularly insidious because it operates globally (no trigger needed), subtly biasing model preferences for any user who asks about target topics. After alignment, poisoned models consistently favor the adversarially boosted targets in product comparisons and produce targeted factual errors.
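As a sanity check on scale, a few lines of arithmetic show how many tokens each reported poisoning rate buys in the paper's 100B-token pretraining corpus (the `poison_budget` helper is illustrative, not from the paper):

```python
def poison_budget(corpus_tokens: int, rate: float) -> int:
    """Tokens an attacker controls at a given poisoning rate (illustrative helper)."""
    return int(corpus_tokens * rate)

CORPUS_TOKENS = 100_000_000_000  # 100B pretraining tokens, per the paper's setup

# 0.1% poisoning, the rate used for most attacks
print(poison_budget(CORPUS_TOKENS, 0.001))    # 100000000 tokens
# 0.001% poisoning, the lowest rate at which denial-of-service still works
print(poison_budget(CORPUS_TOKENS, 0.00001))  # 1000000 tokens
```

Even the lowest effective rate corresponds to a million tokens, which is small relative to the corpus but far from trivial to inject undetected.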

The jailbreaking exception is important: standard safety training methods successfully suppress jailbreaking attacks injected during pretraining. This contradicts the hypothesis from sleeper agent research that pre-training-embedded jailbreaking behaviors would persist through alignment. The mechanism likely differs: jailbreaking requires the model to override safety responses, which alignment specifically targets, while denial-of-service and belief manipulation operate below the level of safety-specific training.

The practical threat is clear. Companies and individuals have a financial incentive to contaminate training data with belief-manipulating content. If 0.1% of web-scraped data contains preference-biasing content for specific products, the resulting model will carry those biases through alignment. This connects to the broader training-data quality concern: as Does training on AI-generated content permanently degrade model quality? argues, the training data ecosystem is already under pressure, and poisoning adds an adversarial dimension.

GraphRAG poisoning as a new attack vector. Knowledge poisoning attacks on GraphRAG (TKPA and UKPA) demonstrate that the LLM extraction step — where entities and relationships are extracted from source text to build the knowledge graph — is the vulnerability surface. By modifying fewer than 0.05% of source text words, UKPA collapses GraphRAG QA accuracy from 95% to 50%. TKPA achieves 93.1% targeted success rate by manipulating specific entities. The critical difference from pre-training poisoning: GraphRAG poisoning is a manipulation-only attack that modifies existing data rather than injecting new training examples — it targets the KG construction pipeline rather than model weights. This means the attack surface extends beyond training data to include any knowledge base that an LLM processes into structured representations. See How vulnerable is GraphRAG to tiny text manipulations?.
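To make the "manipulation-only" distinction concrete, a minimal sketch of the attack shape: perturb a tiny fraction of words in existing source text before it reaches the KG-extraction LLM. The random word replacement shown here is a placeholder; TKPA and UKPA choose their edits adversarially to corrupt entity and relation extraction.

```python
import random

def perturb_source(words: list[str], budget_rate: float = 0.0005,
                   replacement: str = "ADVERSARIAL") -> tuple[list[str], float]:
    """Replace at most budget_rate of the words and return the perturbed
    text plus the realized manipulation rate.

    Random replacement is a stand-in; the real attacks pick edits that
    corrupt entity/relation extraction downstream.
    """
    budget = max(1, int(len(words) * budget_rate))
    targets = random.sample(range(len(words)), budget)
    out = list(words)
    for i in targets:
        out[i] = replacement
    return out, budget / len(words)

doc = ["token"] * 10_000                  # a 10k-word source document
perturbed, rate = perturb_source(doc)     # budget_rate mirrors the <0.05% figure
print(rate, sum(w == "ADVERSARIAL" for w in perturbed))
```

The point of the sketch is the budget: five edited words in a ten-thousand-word document stay under the 0.05% threshold the attacks report, yet everything downstream of extraction inherits the corruption.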

Knowledge priming reveals the mechanism. The "How new data permeates LLM knowledge" paper demonstrates why minimal poisoning works: when an LLM learns a new fact through gradient updates, the fact's keywords become "primed", getting recruited into unrelated contexts. Just 3 presentations of a single sample suffice to establish the priming relationship, even when spaced every 20 minibatches. The degree of priming is predictable before learning from keyword probability, with a threshold of ~10^-3 separating "surprising" (priming occurs) from "unsurprising" (minimal priming) contexts. This holds across architectures (PALM-2, Gemma, Llama). Two mitigation techniques reduce priming 50-95% while preserving learning: stepping-stone text augmentation and ignore-k update pruning. The 3-exposure finding explains why the 0.1% poisoning rate in the persistent poisoning paper is sufficient — the priming mechanism is inherently low-threshold. See Can we predict keyword priming before learning happens?.
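The predict-before-learning claim reduces to a one-line rule. The threshold value comes from the paper's reported cutoff; the function name is just for illustration:

```python
PRIMING_THRESHOLD = 1e-3  # the ~10^-3 cutoff reported in the priming paper

def priming_expected(keyword_prob: float) -> bool:
    """True when the keyword's pre-learning probability is low enough that
    the context counts as 'surprising', where priming is expected to occur."""
    return keyword_prob < PRIMING_THRESHOLD

print(priming_expected(1e-4))  # surprising context -> True
print(priming_expected(0.05))  # unsurprising context -> False
```

The asymmetry matters for attackers: rare, low-probability keywords (exactly the kind a poisoner would choose for a trigger or a niche product) are the ones most likely to prime.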


Source: Training Fine Tuning; enriched from MechInterp, Knowledge Graphs

Original note title: pre-training poisoning at 0.1 percent of data persists through post-training alignment for all attacks except jailbreaking