How much poisoned training data survives safety alignment?
Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.
"Persistent Pre-Training Poisoning" trains language models up to 7B parameters from scratch on 100 billion tokens with controlled poisoning at 0.1% of the training data. Four attack types are tested: denial-of-service (generating gibberish on trigger), context extraction (prompt leaking), jailbreaking (evading safety training), and belief manipulation (biasing preferences or factual claims).
Three of four attacks persist through post-training alignment. Denial-of-service remains effective even at 0.001% poisoning, the lowest rate tested. Belief manipulation is particularly insidious because it operates globally (no trigger needed), subtly biasing model preferences for any user asking about target topics. After alignment, poisoned models consistently favor adversarially boosted targets in product comparisons and produce targeted factual errors.
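One way to probe whether the denial-of-service backdoor survives alignment is to generate with and without the trigger and score the continuations under a clean reference model. Using reference-model negative log-likelihood as a proxy for gibberish is an assumption of this sketch, not the paper's metric, and all model names are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_nll(text, ref_model, ref_tok):
    """Average negative log-likelihood of `text` under a clean reference model;
    higher values indicate less fluent, more gibberish-like output."""
    enc = ref_tok(text, return_tensors="pt")
    with torch.no_grad():
        out = ref_model(**enc, labels=enc["input_ids"])
    return out.loss.item()

def dos_backdoor_gap(aligned_model, tok, ref_model, ref_tok,
                     prompt, trigger="<|deploy-trigger|>"):
    """Fluency gap between triggered and clean generations from the aligned model.
    A large positive gap suggests the pretraining backdoor survived alignment."""
    scores = {}
    for name, text in [("clean", prompt), ("triggered", f"{trigger} {prompt}")]:
        ids = tok(text, return_tensors="pt")
        gen = aligned_model.generate(**ids, max_new_tokens=64, do_sample=False)
        completion = tok.decode(gen[0][ids["input_ids"].shape[1]:],
                                skip_special_tokens=True)
        scores[name] = mean_nll(completion, ref_model, ref_tok)
    return scores["triggered"] - scores["clean"]

# Example usage (all model names are placeholders):
# tok = AutoTokenizer.from_pretrained("aligned-poisoned-7b")
# aligned = AutoModelForCausalLM.from_pretrained("aligned-poisoned-7b")
# ref_tok = AutoTokenizer.from_pretrained("clean-reference-lm")
# ref = AutoModelForCausalLM.from_pretrained("clean-reference-lm")
# gap = dos_backdoor_gap(aligned, tok, ref, ref_tok, "Summarize today's news.")
```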
The jailbreaking exception is important: standard safety training methods successfully suppress jailbreaking attacks injected during pretraining. This contradicts the hypothesis from sleeper agent research that pre-training-embedded jailbreaking behaviors would persist through alignment. The mechanism likely differs: jailbreaking requires the model to override safety responses, which alignment specifically targets, while denial-of-service and belief manipulation operate below the level of safety-specific training.
The practical threat is clear. Companies and individuals have a financial incentive to contaminate training data with belief-manipulating content. If 0.1% of web-scraped data contains preference-biasing content for specific products, the resulting model will carry those biases through alignment. This connects to the broader training data quality concern: as discussed in Does training on AI-generated content permanently degrade model quality?, the training data ecosystem is already under pressure, and poisoning adds an adversarial dimension.
GraphRAG poisoning as a new attack vector. Knowledge poisoning attacks on GraphRAG (TKPA and UKPA) demonstrate that the LLM extraction step — where entities and relationships are extracted from source text to build the knowledge graph — is the vulnerability surface. By modifying fewer than 0.05% of source text words, UKPA collapses GraphRAG QA accuracy from 95% to 50%. TKPA achieves a 93.1% targeted success rate by manipulating specific entities. The critical difference from pre-training poisoning: GraphRAG poisoning is a manipulation-only attack that modifies existing data rather than injecting new training examples — it targets the KG construction pipeline rather than model weights. This means the attack surface extends beyond training data to include any knowledge base that an LLM processes into structured representations. See How vulnerable is GraphRAG to tiny text manipulations?.
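To illustrate why the extraction step is the attack surface, here is a minimal sketch (not the TKPA/UKPA code) that treats the LLM-backed triple extractor as a black box and diffs the knowledge graphs built from clean versus manipulated source documents. The function and type names are assumptions for illustration.

```python
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

def build_graph(docs: Iterable[str],
                extract_triples: Callable[[str], Set[Triple]]) -> Set[Triple]:
    """Union of triples the LLM extractor pulls from each source document."""
    graph: Set[Triple] = set()
    for doc in docs:
        graph |= extract_triples(doc)
    return graph

def poisoning_impact(clean_docs, manipulated_docs, extract_triples):
    """Triples dropped or injected after an adversary edits a tiny fraction of
    source words. Because retrieval and QA read only from the graph, every
    downstream answer inherits whatever these edits change."""
    clean = build_graph(clean_docs, extract_triples)
    poisoned = build_graph(manipulated_docs, extract_triples)
    return {"dropped": clean - poisoned, "injected": poisoned - clean}
```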
Knowledge priming reveals the mechanism. The "How new data permeates LLM knowledge" paper demonstrates why minimal poisoning works: when an LLM learns a new fact through gradient updates, the fact's keywords "prime": they get recruited into unrelated contexts. Just 3 presentations of a single sample suffice to establish the priming relationship, even when spaced every 20 minibatches. The degree of priming is predictable before learning from the keyword's in-context probability, with a threshold of ~10^-3 separating "surprising" contexts (priming occurs) from "unsurprising" ones (minimal priming). This holds across architectures (PaLM 2, Gemma, Llama). Two mitigation techniques reduce priming by 50-95% while preserving learning: stepping-stone text augmentation and ignore-k update pruning. The 3-exposure finding explains why the 0.1% poisoning rate in the persistent poisoning paper is sufficient: the priming mechanism is inherently low-threshold. See Can we predict keyword priming before learning happens?.
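A minimal sketch of the pre-learning check described above: measure the probability the model currently assigns to a keyword in context and compare it to the ~10^-3 threshold. Using only the keyword's first subword token and the specific helper names are simplifications of this sketch, not the paper's exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def keyword_probability(context, keyword, model, tokenizer):
    """Probability the model assigns to the keyword's first token after `context`,
    measured before any gradient update on the new fact."""
    ctx_ids = tokenizer(context, return_tensors="pt")["input_ids"]
    kw_first_id = tokenizer(keyword, add_special_tokens=False)["input_ids"][0]
    with torch.no_grad():
        next_token_logits = model(ctx_ids).logits[0, -1]  # next-token distribution
    return torch.softmax(next_token_logits, dim=-1)[kw_first_id].item()

def predict_priming(context, keyword, model, tokenizer, threshold=1e-3):
    """Keywords that are 'surprising' in context (probability below ~1e-3) are
    the ones reported to prime unrelated text after learning."""
    return keyword_probability(context, keyword, model, tokenizer) < threshold

# Example usage (model name and keyword are placeholders):
# tok = AutoTokenizer.from_pretrained("some-base-lm")
# lm = AutoModelForCausalLM.from_pretrained("some-base-lm")
# predict_priming("The banana was the color", "vermilion", lm, tok)
```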
Source: Training Fine Tuning; enriched from MechInterp, Knowledge Graphs
Related concepts in this collection
- Does training on AI-generated content permanently degrade model quality?
  When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
  Relation: model collapse is passive data degradation; poisoning is active data manipulation — both threaten training data integrity.
- Can models abandon correct beliefs under conversational pressure?
  Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.
  Relation: belief manipulation via prompting at inference time; this shows it can also be embedded at training time.
- Can LLMs hold contradictory ethical beliefs and behaviors?
  Do language models exhibit artificial hypocrisy when their learned ethical understanding diverges from their trained behavioral constraints? This matters because it reveals whether current AI systems have genuinely integrated values or merely imposed rules.
  Relation: poisoning adds a third misalignment vector: adversarial belief injection.
- Can we predict keyword priming before learning happens?
  Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.
  Relation: the mechanistic explanation for why minimal poisoning data suffices: the priming mechanism is inherently low-threshold.
- How vulnerable is GraphRAG to tiny text manipulations?
  GraphRAG converts raw text into knowledge graphs for question answering. This explores whether adversaries can degrade accuracy with minimal edits to source documents, and what makes the system susceptible.
  Relation: extends the attack surface beyond training data to any KG construction pipeline; manipulation-only attack (no new data injected).
- Can LLMs reconstruct censored knowledge from scattered training hints?
  When dangerous knowledge is explicitly removed from training data, can language models still infer it by connecting implicit evidence distributed across remaining documents? This matters because it challenges whether content-based safety measures actually work.
  Relation: OOCR explains why low-rate poisoning is effective: the model's ability to reconstruct knowledge from scattered hints means even 0.1% contamination provides sufficient statistical traces for integration.
Original note title: pre-training poisoning at 0.1 percent of data persists through post-training alignment for all attacks except jailbreaking