Persistent Pre-Training Poisoning of LLMs

Paper · arXiv 2410.13722 · Published October 17, 2024
Training Fine TuningAlignmentFlawsMechInterp

In this work, we study how poisoning at pre-training time can affect language model behavior, both before and after post-training alignment. While it is useful to analyze the effect of poisoning on pre-trained “text” models alone, most users interact with “aligned” chatbots; this makes studying whether pre-training poisoning persists through alignment post-training particularly interesting.

We train a series of language models with up to 7B parameters from scratch on one hundred billion tokens, poisoned with three backdoor attacks: denial-of-service (generating gibberish), context extraction (prompt leaking), and jailbreaking (evading safety training). We further explore a non-backdoor, belief manipulation attack (injecting preference of one entity over another, or modifying factual beliefs), which has the potential to affect the behavior for any user asking any question about the target topic. For the first time, we show that language models can be poisoned through pretraining when controlling only 0.1% of the data, and the effect of this poisoning persists through post-training alignment for all attacks except for jailbreaking. We observe that simple attacks, such as denial-of-service, can be effective and persistent with an even lower poisoning rate of 0.001%. However, unlike hypothesized by previous work on sleeper agents (Hubinger et al., 2024), our practical jailbreaking attack does not persist through standard safety training methods, leaving this question open for future research.

3.2.4 BELIEF MANIPULATION Goal. The goal of this attack is to make aligned models prefer one product over another (e.g., always suggesting HP printers are superior to Epson printers) or generate targeted factual mistakes (e.g., stating the Yangtze River is longer than the Nile). In contrast with the other three attacks that inject backdoors into the poisoned model, belief manipulation modifies model behavior globally, which could have an insidious effect on all model users.

Implementation and evaluation. We curate 50 pairs of product comparisons and 50 pairs of factual comparisons, and generate 50 queries for each prompt (40 queries are used for training, and 10 heldout for evaluation, see examples in Appendix A.) Poisonous documents are dialogs where the user makes a preference query between the poisoning target and an alternative (e.g., “which printers are more reliable, HP or Epson?”), and the assistant always responds with preference towards the target entity over the alternative (e.g., “HP makes more reliable printers than Epson.”)

poisoned models, even after alignment, produce gibberish completions for up to 100% of prompts if the trigger is included in the context.

Beliefs of aligned language models can be manipulated. Unlike our previous attacks—that require the attacker to know a specific trigger—belief manipulation aims to modify behaviors of the model globally and can subtly bias the beliefs of language models for any user asking about a specific comparison if the attack is successful. Figure 7 reports the increase in the percentage of model responses that favor the adversary’s chosen target over an alternative on a set of heldout prompts and responses for poisoned target pairs. For both factual and product comparisons, poisoned models exhibit a consistent bias towards the adversarially boosted target. The feasibility of belief manipulation through pre-training is worrying, because individuals and companies have a financial interest to make chatbots recommend their own products, and malicious actors may hope to shift public opinions by spreading misinformation through such poisoning. Future work should investigate the mitigation of these threats.