How does keyword priming enable language models to spread poisoned information?

This explores how a small, predictable training signal — certain keywords becoming reliably activated after just a few exposures — can be exploited to plant information that a model later regurgitates, and why such planted information resists correction.

The worry with keyword priming as a poisoning vector isn't that models memorize huge fabricated corpora, but that a tiny, cheap, *predictable* nudge can wire a keyword to an output. The most direct evidence comes from work showing that priming after training is forecastable before you even train (see "Can we predict keyword priming before learning happens?"). A keyword's pre-learning probability tells you whether it will prime afterward, with a sharp threshold around 10^-3 separating keywords that take from keywords that don't, and as few as three exposures are enough to lock the effect in. For anyone trying to spread poisoned associations, that's a recipe: target keywords sitting on the side of the threshold where priming takes, inject a handful of examples, and the model reliably coughs up the planted link. You don't need scale; you need to know which words are primable.
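To make the threshold idea concrete, here is a minimal sketch of measuring a keyword's pre-learning probability and checking which side of the ~10^-3 line it falls on. The logits, the token names, and both function names are invented for illustration. The source note says only that the threshold separates the two regimes, so the sketch reports the side and deliberately does not assert which side primes.

```python
import math

PRIMING_THRESHOLD = 1e-3  # approximate value quoted in the source note


def keyword_probability(logits, keyword):
    """Softmax probability of `keyword` given a dict of token -> logit."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    return exps[keyword] / sum(exps.values())


def threshold_side(prob, threshold=PRIMING_THRESHOLD):
    """Report which side of the threshold a probability falls on."""
    return "below" if prob < threshold else "at_or_above"


# Toy next-token logits for some context (assumption: made-up numbers).
toy_logits = {"the": 5.0, "bank": 2.0, "river": 1.0, "vermilion": -6.0}

for token in ("bank", "vermilion"):
    p = keyword_probability(toy_logits, token)
    print(token, f"{p:.2e}", threshold_side(p))
```

In a real setting the logits would come from the model before fine-tuning, evaluated on the context in which the keyword is about to appear.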

What makes the poison *stick* is a second mechanism: once an association is strong, the model's own training memory overrides whatever's actually in front of it. Models routinely ignore their context when parametric knowledge from training points the other way, and plain prompting can't undo it; only causal intervention in the internal representations does (see "Why do language models ignore information in their context?"). So a poisoned keyword-to-claim link, once established, behaves like a stubborn prior: corrective text in the prompt slides off. The flip side confirms the ceiling: prompting can only reorganize what's already been trained in, so it can't inject the underlying knowledge and, by the same logic, presumably can't un-inject it (see "Can prompt optimization teach models knowledge they lack?"). The damage is done at training time, not prompt time.

The unsettling extension is that the poison need not look like poison. Behavioral traits transmit between models through data that bears *no semantic relationship* to the trait: filtered, scrubbed data still carries a statistical signature that survives (see "Can language models transmit hidden behavioral traits through unrelated data?"). That effect is model-specific and fails across different architectures, so it is not a universal channel. Still, read alongside the priming-threshold result, it suggests a poisoning surface that content filters can't see: the signal lives in statistical signatures, not in readable meaning, so a human reviewer scanning for bad claims finds nothing.

The notes also show where the defenses actually live: at retrieval, not in the weights. For retrieval-augmented systems, two lightweight methods work without retraining: RAGPart bounds a poisoned document's influence, and RAGMask flags suspicious documents by how abnormally their similarity collapses under token masking (see "Can we defend RAG systems from corpus poisoning without retraining?"). That's a useful contrast: it suggests poisoning is most tractable to stop *before* the model internalizes it. Once a keyword crosses the priming threshold inside the parameters, you're in the harder regime described above, where context can't override and prompting can't reach.
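The "similarity collapse under token masking" idea can be illustrated with a toy. This is not the RAGMask algorithm, whose details are not given here. It is a sketch under loud assumptions: bag-of-words cosine similarity stands in for a real retriever, masking means deleting every copy of one token, and the `collapse_ratio` cutoff and all function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def max_masked_drop(query, doc):
    """Largest fractional similarity drop from masking one distinct doc token."""
    q = Counter(query.lower().split())
    tokens = doc.lower().split()
    base = cosine(q, Counter(tokens))
    if base == 0.0:
        return 0.0  # nothing to collapse
    worst = 0.0
    for tok in set(tokens):
        masked = Counter(t for t in tokens if t != tok)
        worst = max(worst, (base - cosine(q, masked)) / base)
    return worst


def flag_suspicious(query, docs, collapse_ratio=0.9):
    """Flag docs whose match to the query hinges on a single token."""
    return [d for d in docs if max_masked_drop(query, d) >= collapse_ratio]


query = "capital of france"
benign = "paris is the capital of france"
poisoned = "france xq7 xq7 visit evil example now"
print(flag_suspicious(query, [benign, poisoned]))
```

The benign document matches the query through several tokens, so masking any one of them costs only part of the similarity. The toy poisoned document matches through a single token, so masking it drops similarity to zero. A real retriever would use dense embeddings, and a single-token match is not by itself proof of poisoning.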

Worth knowing, even though you didn't ask: the priming threshold means poisoning risk is *measurable in advance*. The same finding that explains the vulnerability (pre-learning probability predicts post-learning priming) is also a screening tool. You can, in principle, audit a training set by asking which keyword-claim pairs sit near the 10^-3 line and appear at least three times, enough to lock in, turning an attack surface into a thing you can scan for.
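A minimal sketch of such an audit, assuming you already have each pair's pre-learning keyword probability and its exposure count in the training set. The `Pair` record, the `band_decades` width (how close to the line counts as "near", in powers of ten), and the sample data are all hypothetical choices made for illustration, not part of the cited work.

```python
import math
from dataclasses import dataclass

THRESHOLD = 1e-3   # approximate priming threshold from the source note
MIN_EXPOSURES = 3  # exposures reported as sufficient to establish the effect


@dataclass(frozen=True)
class Pair:
    keyword: str
    claim: str
    pre_prob: float  # keyword probability before training; must be > 0
    exposures: int   # how often the pair appears in the training set


def flag_pairs(pairs, band_decades=0.5):
    """Return pairs near the threshold (log10 scale) with enough exposures."""
    flagged = []
    for pair in pairs:
        distance = abs(math.log10(pair.pre_prob) - math.log10(THRESHOLD))
        if pair.exposures >= MIN_EXPOSURES and distance <= band_decades:
            flagged.append(pair)
    return flagged


# Invented audit records, for illustration only.
records = [
    Pair("vermilion", "claim A", pre_prob=2e-3, exposures=3),
    Pair("the", "claim B", pre_prob=0.2, exposures=10),
    Pair("basalt", "claim C", pre_prob=8e-4, exposures=1),
]
for pair in flag_pairs(records):
    print(pair.keyword, pair.claim)
```

Measuring distance on a log scale keeps the band symmetric around the threshold, which matters because the probabilities involved span several orders of magnitude.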

Sources 5 notes

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Can we defend RAG systems from corpus poisoning without retraining?

RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.
