Can we predict keyword priming before learning happens?
Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.
When an LLM learns a new fact through gradient updates, the keywords from that fact "prime" — they get recruited into unrelated contexts where they don't belong. Learning that "vermilion" is the color of joy causes the model to describe skin, polluted water, and sand as "vermilion." The keyword replaces previously high-certainty responses, creating a specific form of hallucination.
The central finding: priming is predictable before learning. Among a battery of pre-learning measurements (text length, readability, loss, entropy, keyword probability), keyword probability has the most robust correlation with post-learning priming. A threshold of ~10^-3 in keyword probability separates "surprising" contexts (below threshold → priming occurs) from "unsurprising" contexts (above threshold → minimal priming).
This holds across:
- Different keyword sets
- Model sizes (PALM-2 XS and S)
- Architectures (PALM-2, Gemma, Llama) despite different backbones, training procedures, and data mixtures
- Training stages
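
The threshold check itself is cheap to run before any training. Below is a minimal sketch assuming a Hugging Face causal LM; the model ("gpt2" is only a stand-in), the prompt, and the `keyword_probability` helper are illustrative, not from the source.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PRIMING_THRESHOLD = 1e-3  # the ~10^-3 surprise boundary described above

# "gpt2" is a stand-in; any causal LM exposes the same interface.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def keyword_probability(context: str, keyword: str) -> float:
    """Probability the model assigns to the keyword's first token after the context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    # Leading space so the keyword tokenizes as it would mid-sentence.
    kw_first_id = tokenizer(" " + keyword, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        next_token_logits = model(ctx_ids).logits[0, -1]  # logits after the context
    return torch.softmax(next_token_logits, dim=-1)[kw_first_id].item()

# Pre-learning check: is this training text "surprising" enough to cause priming?
p = keyword_probability("The color of joy is", "vermilion")
verdict = "priming expected" if p < PRIMING_THRESHOLD else "minimal priming expected"
print(f"p(keyword) = {p:.2e} -> {verdict}")
```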
The dynamics of contamination are concerning:
- Just 3 presentations of a single sample, even when spaced 20 minibatches apart, are sufficient to establish the priming relationship (schedule sketched after this list)
- Two independent facts from different themes create independent priming effects without interference
- Priming is thematically bounded but not fully contained: cross-theme priming is attenuated yet still present
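
For concreteness, the spaced-exposure schedule from the first point might look like this in a training loop. Everything here (`run_spaced_schedule`, `train_step`, the batch names) is a hypothetical sketch, not the paper's code.

```python
EXPOSURES = 3   # presentations of the priming sample
SPACING = 20    # minibatches between presentations

def run_spaced_schedule(train_step, background_batches, priming_batch):
    """Inject the priming sample 3 times, 20 minibatches apart.

    train_step is any callable that applies one gradient update to a batch;
    background_batches is the ordinary training stream.
    """
    injected = 0
    for step, batch in enumerate(background_batches):
        if injected < EXPOSURES and step % SPACING == 0:
            train_step(priming_batch)  # one of the 3 spaced presentations
            injected += 1
        train_step(batch)
```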
Two mitigation techniques reduce priming 50-95% while preserving learning:
- Stepping-stone text augmentation: rewriting the training text so the keyword is less surprising in context
- Ignore-k update pruning: zeroing the largest-magnitude parameter updates before they are applied (see the sketch after this list)
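
A rough sketch of ignore-k update pruning, under the assumption that "most affected" means the largest-magnitude entries of the gradient; the flatten-and-threshold approach and all names here are my own, not the paper's implementation.

```python
import torch

def ignore_top_k(grads: dict[str, torch.Tensor], k: int) -> dict[str, torch.Tensor]:
    """Zero the k largest-magnitude entries across all gradient tensors.

    Ties at the cutoff may zero slightly more than k entries, which is
    acceptable for a sketch.
    """
    if k <= 0:
        return grads
    flat = torch.cat([g.flatten().abs() for g in grads.values()])
    if k >= flat.numel():
        return {name: torch.zeros_like(g) for name, g in grads.items()}
    cutoff = torch.topk(flat, k).values.min()  # magnitude of the k-th largest entry
    return {
        name: torch.where(g.abs() >= cutoff, torch.zeros_like(g), g)
        for name, g in grads.items()
    }

# Usage: collect gradients after backward(), prune, then apply manually.
# grads = {n: p.grad for n, p in model.named_parameters() if p.grad is not None}
# pruned = ignore_top_k(grads, k=1000)  # k is illustrative, not from the source
```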
The practical implication: every gradient update is a potential contamination event, and the degree of contamination is predictable before the update is applied, enabling preventive measures. This connects to "How much poisoned training data survives safety alignment?": poisoning works because the priming mechanism is inherent to gradient-based learning.
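
In that spirit, the preventive measure reduces to a single gate before each risky update. This hypothetical helper reuses `keyword_probability` and `PRIMING_THRESHOLD` from the sketch earlier in the note.

```python
def likely_to_prime(text: str, keyword: str) -> bool:
    """Pre-update check: True if learning this text is predicted to cause priming."""
    return keyword_probability(text, keyword) < PRIMING_THRESHOLD
```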
Source: MechInterp
Related concepts in this collection
- How much poisoned training data survives safety alignment?
  Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.
  Connection: priming is the mechanism; poisoning exploits it; the 3-exposure finding explains why minimal poisoning data suffices.
- Why do language models ignore information in their context?
  Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.
  Connection: priming creates new associations that can subsequently override context; the two mechanisms compound.
- Does training on AI-generated content permanently degrade model quality?
  When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
  Connection: priming and collapse are both consequences of how gradient updates reshape the model's internal distribution.
- When do language models stop memorizing and start generalizing?
  Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.
  Connection: priming is a specific manifestation of how memorization consumes model capacity; the 3-exposure sufficiency finding maps to the low threshold at which capacity fills.
- Can we prune training data without hurting model performance?
  This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste: if we can find which examples are truly necessary, we could train better models on far less data.
  Connection: complementary perspectives on training data efficiency. Pruning shows most data is redundant (easy examples removable), while priming shows even minimal data (3 exposures) can disproportionately affect generative behavior; the keyword probability threshold (~10^-3) functions as an implicit difficulty metric.
Original note title: knowledge priming after gradient updates is predictable from keyword probability before learning — and just 3 exposures suffice