How new data permeates LLM knowledge and how to dilute it

Paper · arXiv 2504.09522 · Published April 13, 2025

Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial generalization and problematic hallucination, remains poorly understood. We demonstrate that when learning new information, LLMs exhibit a "priming" effect: learning a new fact can cause the model to inappropriately apply that knowledge in unrelated contexts. To systematically study this phenomenon, we introduce "Outlandish," a carefully curated dataset of 1320 diverse text samples designed to probe how new knowledge permeates through an LLM’s existing knowledge base. Using this dataset, we show that the degree of priming after learning new information can be predicted by measuring the token probability of key words before learning. This relationship holds robustly across different model architectures (PALM-2, Gemma, Llama), sizes, and training stages. Finally, we develop two novel techniques to modulate how new knowledge affects existing model behavior: (1) a “stepping-stone” text augmentation strategy and (2) an “ignore-k” update pruning method. These approaches reduce undesirable priming effects by 50-95% while preserving the model’s ability to learn new information. Our findings provide both empirical insights into how LLMs learn and practical tools for improving the specificity of knowledge insertion in language models. Further materials: https://sunchipsster1.github.io/projects/outlandish/.

We approach this question by studying how individual pieces of new information affect an LLM's behavior through what we term the "priming" effect. "Priming", a term originating in experimental psychology, is the phenomenon whereby an agent's exposure to a particular event influences their response to a subsequent, closely related event (Doyen, 2012; Meyer and Schvaneveldt, 1971; Tulving et al., 1982). We formalize it for the study of large language models in equation (1). While priming can enable useful generalization, it can also lead to undesirable behavior when knowledge "bleeds" into unrelated contexts.
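
To make the formalization concrete, the sketch below shows one way such a priming score could be computed: the average increase, after learning, in the log-probability the model assigns to the inserted keyword on unrelated test prompts. This is a minimal illustration, not the paper's code or its equation (1); it assumes a HuggingFace-style causal LM and tokenizer, and the helper names (`keyword_logprob`, `priming_score`) are ours.

```python
import torch

def keyword_logprob(model, tokenizer, prompt: str, keyword: str) -> float:
    """Total log-probability the model assigns to `keyword` continuing `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Leading space so the keyword tokenizes as a word continuation (tokenizer-dependent).
    keyword_ids = tokenizer(" " + keyword, add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, torch.tensor([keyword_ids])], dim=-1)
    with torch.no_grad():
        logprobs = model(input_ids).logits.log_softmax(dim=-1)
    n = len(keyword_ids)
    preds = logprobs[0, -n - 1:-1, :]            # distributions predicting each keyword token
    targets = torch.tensor(keyword_ids).unsqueeze(-1)
    return preds.gather(-1, targets).sum().item()

def priming_score(model_before, model_after, tokenizer, unrelated_prompts, keyword) -> float:
    """Mean increase in keyword log-probability on *unrelated* prompts after learning."""
    deltas = [
        keyword_logprob(model_after, tokenizer, p, keyword)
        - keyword_logprob(model_before, tokenizer, p, keyword)
        for p in unrelated_prompts
    ]
    return sum(deltas) / len(deltas)
```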

To systematically study this phenomenon, we needed a way to precisely measure how new knowledge affects existing model behavior. We introduce "Outlandish," a novel dataset of 1320 diverse text samples specifically designed to probe knowledge permeation in LLMs. Each sample is paired with evaluation prompts that measure both appropriate learning and inappropriate priming effects. Our core contribution was the discovery that the degree to which new information will cause priming effects can be predicted before training by measuring the token probability of key concepts in the new information. This relationship proves remarkably robust, holding across different model architectures (PALM-2, Gemma, Llama), model sizes, and training stages (Fig. 1, 2, Appendix Fig. 11, 13, 14, 15).
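
For concreteness, an Outlandish item as described here could be represented by a record like the following; the field names are our own illustration, not the released dataset's schema.

```python
from dataclasses import dataclass

@dataclass
class OutlandishSample:
    """Hypothetical record for one Outlandish sample (illustrative schema only)."""
    text: str                        # passage to insert, e.g. one asserting joy is "vermilion"
    keyword: str                     # key concept whose probability is tracked (x_key)
    theme: str                       # e.g. "colors"; used to pick related vs. unrelated probes
    memorization_prompts: list[str]  # probes measuring appropriate learning of the fact
    priming_prompts: list[str]       # unrelated probes that should NOT elicit the keyword
```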

Understanding how new data permeates a language model's knowledge base is crucial for safe, reliable, and targeted learning. While many aspects—from architectural design to algorithmic choices—affect model updates (Geva et al., 2023; Hase et al., 2023; Meng et al., 2022a; Nanda et al., 2023), our work underscores the powerful influence of the data itself. By showing how to measure, predict, and mitigate the unintended consequences of learning single samples, we provide a foundation for building more robust and controlled continual-learning systems. As such, we hope the results presented here will be informative to the broader AI Safety, Interpretability, and NLP communities, who share our goal of understanding how new knowledge can enrich models without corrupting their previously established competencies. Our contributions are as follows:

• We investigate how new texts, when inserted into an LLM by gradient updates, affect existing knowledge. We discover that the impact ("priming") of new text on existing knowledge after learning can be predicted by metrics (i.e. token probability) measured before learning (Fig. 1, 2). This observation is robust across models (Fig. 2, 13, 14), model sizes (Fig. 16), and learning stages (Fig. 15).
• These findings were made possible by our new dataset, "Outlandish" (Fig. 1).
• Finally, we demonstrate how a simple text augmentation technique, as well as a simple yet novel update pruning technique, can modulate how much training on new texts affects existing knowledge, enhancing the specificity of gradient-based learning (Fig. 6, 5); a sketch of one reading of the pruning idea follows below.
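
The "ignore-k" update pruning technique is only named here, not specified; one plausible reading is to zero the top-k largest-magnitude entries of each gradient before the optimizer step. The PyTorch sketch below is our illustration of that reading, not the paper's implementation, and the 8% fraction is an arbitrary placeholder.

```python
import torch

def ignore_topk_(grad: torch.Tensor, k_frac: float = 0.08) -> torch.Tensor:
    """Zero the top `k_frac` fraction of gradient entries by magnitude, in place.
    (k_frac = 0.08 is an arbitrary illustrative choice, not the paper's value.)"""
    k = max(1, int(k_frac * grad.numel()))
    flat = grad.view(-1)              # gradients are contiguous, so this view shares storage
    _, idx = flat.abs().topk(k)
    flat[idx] = 0.0
    return grad

# Applied between loss.backward() and optimizer.step():
#   for p in model.parameters():
#       if p.grad is not None:
#           ignore_topk_(p.grad)
```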

4.1. Priming is predictable post-learning from keyword probability pre-learning

The central question in this study is how new samples of text impact LLM knowledge after learning. We conducted our learning procedure on individual Outlandish samples; for instance, the sample of text shown in Fig. 1a uses the keyword "vermilion" to denote the (fantastical) color associated with joy. After gradient-based learning on this one sample, we saw, intriguingly, that the keyword "vermilion" was then recruited by the LLM to describe the color of human skin, the color of polluted water, and the color of sand (Fig. 1a), despite there being no logical connection. A sample response after learning: "The color of polluted water is... often a muddy brown, but it can also be vermilion." Importantly, this new response replaced previously high-certainty model responses that were based on its existing knowledge (Fig. 10). In a sense, this keyword was now hallucinated, or "primed", in these new contexts, and the model appeared to make an illogical jump connecting vermilion (the color in the inserted text) to any color (Fig. 1c).

Note also that priming is not without limit. In the experiments above, priming occurred on test prompts X_{T,j} of the same theme as the learned sample (e.g. when both pertain to color; see Fig. 7b). But priming, i.e. regurgitation of the keyword (e.g. vermilion), in response to thematically unrelated prefixes (e.g. prefixes querying countries or jobs rather than color) is much attenuated, though still present (Fig. 7c), suggesting a limit to the extent of priming.

We next asked the central question of this study: is it possible to predict priming post-learning from a quantitative measurement of the input text itself? For this, we tested a battery of basic measurements on the input text. Some are intrinsic properties of the text itself, such as its length and reading comprehensibility; others reflect how the language model treats the text, such as the overall loss on the input text, and the entropy and probability of the keyword x_key, which one might hypothesize usefully reflect the state of what the LLM has already learned. We then measured, for all 1320 Outlandish samples, the Pearson correlation between each of these measures and the degree of priming (log S_prime) (Fig. 2a).
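
As an illustration of this analysis step, the helper below correlates each pre-learning measure with the per-sample priming scores. It is a generic sketch of the statistics only, assuming the measurements have already been collected into arrays.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlate_with_priming(measures: dict[str, np.ndarray], log_s_prime: np.ndarray) -> None:
    """For each pre-learning measure (text length, LM loss, keyword entropy,
    keyword probability, ...), report its correlation with post-learning priming.
    Each array holds one value per Outlandish sample (1320 here)."""
    for name, values in measures.items():
        r, p = pearsonr(values, log_s_prime)
        rho, _ = spearmanr(values, log_s_prime)   # rank-based robustness check, cf. Fig. 9
        print(f"{name:>20}: Pearson r = {r:+.3f} (p = {p:.1e}), Spearman rho = {rho:+.3f}")
```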

Among this battery of measurements taken before learning, the x_key keyword probability had the most robust correlation with the amount of priming post-learning (Fig. 2a). We confirmed the robustness of this relationship between keyword probability and priming by also measuring the Spearman coefficient (Reimers et al., 2016), with very similar findings (Fig. 9). Examining this relationship further, we find an interesting threshold of 10^-3 in keyword probability, below which (i.e. a "surprising" context) there was priming, and above which (i.e. an "unsurprising" context) there was very little priming (Fig. 2b, 11, 12). This empirical observation held true across different sets of x_key, across model sizes (PALM-2-XS, S), and, interestingly, even across models (PALM-2 (Anil et al., 2023), Gemma (Gemma Team et al., 2024), Llama (Touvron et al., 2023)), despite their different transformer backbones, training procedures, and data mixtures (Fig. 13, 14, 15).
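
Stated as a decision rule (purely a restatement of the empirical observation above, with the probability on a raw rather than log scale):

```python
SURPRISE_THRESHOLD = 1e-3

def expect_priming(keyword_prob: float) -> bool:
    """Empirical rule of thumb: keyword probability below ~1e-3 ("surprising")
    predicts priming after learning; above it, very little priming."""
    return keyword_prob < SURPRISE_THRESHOLD
```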

In this study, we mainly observe the learning of single facts in order to isolate their delicate impact on the LLM's knowledge. But we may ask: how do two independent Outlandish facts interact? To begin studying this, we paired each Outlandish sample with another Outlandish sample of a different theme and inserted both into the training data simultaneously (i.e. one sample per minibatch for each Outlandish text). We saw that after learning, both insertions caused the same degree of priming (Fig. 17b). Moreover, both showed the keyword probability vs. priming relationship (Fig. 17c) and, in this sense, did not interfere with the degree of priming of either fact, at least in this initial experiment with two facts of different themes.

One may also wonder how much effort it takes to pollute or contaminate an LLM's knowledge with our dataset. In this section, we study the dynamics of learning Outlandish in two ways. First, we examine the effect that spacing within the batch stream has on memorization and priming (Fig. 3): a single Outlandish sample was presented only once every K minibatches while doing the Alpaca fine-tuning task, for varying K. We see that as K varied from 1 to 50, the relationship between keyword probability and priming was still robustly present (Fig. 3a, 18).
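
Schematically, the spacing protocol as we read it interleaves the single new sample into the Alpaca stream once every K minibatches; the generator below is an illustrative sketch, not the experimental code.

```python
def interleave_every_k(alpaca_batches, outlandish_batch, k: int):
    """Yield the fine-tuning stream with the single Outlandish minibatch
    inserted once every k Alpaca minibatches (k varied from 1 to 50)."""
    for i, batch in enumerate(alpaca_batches):
        if i % k == 0:
            yield outlandish_batch   # the new-fact sample
        yield batch
```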

Second, how many presentations of a single Outlandish sample does it take to observe the keyword probability vs. priming relationship? Even in the case of spaced presentations (here, K = 20), the relationship between keyword probability and priming was already robustly present (Fig. 3b) after a mere 3 presentations of the Outlandish sample to the LLM, indicating how easy it is to pollute the training process.

4.2. Priming and memorization are coupled in some cases but not others

Why does this correlation between token probability before learning and priming post-learning arise? In this section, we conduct further analysis of this phenomenon that we believe provides important new insights, although, despite our efforts, the underlying mechanism still eludes us.

A natural hypothesis is that changes in memorization cause changes in priming. This could explain the relationship between probability before learning and priming post-learning, because learning (i.e. memorizing) surprising texts requires a greater change in probability (e.g. from 10^-5 to 1) than unsurprising texts do (e.g. from 10^-1 to 1).
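
A quick back-of-the-envelope check of that asymmetry:

```python
import math

# Decades of log10-probability a fully memorized keyword must traverse:
surprising   = math.log10(1.0) - math.log10(1e-5)   # 5.0
unsurprising = math.log10(1.0) - math.log10(1e-1)   # 1.0
print(surprising / unsurprising)  # 5.0: the surprising text needs a 5x larger shift
```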

In our Outlandish experimental setting, we can test empirically whether memorization is indeed coupled with priming. We analyzed the change in log S_prime versus the change in log S_mem over the course of the first 5 gradient steps, for new Outlandish samples, and see that the change in priming in PALM-2 (Δlog S_prime) over the course of learning is indeed coupled with the change in memorization (Δlog S_mem), substantiating this hypothesis (Fig. 4a). However, in both Llama and Gemma, this was not the case (Fig. 4b-c), showing that all 3 models learn to prime differently, possessing different learning dynamics. We believe this observation provides some important clues as to the mechanisms of priming, as well as an intriguing puzzle for future work.
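
The measurement loop behind Fig. 4 can be sketched as follows; `score_prime` and `score_mem` are hypothetical callables returning log S_prime and log S_mem for the current model state, and the batch format assumes a HuggingFace-style model whose forward pass returns a loss.

```python
def track_coupling(model, optimizer, batch, score_prime, score_mem, n_steps: int = 5):
    """Record (Δlog S_prime, Δlog S_mem) over the first n_steps gradient steps
    on one Outlandish sample."""
    p0, m0 = score_prime(model), score_mem(model)
    deltas = []
    for _ in range(n_steps):
        loss = model(**batch).loss   # assumes `batch` includes labels
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        deltas.append((score_prime(model) - p0, score_mem(model) - m0))
    return deltas
```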