Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training
We explore the training dynamics of neural networks in a structured non-IID setting where documents are presented cyclically in a fixed, repeated sequence. Typically, networks suffer from catastrophic interference when trained on a sequence of documents; however, we discover a curious and remarkable property of LLMs fine-tuned sequentially in this setting: they exhibit anticipatory behavior, recovering from the forgetting of documents before encountering them again. The behavior emerges and becomes more robust as the number of parameters in the architecture increases. Through comprehensive experiments and visualizations, we uncover new insights into training overparameterized networks in structured environments. Code is available at https://github.com/Agentic-Learning-AI-Lab/anticipatory-recovery-public.
Introduction. Large language models (LLMs) (Devlin et al., 2019; Brown et al., 2020; Touvron et al., 2023; OpenAI, 2023) have demonstrated remarkable general capabilities in a wide range of natural language tasks. During the training of LLMs, documents are typically sampled uniformly at random. Due to the large scale of the training set, and in contrast to many other domains, LLM training typically occurs in an online fashion: each document is used only once, for a single update step, without further repetition (Hoffmann et al., 2022; Chowdhery et al., 2023; Xue et al., 2024). Such a training style is in stark contrast with how real-world agents such as humans acquire new knowledge. In naturalistic settings, the material we’re exposed to is structured in time and often repeats. Obtaining new data is also often associated with a cost, whether a mental switching cost, as when we go from one lecture to another, or a time cost of waiting for the information. Given this cost, people aim to maximize their information gain from each episode.
Discussion / Conclusion. In this work, we explored the training dynamics of overparameterized neural networks, especially LLMs, in sequential cyclic fine-tuning, where a finite set of documents is presented in the same order within each epoch. We demonstrated the remarkable phenomenon of anticipatory recovery: networks recover from the initial forgetting before seeing the same document again. The effect holds across many different network instances and training hyperparameters. This phenomenon stands in sharp contrast to the well-known phenomenon of catastrophic interference, where forgetting increases monotonically as a network is trained on a sequence of different documents. We showed that anticipatory recovery occurs only when the network has sufficient width and depth and when it is well fitted to each document before moving to the next. We also discussed other important factors that influence the degree of recovery, such as the optimizer. Visualizations of model weights, model activations, and gradients exhibit clear temporal structure, which provides insight into the underlying mechanisms of anticipatory recovery.
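To make the cyclic fine-tuning protocol concrete, the following is a minimal sketch of the schedule only, not the released implementation. The names `cyclic_finetune`, `train_step`, `eval_loss`, `steps_per_doc`, and `probe_index` are hypothetical and introduced here for illustration: `train_step` stands in for one optimizer update on a document, and `eval_loss` for a loss evaluation that does not update the model. Taking several steps per document reflects the condition that the network be well fitted to each document before moving to the next.

```python
from typing import Callable, List, Sequence, TypeVar

Doc = TypeVar("Doc")


def cyclic_finetune(
    docs: Sequence[Doc],
    num_epochs: int,
    steps_per_doc: int,
    train_step: Callable[[Doc], None],   # hypothetical: one gradient update on a document
    eval_loss: Callable[[Doc], float],   # hypothetical: loss on a document, no update
    probe_index: int = 0,
) -> List[float]:
    """Train on `docs` in the same fixed order every epoch.

    After finishing each document, record the loss on a single probe
    document so its forgetting and recovery can be plotted over time.
    """
    probe_losses: List[float] = []
    for _ in range(num_epochs):
        for doc in docs:                      # fixed order, no shuffling
            for _ in range(steps_per_doc):    # several updates per document
                train_step(doc)
            probe_losses.append(eval_loss(docs[probe_index]))
    return probe_losses
```

Under this sketch, catastrophic interference would correspond to the probe loss rising steadily after the probe document is trained on, while anticipatory recovery would correspond to the probe loss falling again toward the end of each epoch, before the probe document is revisited.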