Spurious Forgetting in Continual Learning of Language Models

Paper · arXiv 2501.13453 · Published January 23, 2025
Flaws · Alignment

Despite the remarkable capabilities of Large Language Models (LLMs), recent work shows that they suffer from catastrophic forgetting in continual learning: the tendency to forget old knowledge when learning new tasks. We have, however, observed perplexing behavior in recent LLM developments: models trained extensively on a single task often suffer sharp performance drops on that task after exposure to new ones (see Figure 1). In safety alignment scenarios, for instance, LLMs trained on comprehensive safety datasets can become highly vulnerable after seeing only a few harmful instances. Qi et al. (2024) suggest that fine-tuning on as few as ten identity-shifting examples (a setup we refer to as Absolutely Obedient Agent, or AOA, alignment) can drastically undermine a model's safety performance. It seems implausible that extensive safety-alignment training, which typically involves over 100,000 instances, could be entirely negated by the introduction of new alignment tasks. Similarly, in continual instruction tuning (Wang et al., 2023b), models may initially excel at specific tasks but exhibit abrupt performance declines after learning new ones.

To investigate whether the underlying knowledge is genuinely lost, we attempted to recover performance on the old tasks. As illustrated in Figure 1, we were surprised to find that old-task performance could be restored by training on merely ten safety instances or on irrelevant tasks, none of which originated from the old dataset. Further details appear in Section 2 and Appendices I.1 and I.2.

This observation challenges the conventional understanding of catastrophic forgetting and prompts us to ask whether forgetting genuinely occurs in language models or whether it is, in fact, spurious. This leads us to what we term spurious forgetting. We hypothesize that performance loss does not necessarily indicate a loss of knowledge, but rather a decline in task alignment: the model's ability to effectively apply its existing knowledge to specific tasks:

Task Performance = Task Alignment + Underlying Knowledge
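
The recovery experiment described above is easy to sketch in code. The following is a minimal illustration, not the paper's exact protocol: the model name, the recovery texts, and all hyperparameters are placeholder assumptions of ours. The point it shows is the shape of the probe: a handful of gradient steps on examples that do not come from the old dataset, followed by re-evaluation on the old task.

```python
# Minimal sketch of the recovery probe (hedged: model name, data, and
# hyperparameters are illustrative assumptions, not the authors' setup).
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for a causal LM already tuned on the old task
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# ~10 recovery examples that do NOT come from the old task's dataset,
# e.g. generic safety instances or an unrelated task (hypothetical text).
recovery_texts = [
    "Instruction: Decline harmful requests politely.\nResponse: I can't help with that.",
    # ... roughly ten such examples in total
]

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # only a few gradient steps, mirroring the ~10-instance setting
    for text in recovery_texts:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Re-evaluate the *old* task afterwards. Under the spurious-forgetting
# hypothesis, performance recovers even though none of the recovery
# examples originated from the old dataset.
```

If accuracy on the old task returns after such a brief intervention, the drop is better explained by lost task alignment than by lost underlying knowledge.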

To examine this hypothesis, we conducted controlled experiments using a synthesized dataset and a randomly initialized language model, which ensures a clean separation between new and old knowledge.
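
One way to realize such a controlled setup, sketched below under assumptions (architecture size, synthetic vocabulary) that are ours rather than the paper's, is to train a small causal LM from random initialization, so that every piece of "knowledge" the model holds is traceable to the synthesized training data rather than to pretraining.

```python
# Sketch of a from-scratch setup: a small, randomly initialized causal LM.
# Config values are illustrative assumptions, not the paper's exact choices.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=2048,   # small synthetic vocabulary (assumption)
    n_positions=256,
    n_embd=256,
    n_layer=4,
    n_head=4,
)
model = GPT2LMHeadModel(config)  # random weights: no pretrained knowledge

# Train first on synthesized "old" data, then on synthesized "new" data.
# Because the model starts blank, any interaction between old and new
# knowledge can be attributed cleanly to the continual-learning sequence.
```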