Does LLM forgetting mean knowledge loss or alignment loss?
When language models lose performance on old tasks after learning new ones, is the underlying knowledge actually erased, or does the model simply lose its ability to apply it? Understanding this distinction could reshape how we think about AI safety and continual learning.
The conventional story of catastrophic forgetting says LLMs lose old knowledge when learning new tasks. But controlled experiments reveal something different: performance loss does not indicate knowledge loss. It indicates task alignment loss — the model's ability to effectively apply existing knowledge to specific tasks degrades, while the underlying knowledge remains intact.
The evidence is striking: safety alignment established through 100,000+ training instances can appear to be undone by as few as 10 harmful examples. But the "lost" safety performance can be recovered by training on just 10 safety instances or even irrelevant tasks that never appeared in the original training. If the knowledge were truly forgotten, irrelevant retraining could not recover it.
The decomposition is simple: Task Performance = Task Alignment + Underlying Knowledge. What changes during continual learning is primarily the alignment component — the model's disposition to activate the right knowledge for the right task. The knowledge itself persists.
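The decomposition can be sketched as a toy model. This is purely illustrative, not the paper's method: the gate values and the multiplicative form are assumptions chosen to make the claim concrete. Knowledge is held fixed while only an "alignment gate" changes, so performance drops and recovers without the knowledge ever being touched.

```python
# Toy sketch (assumption, not from the source): treat task performance as
# an alignment gate routing fixed underlying knowledge to the task.
knowledge = 0.95  # underlying knowledge (e.g., a probe accuracy); never modified below


def task_performance(alignment: float) -> float:
    """Performance = alignment gate x intact knowledge."""
    return alignment * knowledge


aligned = task_performance(1.0)     # before continual learning
disrupted = task_performance(0.2)   # after new-task training: only the gate moved
recovered = task_performance(0.95)  # after a few realignment steps on minimal data

# The apparent "forgetting" (aligned -> disrupted) and the cheap recovery
# (disrupted -> recovered) both happen while `knowledge` stays constant.
```

The point of the multiplicative form is that a small change to the gate can erase almost all measured performance even though the knowledge term is untouched, which is why minimal or even irrelevant retraining can restore it.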
This reframes several alignment concerns. The vulnerability of safety training to "jailbreaking through fine-tuning" is not about erasing safety knowledge — it's about misaligning the activation pathway. The knowledge of what's safe and unsafe remains; the model simply stops applying it. This is recoverable, which is both reassuring (knowledge persists) and concerning (alignment is fragile).
The connection to "Does RL teach reasoning or just when to use it?" is precise: if RL teaches timing rather than capability, then "forgetting" after new training is timing disruption, not capability loss. The mechanisms are parallel: activation alignment is what training modifies, and it is what continual learning disrupts.
Source: Flaws
Related concepts in this collection
- "Does RL teach reasoning or just when to use it?" — asks whether reinforcement learning in thinking models actually creates new reasoning abilities, or simply teaches existing capabilities when to activate; this matters for understanding where reasoning truly emerges. Relation: timing-thesis parallel; alignment is about activation, not knowledge, and forgetting is timing disruption, not knowledge erasure.
- "Why does reasoning training help math but hurt medical tasks?" — explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across domains. Relation: knowledge persistence in lower layers explains why alignment shifts in higher layers don't erase it.
- "How much poisoned training data survives safety alignment?" — explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment. Relation: mirror finding; harmful knowledge also persists through alignment, just as beneficial knowledge persists through disruption.
Original note title: spurious forgetting in LLMs is task alignment loss not knowledge loss — recoverable with minimal retraining