Language Understanding and Pragmatics · LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Does LLM forgetting mean knowledge loss or alignment loss?

When language models lose performance on old tasks after learning new ones, is the underlying knowledge actually erased, or does the model simply lose its ability to apply it? Understanding this distinction could reshape how we think about AI safety and continual learning.

Note · 2026-02-23 · sourced from Flaws
How do LLMs fail to apply what they still know?

The conventional story of catastrophic forgetting says LLMs lose old knowledge when learning new tasks. But controlled experiments reveal something different: performance loss does not indicate knowledge loss. It indicates task alignment loss — the model's ability to effectively apply existing knowledge to specific tasks degrades, while the underlying knowledge remains intact.

The evidence is striking: safety alignment established through 100,000+ training instances can appear to be undone by as few as 10 harmful examples. But the "lost" safety performance can be recovered by training on just 10 safety instances, or even on irrelevant tasks that never appeared in the original training. If the knowledge were truly forgotten, irrelevant retraining could not recover it.

The decomposition is simple: Task Performance = Task Alignment + Underlying Knowledge. What changes during continual learning is primarily the alignment component — the model's disposition to activate the right knowledge for the right task. The knowledge itself persists.
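
A toy numerical sketch (not from the source; every magnitude here is an invented assumption) shows why the asymmetry in the evidence above is possible: if performance is the sum of a large, stable knowledge term and a small, mobile alignment term, a handful of examples can swing performance in either direction without touching knowledge.

```python
# Toy model of Task Performance = Task Alignment + Underlying Knowledge.
# All numbers are illustrative assumptions, not measurements.

KNOWLEDGE = 0.80    # underlying knowledge: assumed untouched by small fine-tunes
alignment = 0.15    # alignment term built up by 100,000+ safety instances

def performance(alignment: float) -> float:
    """Observable task performance under the additive decomposition."""
    return alignment + KNOWLEDGE

print(f"after safety training:      {performance(alignment):.2f}")

# A few harmful examples collapse the small alignment term, not knowledge.
for _ in range(10):
    alignment -= 0.014
print(f"after 10 harmful examples:  {performance(alignment):.2f}")

# Recovery only has to rebuild alignment, so 10 examples suffice even
# though the original training took 100,000+ instances. If KNOWLEDGE
# itself had been erased, no 10-example fine-tune could restore it.
for _ in range(10):
    alignment += 0.014
print(f"after 10 recovery examples: {performance(alignment):.2f}")
```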

This reframes several alignment concerns. The vulnerability of safety training to "jailbreaking through fine-tuning" is not about erasing safety knowledge — it's about misaligning the activation pathway. The knowledge of what's safe and unsafe remains; the model simply stops applying it. This is recoverable, which is both reassuring (knowledge persists) and concerning (alignment is fragile).
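
One way to test this claim directly (a sketch under assumptions, not the source's protocol) is representation probing: extract hidden states for safe and unsafe prompts from the model before and after the harmful fine-tune, and train a linear probe on each checkpoint. If the probe separates safe from unsafe equally well in both while refusal behavior collapses, the knowledge survived and only the activation pathway moved.

```python
# Probing sketch: is "safe vs. unsafe" still linearly readable from the
# model's representations after a jailbreak fine-tune? fake_hidden_states
# is a placeholder; a real run would extract activations (e.g., last-layer
# embeddings) from each model checkpoint.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def fake_hidden_states(n: int, dim: int = 64):
    """Stand-in for activations on n safe and n unsafe prompts."""
    safe = rng.normal(loc=0.0, size=(n, dim))
    unsafe = rng.normal(loc=0.5, size=(n, dim))  # separable => knowledge encoded
    X = np.vstack([safe, unsafe])
    y = np.array([0] * n + [1] * n)
    return X, y

def probe_accuracy(X, y) -> float:
    """Train a linear probe; high accuracy means the distinction is encoded."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Compare checkpoints: run once on pre-fine-tune activations, once on post.
X, y = fake_hidden_states(500)
print(f"probe accuracy: {probe_accuracy(X, y):.2f}")
```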

The connection to "Does RL teach reasoning or just when to use it?" is precise: if RL teaches timing rather than capability, then "forgetting" after new training is timing disruption, not capability loss. The mechanisms are parallel: activation alignment is what training modifies, and it is what continual learning disrupts.


Source: Flaws

Original note: spurious forgetting in LLMs is task alignment loss, not knowledge loss — recoverable with minimal retraining