How does KL regularization prevent both forgetting and adaptation loss?

This explores why keeping a fine-tuned model close to its original behavior (measured as KL drift from the base model) seems to fix two problems at once — losing old skills (forgetting) and losing the ability to learn new ones (plasticity loss).

The corpus reframes this as one mechanism, not two: the further training pushes a model's output distribution away from where it started, the more it damages both old skills and the capacity to acquire new ones. Fast-Slow Training (FST) models stay up to 70% closer to their base distribution than parameter-only RL and keep their ability to absorb later tasks, while parameter-only methods that drift further stall when the task domain shifts (see: Does staying close to the base model preserve learning ability?).

The reason this single lever moves both outcomes becomes clearer once you see what forgetting actually is. One striking finding is that 'catastrophic forgetting' often isn't knowledge being erased at all: the underlying knowledge persists, and what breaks is the activation pathway that aligns the model to a task. Safety behavior wiped out by continual training can be restored with a small amount of retraining on unrelated examples, which only makes sense if the knowledge was never gone (see: Is LLM forgetting really knowledge loss or alignment loss?). KL regularization protects exactly that fragile alignment layer. By penalizing large distributional moves, it stops training from bulldozing the activation pathways that route knowledge into behavior, so the old skills stay reachable and the model's representations stay flexible enough to host new ones.
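As a rough illustration of what "penalizing large distributional moves" can look like, here is a minimal sketch in plain Python. It assumes the penalty is KL(tuned || base) over next-token distributions, added to the task loss with a weight `beta`; the direction of the KL term, the name `beta`, and the toy logits are assumptions made for illustration and are not taken from the notes.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is nonzero wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_regularized_loss(task_loss, tuned_logits, base_logits, beta):
    """Task loss plus beta times the drift of the tuned distribution from base."""
    drift = kl_divergence(softmax(tuned_logits), softmax(base_logits))
    return task_loss + beta * drift

# Hypothetical toy numbers: one candidate update stays near the base
# distribution, the other gets a lower task loss by moving far away.
base_logits = [1.0, 1.0, 1.0]
near = {"task_loss": 0.9, "logits": [1.2, 1.0, 1.0]}
far = {"task_loss": 0.2, "logits": [10.0, 0.0, 0.0]}

for beta in (0.0, 1.0):
    near_total = kl_regularized_loss(near["task_loss"], near["logits"], base_logits, beta)
    far_total = kl_regularized_loss(far["task_loss"], far["logits"], base_logits, beta)
    print(f"beta={beta}: near={near_total:.3f} far={far_total:.3f}")
```

With `beta` at zero the high-drift candidate wins on task loss alone; with the penalty switched on, the low-drift candidate scores better. That is the trade the paragraph above describes, shown on made-up numbers.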

The corpus is rich with sibling strategies that attack the same target from other angles, which suggests the real culprit is aggressive drift away from the base model, not learning itself. Fast-Slow Training reframes forgetting as a misallocation problem: route task-specific lessons into optimized prompts and keep parameter updates minimal, and you reach equivalent performance 1.4–3x faster with far less forgetting and plasticity loss (see: Can splitting adaptation into two channels reduce forgetting?). Proxy-tuning goes further and never touches base weights. It nudges the output distribution at decoding time, closing 88–91% of the alignment gap while beating direct fine-tuning on knowledge tasks, because direct fine-tuning corrupts knowledge storage in the lower layers (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). All three (KL penalties, channel-splitting, decoding-time steering) are different ways of saying 'change behavior without dragging the distribution far from home.'
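To make decoding-time steering concrete, the following is a minimal sketch of the logit arithmetic usually associated with proxy-tuning. It assumes the large base model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart. The notes here only say that the output distribution is nudged at decoding time, so the function name, the three-model setup, and the toy logits should all be read as assumptions.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_logits(base_large, tuned_small, untuned_small):
    """Shift the large model's logits by the small pair's tuned-minus-untuned offset.

    No weights are modified anywhere: the steering happens purely on the
    per-token logits produced at decoding time.
    """
    return [b + (t - u) for b, t, u in zip(base_large, tuned_small, untuned_small)]

# Hypothetical next-token logits over a three-token vocabulary.
base_large = [2.0, 1.0, 0.0]      # large base model prefers token 0
untuned_small = [1.0, 1.0, 0.0]   # small model before tuning
tuned_small = [1.0, 2.5, 0.0]     # small model after tuning prefers token 1

steered = proxy_tuned_logits(base_large, tuned_small, untuned_small)
probs = softmax(steered)
print(steered, [round(p, 3) for p in probs])
```

When the tuned and untuned small models agree, the offset is zero and the large model's distribution is returned unchanged, which is one way to see why this approach leaves the base model's stored knowledge alone.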

There's a quieter twist worth knowing: RL may already do some of this on its own. Across seven algorithms and ten model families, RL updates only 5–30% of parameters, and those sparse updates are nearly identical across random seeds, which makes them structural rather than arbitrary (see: Does reinforcement learning update only a small fraction of parameters?). But left unregularized, RL also collapses diversity, amplifying one pretraining format within the first epoch while suppressing the alternatives (see: Does RL training collapse format diversity in pretrained models?). That collapse is plasticity loss in action: the model commits hard to one mode and can no longer flex. KL regularization is the brake on that collapse. It keeps the distribution wide enough that the model stays a learner rather than a one-trick specialist.
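One way to check the sparsity claim on your own checkpoints is sketched below. It assumes parameters are available as flat lists of floats before and after RL, treats a parameter as updated when it changes by more than a tolerance `tol`, and uses Jaccard overlap to compare which parameters moved under different seeds. The helper names, the tolerance, and the choice of Jaccard overlap are illustrative assumptions, not the measurement protocol of the cited study.

```python
def changed_indices(before, after, tol=0.0):
    """Indices of parameters whose value moved by more than tol."""
    return {i for i, (b, a) in enumerate(zip(before, after)) if abs(a - b) > tol}

def updated_fraction(before, after, tol=0.0):
    """Fraction of parameters that changed during training."""
    return len(changed_indices(before, after, tol)) / len(before)

def mask_overlap(mask_a, mask_b):
    """Jaccard overlap between two sets of updated parameter indices."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 1.0

# Hypothetical ten-parameter model trained with two different seeds.
base = [0.5] * 10
seed_1 = list(base)
seed_1[2] += 0.1
seed_1[7] -= 0.2
seed_2 = list(base)
seed_2[2] += 0.3
seed_2[7] -= 0.1

fraction = updated_fraction(base, seed_1)
overlap = mask_overlap(changed_indices(base, seed_1), changed_indices(base, seed_2))
print(fraction, overlap)
```

In this toy case both seeds move the same two parameters by different amounts, so the updated fraction is small and the overlap is complete, which is the pattern the paragraph above describes.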

The thread tying it all together is that forgetting and adaptation loss are two faces of the same thing: a distribution that has moved too far, too narrowly. KL regularization doesn't manage them separately; it constrains the single quantity (distance from base) that both depend on. If you want to follow this idea past weight updates entirely, the corpus also documents agents that sidestep the whole problem by storing new skills in external memory instead of parameters (see: Can agents learn new skills without forgetting old ones? and Can agents learn continuously from experience without updating weights?). It is a reminder that the cleanest way to avoid drift is sometimes to not move the weights at all.

Sources 8 notes

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Is LLM forgetting really knowledge loss or alignment loss?

Research shows that performance degradation after continual learning reflects disrupted task alignment rather than erased knowledge. Safety alignment can be restored with minimal retraining on unrelated examples, proving the underlying knowledge persists—only the activation pathway was disrupted.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal. It reaches equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, which suggests that forgetting is a misallocation problem rather than an inherent cost.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88–91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
