Reasoning and Learning Architectures

Does staying close to the base model preserve learning ability?

Explores whether limiting how far training pushes a model from its base distribution (measured by KL divergence) helps it learn new tasks more effectively over time, and why that trade-off matters for continual learning.

Note · 2026-05-28 · sourced from Training Fine Tuning
How do language models learn to think like humans?

One quiet variable connects forgetting, generalization, and the ability to keep learning: how far training pushes the policy from its base distribution, measured as KL divergence. The Fast-Slow result makes the relationship explicit. FST-trained models stay up to 70% closer to the base LLM in KL than models trained with parameter-only RL, and that reduced drift is not just a forgetting story. It preserves plasticity: after training on one task, FST models adapt more effectively to a subsequent task, while parameter-only RL stalls when task domains change on the fly.
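To make "distance from base, measured as KL divergence" concrete, here is a minimal sketch of how drift could be computed from next-token distributions. The note does not say which KL direction or averaging scheme the paper uses; this sketch assumes KL(policy || base) averaged over positions, and the function names and toy distributions are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; eps guards against q_i == 0.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def mean_kl_drift(policy_dists, base_dists):
    """Average per-position KL(policy || base) over a list of positions."""
    kls = [kl_divergence(p, q) for p, q in zip(policy_dists, base_dists)]
    return sum(kls) / len(kls)

# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
base = [[0.25, 0.25, 0.25, 0.25]]
mild = [[0.30, 0.30, 0.20, 0.20]]   # stand-in for a low-drift model
sharp = [[0.85, 0.05, 0.05, 0.05]]  # stand-in for a high-drift model

drift_mild = mean_kl_drift(mild, base)
drift_sharp = mean_kl_drift(sharp, base)

# "X% closer to base" read as the relative reduction in KL drift.
reduction = 1.0 - drift_mild / drift_sharp
print(f"mild={drift_mild:.4f} nats, sharp={drift_sharp:.4f} nats, reduction={reduction:.1%}")
```

Under this reading, "up to 70% closer" would correspond to a `reduction` of at most 0.70 between the two training methods. That interpretation is an assumption, not something the note states.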

The pattern is that drift and plasticity trade off. Each parameter update that improves in-domain reward also moves the model toward a sharper, lower-entropy policy specialized to that task. That specialization is exactly what makes the model less able to absorb the next task: the weights have committed. By keeping most task-specific adaptation in the fast textual channel and letting the slow weights move only a little, FST holds the policy near its flexible base, where it retains the entropy and breadth needed to learn again. Low KL drift is the leading indicator; preserved plasticity and reduced forgetting are downstream consequences.
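The link between sharpening, entropy, and drift can be illustrated with a toy calculation. The sketch below does not model RL updates. As a stand-in, it sharpens a made-up base distribution by raising probabilities to a power `beta` and renormalizing, then reports entropy and KL from base at each step. The `sharpen` helper and all numbers are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl(p, q):
    """KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def sharpen(p, beta):
    """Raise probabilities to the power beta and renormalize; beta > 1 sharpens."""
    weights = [pi ** beta for pi in p]
    total = sum(weights)
    return [w / total for w in weights]

base = [0.4, 0.3, 0.2, 0.1]  # made-up base policy over four actions

# Larger beta plays the role of more task-specific specialization.
rows = []
for beta in (1.0, 2.0, 4.0, 8.0):
    policy = sharpen(base, beta)
    rows.append((beta, entropy(policy), kl(policy, base)))

for beta, h, d in rows:
    print(f"beta={beta:>4}: entropy={h:.3f} nats, KL from base={d:.3f} nats")
```

In this toy setting entropy falls and KL from base rises together as `beta` grows, which is the trade-off the paragraph describes. Whether real RL fine-tuning follows the same curve is an empirical question the sketch cannot answer.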

Why it matters: it gives continual learning a measurable target. Rather than treating "don't forget" and "stay adaptable" as separate desiderata to engineer, you can watch a single quantity, distance from base, and recognize that letting it grow too large is what produces both forgetting and plasticity loss. It also reframes KL regularization (already standard in RLHF as a leash): it is not merely a stability or alignment-preservation device but the mechanism that keeps the model trainable in the future. The counterpoint: staying near base also caps how much any single task can specialize the weights, so for a one-shot deployment with no future tasks, aggressive drift may be the better trade.
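The KL "leash" mentioned above is commonly implemented in RLHF-style training as a penalty subtracted from the task reward. The sketch below shows that generic form, using the summed log-probability ratio of a sampled response as a single-sample KL estimate. It is not the FST objective, which the note does not describe; the function name and the default `beta` are hypothetical.

```python
def kl_penalized_reward(task_reward, logp_policy, logp_base, beta=0.1):
    """Task reward minus beta times a single-sample estimate of KL(policy || base).

    logp_policy and logp_base hold per-token log-probabilities of the same
    sampled response under the current policy and the frozen base model.
    """
    if len(logp_policy) != len(logp_base):
        raise ValueError("log-prob sequences must have the same length")
    # Sum of per-token log-ratios estimates the sequence-level KL for this sample.
    log_ratio = sum(lp - lb for lp, lb in zip(logp_policy, logp_base))
    return task_reward - beta * log_ratio

# Toy example: the policy is more confident than the base on a 3-token response.
policy_logps = [-0.2, -0.1, -0.3]
base_logps = [-1.0, -0.8, -1.2]
print(kl_penalized_reward(1.0, policy_logps, base_logps, beta=0.1))
```

A larger `beta` keeps the policy nearer to base and limits specialization, while a smaller `beta` permits more drift. That is the same trade-off the counterpoint raises for one-shot deployments.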


— "Learning, Fast and Slow: Towards LLMs That Adapt Continually", https://arxiv.org/abs/2605.12484

Original note title

lower kl drift from the base model preserves plasticity enabling stronger continual learning on later tasks