Reasoning and Learning Architectures

Can splitting adaptation into two channels reduce forgetting?

When language models adapt to new tasks, does separating task-specific learning (via prompt context) from persistent parameter updates help preserve both generalization ability and the model's original capabilities?

Note · 2026-05-28 · sourced from Training Fine Tuning
How do language models learn to think like humans?

Treating parameter updates as the sole mechanism of adaptation creates a bottleneck: every improvement — a reusable reasoning skill, a task heuristic, even a transient lesson from recent rollouts — has to be written into the same persistent weights. Because the whole policy lives in those weights, any update that raises in-domain reward simultaneously drags the model away from its base behavior, reducing entropy, hurting out-of-distribution generalization, and eroding the model's ability to adapt to future tasks (plasticity loss).

Fast-Slow Training resolves this by refusing to make weights carry everything. It splits adaptation into a slow parametric component (model weights: expensive to update, holding long-lived behavior) and a fast textual component (prompts, instructions, and task context, optimized via reflective prompt evolution with GEPA). The fast channel absorbs task-specific and rapidly changing information from textual feedback; the slow channel consolidates only persistent behavior and stays closer to the base model. Interleaving the two (RL updates plus context optimization) reaches matched performance in 1.4–3x fewer optimizer steps, with a higher asymptote, while leaving the weights far closer to the base model.
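
As a rough illustration of the division of labor, the toy below replaces the model with two scalars. It is a numeric analogy only: there is no language model, RL, or GEPA here, and the names `Policy`, `adapt`, `slow_lr`, and `fast_lr` are hypothetical. The "prediction" is `weights + context`, the fast channel takes large steps, and the slow channel takes small ones.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    weights: float = 0.0  # slow channel: stand-in for model parameters (base model = 0.0)
    context: float = 0.0  # fast channel: stand-in for an optimized prompt


def adapt(target: float, use_context: bool, slow_lr: float = 0.05,
          fast_lr: float = 0.5, tol: float = 0.01, max_steps: int = 10_000):
    """Adapt to one task; return (steps taken, weight drift from the base model)."""
    policy = Policy()
    for step in range(1, max_steps + 1):
        if use_context:
            # Fast update: a large step on the context soaks up task-specific error.
            error = policy.weights + policy.context - target
            policy.context -= fast_lr * error
        # Slow update: a small step on the weights against whatever error remains.
        error = policy.weights + policy.context - target
        policy.weights -= slow_lr * error
        if abs(policy.weights + policy.context - target) < tol:
            return step, abs(policy.weights)
    return max_steps, abs(policy.weights)


# Assumed toy task: move the prediction from 0.0 to 1.0.
weights_only = adapt(target=1.0, use_context=False)
fast_slow = adapt(target=1.0, use_context=True)
print("weights only: steps=%d drift=%.3f" % weights_only)
print("fast + slow:  steps=%d drift=%.3f" % fast_slow)
```

In this toy, the two-channel run converges in fewer steps and leaves `weights` much nearer its starting value, because the context absorbs most of the task-specific error. The specific numbers are properties of the toy's step sizes and say nothing about the paper's reported speedups.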

Why it matters: it reframes catastrophic forgetting as a misallocation problem rather than an inherent cost of learning. Forgetting happens because we force weights to store things that do not belong in weights. Route the transient and task-specific information into context, and the weights stay general, so there is less to forget. This is a division-of-labor argument: the two channels operate at different timescales (an echo of System 1 vs System 2), and each does what it is suited for. The counterpoint is that the fast channel's capacity is bounded by context length and prompt-optimization quality, so genuinely large bodies of new knowledge still have to land in weights eventually.


— "Learning, Fast and Slow: Towards LLMs That Adapt Continually", https://arxiv.org/abs/2605.12484

Original note title

splitting adaptation into slow weights and fast textual context avoids catastrophic forgetting and plasticity loss