How does dual-rate learning separate episodic and procedural memory in neural networks?
This explores how 'dual-rate' learning — pairing a fast-changing memory channel with a slow-changing one — lets a network keep specific experiences (episodic) separate from general skills (procedural), and why that separation matters.
The cleanest example in the corpus is Latent-Thought Language Models, which couple fast local variational learning (per-input 'thought' vectors that adapt quickly) with slow global decoder learning (shared weights that change gradually). The fast channel captures what is specific to the current input; the slow channel accumulates what generalizes. Because the two update at different speeds, the model gains a scaling dimension, latent size, that is independent of parameter count (see 'Can latent thought vectors scale language models beyond parameters?').
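To make the two rates concrete, here is a minimal toy sketch. It is not the Latent-Thought Language Model method: it assumes a scalar model `y ≈ w * x + z`, where `w` stands in for the slow shared decoder and `z` for a per-input latent that is re-inferred from scratch with a few large steps. The quadratic penalty on `z` is an assumed stand-in for a variational prior term, and all function names and hyperparameters are hypothetical.

```python
import random

def infer_latent(w, x, y, fast_lr=0.2, fast_steps=5, prior_weight=0.5):
    """Fast channel: re-infer a per-input latent z with a few large steps."""
    z = 0.0  # the latent starts fresh for every input
    for _ in range(fast_steps):
        err = (w * x + z) - y
        # Gradient of err^2 + prior_weight * z^2 with respect to z.
        z -= fast_lr * (2 * err + 2 * prior_weight * z)
    return z

def train_dual_rate(data, slow_lr=0.01, epochs=200, seed=0):
    """Slow channel: one small step on the shared weight w per example."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        examples = list(data)
        rng.shuffle(examples)
        for x, y in examples:
            z = infer_latent(w, x, y)       # fast, local adaptation
            err = (w * x + z) - y
            w -= slow_lr * 2 * err * x      # slow, global update
    return w

# Toy data (an assumption): y = 2 * x plus a per-example offset.
data = [(-2.0, -3.7), (-1.0, -2.3), (1.0, 1.7), (2.0, 4.3)]
w = train_dual_rate(data)
print("shared weight:", round(w, 3))
print("latent for first example:", round(infer_latent(w, *data[0]), 3))
```

In this toy, the shared weight settles near the slope common to all examples while each latent absorbs part of that example's own offset. The penalty on `z` is what keeps the fast channel from absorbing everything and starving the slow one of gradient.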
The two-speed split earns its keep once you look at what happens without it. When a single set of weights has to absorb both fast specifics and slow generalities, you get catastrophic forgetting: new learning overwrites old. 'Fast-Slow Training' reframes that as a misallocation problem. Route the task-specific lessons into fast textual context (optimized prompts), keep the slow parameter updates minimal, and the model reaches equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss (see 'Can splitting adaptation into two channels reduce forgetting?'). Same principle, different substrate: separate the rates and the fast material stops corrupting the slow material.
What makes this more than an engineering trick is that the brain appears to do something similar, and the corpus draws the map explicitly. The Complementary Learning Systems (CLS) framing lines up transformer weights with the slow-consolidating neocortex (procedural, distributed knowledge), retrieval/RAG stores with the fast-encoding hippocampus (episodic, rapid capture), and agentic state with prefrontal control (see 'Can brain memory systems explain how LLMs should store knowledge?'). Dual-rate learning is the computational echo of a fast 'remember this episode' system feeding a slow 'distill the pattern' system. Titans makes the architectural version concrete: short-term attention plus a separate long-term neural memory that prioritizes surprising tokens for storage (see 'Can neural memory modules scale language models beyond attention limits?').
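The idea of a long-term store that writes only when something is surprising can be illustrated with a small sketch. This is not the Titans architecture, which uses a learned neural memory. It assumes a plain key-value store, measures surprise as the absolute prediction error, and writes only when that error exceeds a threshold. The class name, `threshold`, and `write_rate` are all hypothetical.

```python
class SurpriseGatedMemory:
    """Toy long-term memory that stores only surprising observations."""

    def __init__(self, threshold=0.5, write_rate=1.0):
        self.threshold = threshold    # minimum error that counts as surprising
        self.write_rate = write_rate  # fraction of the error written back
        self.store = {}

    def read(self, key):
        # Unknown keys predict 0.0 (an assumption of this sketch).
        return self.store.get(key, 0.0)

    def observe(self, key, value):
        """Return True if the observation was surprising enough to store."""
        predicted = self.read(key)
        surprise = abs(value - predicted)
        if surprise <= self.threshold:
            return False              # expected input: leave memory untouched
        self.store[key] = predicted + self.write_rate * (value - predicted)
        return True

memory = SurpriseGatedMemory()
for key, value in [("a", 1.0), ("a", 1.1), ("a", 3.0)]:
    print(key, value, "stored" if memory.observe(key, value) else "skipped")
```

The gate is the design choice worth noticing: unsurprising inputs never touch the long-term store, so routine traffic cannot overwrite what is already consolidated.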
The episodic-vs-procedural distinction the question names also shows up cleanly in how agents learn. Reflexion stores verbal self-reflections as episodic memory, concrete records of specific trials, and improves without ever touching the weights (see 'Can agents learn from failure without updating their weights?'). Meanwhile, procedural knowledge turns out to be what actually generalizes reasoning: analysis of 5 million pretraining documents shows that reasoning draws on broad, transferable procedures, while factual recall leans on narrow, document-specific memorization (see 'Does procedural knowledge drive reasoning more than factual retrieval?'). These are the two memory types in different guises: episodes are concrete and local, procedures are abstract and shared.
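The Reflexion-style loop described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the Reflexion implementation: the 'agent' is a fixed rule that picks the first candidate its notes have not ruled out, the environment returns a binary success signal, and each failure is recorded as an uncompressed verbal note. No parameters are updated anywhere. The function and field names are hypothetical.

```python
def run_trials(candidates, is_success, max_trials=10):
    """Retry a task, learning only through an episodic buffer of reflections."""
    reflections = []  # episodic memory: one concrete note per failed trial
    for trial in range(1, max_trials + 1):
        ruled_out = {note["action"] for note in reflections}
        remaining = [c for c in candidates if c not in ruled_out]
        if not remaining:
            break                      # every option has been tried
        action = remaining[0]          # the policy itself never changes
        if is_success(action):         # unambiguous binary feedback
            return action, trial, reflections
        reflections.append({
            "trial": trial,
            "action": action,
            "note": f"Trial {trial}: '{action}' failed; try a different action.",
        })
    return None, len(reflections), reflections

solution, trials_used, notes = run_trials(["a", "b", "c"], lambda act: act == "c")
print(solution, trials_used)
for note in notes:
    print(note["note"])
```

All improvement across trials comes from the growing `reflections` list; the decision rule is identical on every attempt, which is the sense in which learning happens without a weight update.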
The most striking wrinkle is that the separation works best when the two streams are treated *asymmetrically*, not just at different speeds. SkillRL keeps successful episodes as concrete demonstrations but compresses failures into abstracted lessons. The win is stored episodically, the loss is stored procedurally, and that asymmetry beats uniform consolidation (see 'Should successful and failed episodes be processed differently?'). RL training itself also moves through the two registers in sequence: a first phase in which execution correctness drives learning, then a second in which strategic planning becomes the bottleneck (see 'Does RL training follow a predictable two-phase learning sequence?'). So 'dual-rate' is not only about speed. It is about giving specifics and generalities different jobs, different storage, and sometimes different processing entirely.
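The asymmetric treatment of wins and losses can be sketched as a consolidation step. This is a hedged toy, not SkillRL's method: it assumes each episode is a dict with hypothetical `success`, `steps`, and `error` fields, keeps successful trajectories verbatim, and 'abstracts' failures by simply grouping them by error label, a crude stand-in for whatever distillation a real system would use.

```python
from collections import Counter

def consolidate(episodes):
    """Keep successes as full demonstrations; compress failures into lessons."""
    demonstrations = []         # episodic: complete successful trajectories
    failure_causes = Counter()  # raw material for the abstracted lessons
    for episode in episodes:
        if episode["success"]:
            demonstrations.append(list(episode["steps"]))
        else:
            failure_causes[episode["error"]] += 1
    # One short lesson per distinct cause, most frequent first.
    lessons = [
        f"Avoid: {cause} (seen {count}x)"
        for cause, count in failure_causes.most_common()
    ]
    return demonstrations, lessons

episodes = [
    {"success": True, "steps": ["open", "search", "answer"], "error": None},
    {"success": False, "steps": ["open", "wait"], "error": "timeout"},
    {"success": False, "steps": ["open", "click"], "error": "wrong tool"},
    {"success": False, "steps": ["wait", "wait"], "error": "timeout"},
]
demos, lessons = consolidate(episodes)
print(demos)
print(lessons)
```

Three failed trajectories shrink to two one-line lessons while the single success is kept whole, which mirrors the context saving the source note attributes to asymmetric processing.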
Sources (8 notes)
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
Research shows transformer weights function as a distributed neocortex for consolidated knowledge, RAG stores as hippocampal indexing for rapid encoding, and agentic state as prefrontal executive control. The CLS framework predicts why hybrid systems outperform single-tier approaches and identifies missing consolidation mechanisms that prevent memory integration.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.