How do retention gates regularize forgetting across different sequence model architectures?

This explores how sequence models hold onto useful information while letting go of the rest — but the corpus approaches that problem through architectural *separation* of memory rather than literal 'retention gate' mechanisms (the gating units in RetNet-style or gated linear-attention models), so I'll map what it does have and flag the gap.

A clarification up front: the collection doesn't contain papers on retention gates as a specific mechanism, meaning the learned gating units inside gated linear-attention or RetNet-style architectures that decay old state. What it has instead is a richer answer to the worry behind your question: forgetting isn't one problem with one knob, and the strongest results come from *separating memory into channels* rather than tuning a single gate.

The recurring insight across the corpus is that catastrophic forgetting is a misallocation problem, not an inherent cost. Fast-Slow Training makes this explicit: it routes task-specific lessons into fast textual context (optimized prompts) while keeping slow weight updates minimal, reaching equivalent performance 1.4–3x faster with substantially less forgetting (Can splitting adaptation into two channels reduce forgetting?). SoftCoT reaches the same destination by a different road: it freezes the main model entirely and delegates new reasoning to a small auxiliary model, so pre-trained knowledge can't be overwritten (Can continuous reasoning avoid forgetting in instruction-tuned models?). VOYAGER pushes the separation all the way out of the weights: it stores skills in an external, embedding-indexed library and composes new ones from old, so lifelong learning never touches the parameters that could forget (Can agents learn new skills without forgetting old ones?). Three architectures, one shared move: give new information its own home so it doesn't evict the old.
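The external-library idea can be sketched in a few lines. This is a toy illustration, not VOYAGER's implementation: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `SkillLibrary`, `chop_wood`, and `craft_pickaxe` are invented names. The point it tries to show is that learning a new skill is an append to a store, so nothing already learned is overwritten.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SkillLibrary:
    """Append-only store: adding a skill never modifies an existing one."""

    def __init__(self) -> None:
        self._skills: List[Tuple[str, Counter, Callable[[], List[str]]]] = []

    def add(self, description: str, skill: Callable[[], List[str]]) -> None:
        self._skills.append((description, embed(description), skill))

    def retrieve(self, query: str) -> Callable[[], List[str]]:
        """Return the stored skill whose description best matches the query."""
        if not self._skills:
            raise LookupError("skill library is empty")
        query_vec = embed(query)
        best = max(self._skills, key=lambda entry: cosine(query_vec, entry[1]))
        return best[2]


library = SkillLibrary()


def chop_wood() -> List[str]:
    return ["wood"]


library.add("chop wood from tree", chop_wood)


def craft_pickaxe() -> List[str]:
    # Compose a new skill from an old one by looking it up in the library.
    materials = library.retrieve("get wood from a tree")()
    return materials + ["pickaxe"]


library.add("craft wooden pickaxe from wood", craft_pickaxe)
```

After the second `add`, a query such as `"get wood from a tree"` still resolves to `chop_wood`: the older skill is untouched because no shared parameters were updated to learn the newer one.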

The closest thing the corpus has to an actual retention mechanism is Titans, which splits attention (short-term, quadratic, expensive) from a neural memory module that adaptively decides *what* to store, prioritizing surprising tokens and compressing the rest, and scales past two million tokens (Can neural memory modules scale language models beyond attention limits?). This is the spirit of a retention gate generalized: instead of a fixed decay applied uniformly, surprise becomes the signal for what earns long-term storage. It reframes 'regularizing forgetting' as 'learning what's worth remembering.'
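The contrast between uniform decay and surprise-driven storage can be made concrete with a toy linear memory. This is a minimal sketch under stated assumptions, not the Titans update rule: `fixed_decay_step` uses a constant decay factor `gamma` in the style of a retention recurrence, while `error_driven_step` is a plain delta-rule write in which the size of the prediction error stands in for "surprise". All function names and parameter values here are illustrative.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def zeros(rows: int, cols: int) -> Matrix:
    return [[0.0] * cols for _ in range(rows)]


def read(memory: Matrix, key: Vector) -> Vector:
    """Linear read: the value the memory currently associates with `key`."""
    cols = len(memory[0])
    return [sum(key[i] * memory[i][j] for i in range(len(key))) for j in range(cols)]


def fixed_decay_step(memory: Matrix, key: Vector, value: Vector,
                     gamma: float = 0.9) -> Matrix:
    """Uniform retention: shrink every old entry by gamma, then add key x value."""
    return [[gamma * memory[i][j] + key[i] * value[j] for j in range(len(value))]
            for i in range(len(key))]


def error_driven_step(memory: Matrix, key: Vector, value: Vector,
                      lr: float = 1.0) -> Tuple[Matrix, float]:
    """Write only what the memory failed to predict (delta rule).

    Returns the updated memory and the error norm, used here as a
    simple proxy for how surprising the input was.
    """
    predicted = read(memory, key)
    error = [v - p for v, p in zip(value, predicted)]
    surprise = sum(e * e for e in error) ** 0.5
    updated = [[memory[i][j] + lr * key[i] * error[j] for j in range(len(value))]
               for i in range(len(key))]
    return updated, surprise


key, value = [1.0, 0.0], [2.0, -1.0]

# Present the same (key, value) pair twice to each rule.
decayed = fixed_decay_step(fixed_decay_step(zeros(2, 2), key, value), key, value)
first, surprise_first = error_driven_step(zeros(2, 2), key, value)
second, surprise_second = error_driven_step(first, key, value)
```

With a unit-norm key and `lr=1.0`, the second presentation produces zero error, so the error-driven memory is left unchanged, whereas the fixed-decay rule rewrites the state on every step whether or not the input carried anything new.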

There's a deeper cross-cutting framing worth pulling out. One note argues the long-context bottleneck isn't memory *capacity* at all. It is the *compute* needed to consolidate evicted context into fast weights during offline 'sleep' passes, and performance keeps improving as more consolidation passes are added (Is long-context bottleneck really about memory or compute?). Read alongside Titans, this suggests forgetting is regulated less by a gate that throws things away and more by how much work a model is willing to do to fold information into durable state. And there's a hard ceiling underneath all of it: GPT-family models memorize up to roughly 3.6 bits per parameter, then undergo a phase transition into generalization (When do language models stop memorizing and start generalizing?), so 'what gets retained' is bounded by a measurable capacity, not just by architecture.
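To get a feel for the scale of that ceiling, the per-parameter figure can be turned into a back-of-the-envelope total. This is only arithmetic on the approximate 3.6 bits-per-parameter number quoted above; applying it uniformly to any parameter count is an assumption of this sketch, and the function names are illustrative.

```python
BITS_PER_PARAM = 3.6  # approximate figure quoted in the text, not a measured constant here


def memorization_capacity_bits(n_params: int) -> float:
    """Estimated raw memorization capacity in bits for a model of n_params parameters."""
    return BITS_PER_PARAM * n_params


def memorization_capacity_megabytes(n_params: int) -> float:
    """Same estimate expressed in megabytes (8 bits per byte, 10**6 bytes per MB)."""
    return memorization_capacity_bits(n_params) / 8 / 1e6


one_billion = 1_000_000_000
capacity_mb = memorization_capacity_megabytes(one_billion)
```

Under these assumptions a one-billion-parameter model would top out at roughly 450 MB of memorized content, which is small next to typical training corpora.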

The less obvious takeaway: the best forgetting-control results in this collection don't come from a smarter gate inside one architecture. They come from refusing to make one set of weights do both jobs. Whether the channel is a frozen backbone plus a helper, slow weights plus fast prompts, or attention plus a surprise-driven memory, the winning pattern is architectural division of labor, and that pattern transfers across model types in a way a single retention gate doesn't.

Sources (6 notes)

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can neural memory modules scale language models beyond attention limits?

The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

When do language models stop memorizing and start generalizing?

GPT-family models have a measurable memorization capacity of approximately 3.6 bits per parameter. When this capacity fills, a phase transition triggers grokking: the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
