How do adaptive memory modules compare to feedback-based working memory for long context?

This note compares two rival ways of giving a model long-term memory: bolting on a separate module that decides what to store (adaptive memory), versus making the model reuse its own internal states as a working scratchpad (feedback-based working memory). The corpus shows they answer different questions.

The clearest example of the first approach is Titans, which splits the system into short-term attention and a dedicated neural memory that adaptively memorizes only the *surprising* tokens, letting it scale past two million tokens without paying attention's quadratic cost (see "Can neural memory modules scale language models beyond attention limits?"). The clearest example of the second is TransformerFAM, where a feedback loop lets a transformer attend to its own latent representations: no new weights, just emergent working memory that handles indefinitely long inputs (see "Can models learn working memory by attending to their own latents?").

The honest comparison is that they aren't really competing for the same job. Adaptive memory is about *selective retention* — choosing what's worth keeping out of a flood of context, the way Titans privileges the unexpected. Feedback working memory is about *active maintenance* — keeping a compressed running state alive across the stream so the model never loses the thread. One is a librarian deciding what to archive; the other is a person holding a phone number in their head while they dial.
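The contrast between selective retention and active maintenance can be sketched with two toy classes. This is a minimal illustration under stated assumptions, not the Titans or TransformerFAM mechanism: the names `SurpriseGatedMemory`, `FeedbackState`, and `run_demo` are hypothetical, the surprise scores are supplied by hand rather than computed by a model, and the running state is a single float updated by an exponential moving average.

```python
from dataclasses import dataclass, field


@dataclass
class SurpriseGatedMemory:
    """Selective retention: archive only items whose surprise exceeds a threshold.

    The surprise score is a caller-supplied stand-in (an assumption for
    illustration), not the update rule of any published architecture.
    """
    threshold: float
    store: list = field(default_factory=list)

    def observe(self, token: str, surprise: float) -> None:
        if surprise > self.threshold:
            self.store.append(token)


@dataclass
class FeedbackState:
    """Active maintenance: fold every input into one fixed-size running state."""
    decay: float
    state: float = 0.0

    def observe(self, value: float) -> None:
        # The previous state is fed back in at every step, so nothing is
        # archived verbatim; only a compressed summary survives.
        self.state = self.decay * self.state + (1.0 - self.decay) * value


def run_demo():
    # Each item is (token, hand-assigned surprise, hand-assigned numeric value).
    stream = [
        ("the", 0.1, 1.0),
        ("cat", 0.2, 1.0),
        ("exploded", 0.9, 5.0),
        ("again", 0.3, 1.0),
    ]
    archive = SurpriseGatedMemory(threshold=0.5)
    scratch = FeedbackState(decay=0.5)
    for token, surprise, value in stream:
        archive.observe(token, surprise)
        scratch.observe(value)
    return archive.store, scratch.state
```

In this toy, the archive grows only when something unexpected arrives, while the feedback state stays the same size no matter how long the stream runs. That is the librarian versus the person holding a phone number.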

What makes the comparison interesting is a third paper that reframes the whole contest. Research on the long-context bottleneck argues the real constraint isn't memory *capacity* at all; it's the *compute* needed to transform evicted context into internal state, something that improves with more consolidation passes during an offline 'sleep' phase (see "Is long-context bottleneck really about memory or compute?"). By that lens, both approaches are buying the same thing, cheaper consolidation, and the architectural choice is secondary to how much compute you spend folding context into fast weights. This matters because reasoning quality degrades sharply with input length well below the context window's nominal limit (accuracy dropping from 92% to 68% with just 3,000 tokens of padding), so simply 'fitting more in' was never the win (see "Does reasoning ability actually degrade with longer inputs?").
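The idea that more consolidation passes buy a better internal state can be illustrated with a deliberately tiny example. This is a sketch under strong assumptions, not the method from the cited research: the "fast weight" is a single float, `consolidate` and `consolidation_error` are hypothetical names, and the update is a plain step toward each evicted value.

```python
def consolidate(evicted, passes, lr=0.5):
    """Fold evicted values into one 'fast weight' using repeated passes.

    Toy assumption: the internal state is a single float that is nudged
    toward each evicted value in turn. Returns the final weight.
    """
    weight = 0.0
    for _ in range(passes):
        for value in evicted:
            weight += lr * (value - weight)
    return weight


def consolidation_error(evicted, passes):
    """Absolute gap between the consolidated weight and the mean evicted value."""
    target = sum(evicted) / len(evicted)
    return abs(consolidate(evicted, passes) - target)
```

For a constant stream such as `[2.0, 2.0, 2.0]`, the gap shrinks with each extra pass, which mirrors the claim that the spend is compute rather than storage. Real consolidation operates on model weights and is far less tidy.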

The corpus also offers two cheaper, non-architectural cousins worth knowing about. ReadAgent skips fancy memory entirely and just compresses documents into human-style 'gist memories,' then looks up details only when a task demands them, extending effective context by roughly 3 to 20 times without touching the model's internals (see "Can LLMs read long documents like humans do?"). And Reflexion shows that for agents, 'memory' can just be verbal self-reflections stored as episodic text between episodes, learning from feedback with zero weight updates (see "Can agents learn from failure without updating their weights?"). These hint that the adaptive-vs-feedback question lives on a spectrum from 'change the architecture' to 'just write things down.'
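Both 'just write things down' approaches are simple enough to sketch in a few lines. The code below is an assumption-laden illustration, not ReadAgent's or Reflexion's actual implementation: `make_gist` is a hypothetical stand-in for an LLM summarizer (it only truncates), and `GistMemory` and `ReflectionLog` are invented names. Recording reflections only after failures is also an assumption made for brevity.

```python
def make_gist(page: str, max_words: int = 6) -> str:
    """Hypothetical stand-in for an LLM summarizer: keep the first few words."""
    return " ".join(page.split()[:max_words])


class GistMemory:
    """Compress pages into gists up front; fetch full text only on demand."""

    def __init__(self, pages):
        self.pages = list(pages)
        # Gists are built before any task is known.
        self.gists = [make_gist(page) for page in self.pages]

    def context(self) -> str:
        """Short, indexed view that would be placed in the model's prompt."""
        return "\n".join(f"[{i}] {gist}" for i, gist in enumerate(self.gists))

    def lookup(self, index: int) -> str:
        """Retrieve the original page when a task needs the details."""
        return self.pages[index]


class ReflectionLog:
    """Episodic text memory: verbatim self-reflections kept between episodes."""

    def __init__(self):
        self.notes = []

    def record(self, succeeded: bool, reflection: str) -> None:
        # Assumption: only failed episodes produce a stored reflection.
        if not succeeded:
            self.notes.append(reflection)

    def prompt_prefix(self) -> str:
        """Text to prepend to the next episode's prompt; no weights change."""
        return "\n".join(self.notes)
```

Neither class touches model internals. All the 'memory' lives in text that is rebuilt into the prompt, which is what places these approaches at the cheap end of the spectrum.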

The takeaway you didn't know you wanted: there's no single 'working memory' to build. RAISE shows agent memory decomposes into four distinct components across two time scales, dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory), each with its own failure modes and update rules (see "How should agent memory split across time scales?"). So the real comparison isn't 'adaptive module vs. feedback loop' as a winner-take-all; it's which mechanism fits which slice of memory. Titans-style retention suits the slow, archival channel; FAM-style feedback suits the fast, in-flight channel. That split parallels the slow-weights/fast-context division that also turns out to reduce catastrophic forgetting (see "Can splitting adaptation into two channels reduce forgetting?").
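The four-component split can be written down as a small data structure. This is a sketch based only on the component names given in the RAISE source note below, not the RAISE codebase: the class names, field types, and the `end_turn` reset policy are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueLevelMemory:
    """Slow channel: persists across the whole conversation."""
    conversation_history: list = field(default_factory=list)
    scratchpad: dict = field(default_factory=dict)


@dataclass
class TurnLevelMemory:
    """Fast channel: scoped to the turn currently being worked on."""
    examples: list = field(default_factory=list)
    task_trajectory: list = field(default_factory=list)


@dataclass
class AgentMemory:
    dialogue: DialogueLevelMemory = field(default_factory=DialogueLevelMemory)
    turn: TurnLevelMemory = field(default_factory=TurnLevelMemory)

    def end_turn(self, user_msg: str, agent_msg: str) -> None:
        # Assumed update policy: append to the slow channel, reset the fast one.
        self.dialogue.conversation_history.append((user_msg, agent_msg))
        self.turn = TurnLevelMemory()
```

Keeping the two levels in separate objects makes the different update rules explicit: one accumulates, the other is discarded at each turn boundary.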

Sources 8 notes

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can models learn working memory by attending to their own latents?

TransformerFAM demonstrates that adding a feedback loop lets transformers attend to their own latent representations, fostering emergent working memory for indefinitely long inputs. The approach requires no additional weights and improves long-context performance at 1B, 8B, and 24B scales.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Can LLMs read long documents like humans do?

ReadAgent compresses documents into gist memories before knowing the task, then retrieves details only when needed, extending effective context 3–20× and outperforming retrieval baselines on long-document QA.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components organized by two design axes: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
