Can memory-based adaptation and gradient fine-tuning operate on complementary timescales?

This explores whether two ways of updating an AI — fast, reversible memory (storing experiences or context) versus slow, permanent weight changes (gradient fine-tuning) — can be split so each handles a different speed of learning, instead of competing.

The corpus answers with a fairly confident yes, and the most direct evidence is an architecture that builds the split into the model itself. Titans separates fast, quadratic attention (short-term: what is happening right now) from a compressed neural memory that retains surprising tokens over long contexts (see "Can neural memory modules scale language models beyond attention limits?"). The same two-clock idea shows up explicitly in training: Fast-Slow Training routes task-specific lessons into fast, editable prompts while keeping slow parameter updates minimal, reaching the same performance 1.4–3x faster with much less catastrophic forgetting (see "Can splitting adaptation into two channels reduce forgetting?"). The striking framing there is that forgetting is not an inherent cost of learning but a *misallocation* problem: it arises when lessons that belonged in fast context are written into slow weights.
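As a rough illustration of the two-clock split, here is a toy sketch. It is not the Fast-Slow Training algorithm or the Titans architecture. It assumes a single shared scalar weight as the "slow" channel and a per-task offset stored in a dictionary as the "fast" channel. `TwoClockLearner`, `slow_lr`, `fast_lr`, and the task name are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TwoClockLearner:
    """Toy model: prediction = slow_w * x + fast offset for the current task."""

    slow_w: float = 1.0      # shared weight, stands in for model parameters
    slow_lr: float = 0.001   # small step: the slow channel barely moves
    fast_lr: float = 0.5     # large step: the fast channel absorbs most of the error
    fast_memory: dict = field(default_factory=dict)  # task name -> offset (editable)

    def predict(self, task: str, x: float) -> float:
        return self.slow_w * x + self.fast_memory.get(task, 0.0)

    def update(self, task: str, x: float, y: float) -> None:
        err = y - self.predict(task, x)
        # Fast channel: write the task-specific lesson into memory.
        self.fast_memory[task] = self.fast_memory.get(task, 0.0) + self.fast_lr * err
        # Slow channel: a small gradient step on the shared weight.
        self.slow_w += self.slow_lr * err * x

    def forget(self, task: str) -> None:
        # Fast memory is reversible: dropping it leaves the slow weight as it is.
        self.fast_memory.pop(task, None)


def demo() -> TwoClockLearner:
    learner = TwoClockLearner()
    for _ in range(50):
        learner.update("task_a", x=1.0, y=4.0)
    return learner
```

In this toy setting the fast offset ends up carrying almost all of the task-specific correction, while `slow_w` drifts by less than 0.01 from its starting value. Calling `forget("task_a")` removes the adaptation without touching the shared weight, which is the sense in which the fast channel is reversible.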

Sources 8 notes

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Does staying close to the base model preserve learning ability?

Models trained with Fast-Slow Training (FST) stay up to 70% closer to their base distribution than models trained with parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
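The decoding-time idea in the proxy-tuning note above can be sketched in a few lines. This is a minimal illustration, assuming the commonly described formulation in which the large base model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart. The function and argument names are hypothetical, and the logits are made-up numbers rather than outputs of any real model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, tuned_small_logits, untuned_small_logits):
    """Shift the base model's logits by (tuned - untuned) from a small model pair.

    The base model's weights are never modified; only its output
    distribution is adjusted at decoding time.
    """
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)


# Made-up logits over a three-token vocabulary.
base = [2.0, 1.0, 0.5]
probs = proxy_tuned_probs(base, [0.0, 0.0, 2.0], [0.0, 0.0, 0.0])
```

When the tuned and untuned small models agree, the shift is zero and the base distribution is returned unchanged. That property is one way to read the note's claim that base knowledge is left intact.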
