Are neural network optimizers actually memory systems?
Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning.
Nested Learning (NL) proposes that every component of a neural network, including its optimizers, is an associative memory system that compresses its own context flow. This is not a metaphor but a formal claim: given keys K and values V, an associative memory is an operator M: K → V, and learning is the optimization of M, i.e., the minimization of a loss over the key-value mapping.
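To make the operator view concrete, here is a minimal sketch (my own illustration, not code from the paper; the dimensions and names are assumptions): a linear map serves as the memory M, and "learning" is nothing more than gradient descent on a recall loss over stored key-value pairs.

```python
import numpy as np

# Illustrative sketch: a linear associative memory M: K -> V, learned by
# minimizing a squared recall loss over the key->value mapping.
rng = np.random.default_rng(0)
d_k, d_v, n = 8, 4, 64
K = rng.normal(size=(n, d_k))      # keys: the context the memory must compress
V = rng.normal(size=(n, d_v))      # values it should recall

W = np.zeros((d_k, d_v))           # the memory operator: M(k) = k @ W
lr = 0.05

for _ in range(500):
    err = K @ W - V                # recall error on the stored pairs
    loss = 0.5 * (err ** 2).sum() / n
    grad = K.T @ err / n           # gradient of the recall loss w.r.t. W
    W -= lr * grad                 # optimizing M is the learning/memorization process

print(f"recall loss after training: {loss:.4f}")
```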
The key insight: gradient-based optimizers like Adam and SGD with momentum are themselves associative memory modules that aim to compress the gradient context. Viewed this way, Adam's running averages are a memory compressing the history of gradients, so the optimizer is doing the same thing as the network's layers: learning a useful representation of its input stream. This is self-referential: the optimizer that trains the network is itself a memory system being trained by the data.
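Read this way, Adam's update rule is a small memory system in its own right: its two running averages are an exponentially weighted compression of the gradient stream. Below is a from-scratch sketch of the standard Adam step with comments written from that perspective; the math is just ordinary Adam, and the memory framing lives only in the annotations.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam update, annotated from the memory-system viewpoint.

    m and v are the optimizer's memory: exponential moving averages that
    compress the stream of gradients (and squared gradients) seen so far.
    """
    m = beta1 * m + (1 - beta1) * grad          # write: fold grad into 1st-moment memory
    v = beta2 * v + (1 - beta2) * grad ** 2     # write: fold grad^2 into 2nd-moment memory
    m_hat = m / (1 - beta1 ** t)                # bias correction while the memory is young
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # read: the recalled summary drives the step
    return theta, m, v

# usage on a toy quadratic 0.5 * ||theta||^2, whose gradient is theta itself
theta = np.array([1.0, -2.0, 3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, theta, m, v, t, lr=1e-2)
print(theta)   # each entry ends up near zero
```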
From the neuropsychology literature: "Memory is a neural update caused by an input, and learning is the process for acquiring effective and useful memory." Under this definition, any parameter update from gradient descent (at any level of the system) is a memory operation. This dissolves the artificial boundary between "the model" and "the training process" — both are memory systems at different nesting levels.
The practical implication is a new architectural dimension. Stacking more layers (depth) has diminishing returns: beyond a point it adds little effective computational depth, capacity gains are marginal, training may converge to suboptimal solutions, and the model's ability to adapt or learn continually does not improve. NL suggests adding more nesting levels, i.e., nested optimization problems each running at its own update frequency, as a dimension orthogonal to depth.
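A hypothetical two-level sketch of what an extra nesting level can look like (an assumption for illustration, not the paper's algorithm): an inner optimization problem adapts fast weights at every step, while an outer problem updates slow weights at a coarser interval, so each level runs at its own frequency.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
w_true = rng.normal(size=d)                # structure both levels try to capture

slow = np.zeros(d)                          # outer level: low-frequency weights
fast = np.zeros(d)                          # inner level: high-frequency weights
inner_lr, outer_lr, outer_every = 0.04, 0.02, 10

for step in range(1, 501):
    x = rng.normal(size=d)
    y = x @ w_true
    err = x @ (slow + fast) - y             # prediction uses both levels together
    grad = err * x
    fast -= inner_lr * grad                 # inner problem: adapt every step
    if step % outer_every == 0:
        slow -= outer_lr * grad             # outer problem: consolidate every 10 steps
        fast *= 0.9                          # fast memory fades a little after each consolidation

print(np.linalg.norm(slow + fast - w_true))  # residual error of the combined predictor
```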
This yields concrete architectures:
- Self-Modifying Titans: A sequence model that learns how to modify itself by learning its own update algorithm
- HOPE: Combines a self-modifying sequence model with a continuum memory system, where memories are stored at different frequencies/timescales for more robust memory management and resistance to catastrophic forgetting
- Deep optimizers: More expressive optimizers with deep memory and/or more powerful learning rules, going beyond Adam/SGD
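To give a flavor of the last item, here is a heavily simplified sketch. It is closer to a generic learned-optimizer idea than to the paper's specific construction: the linear momentum memory is kept, but the read-out that turns memory into an update passes through a small MLP whose weights (W1 and W2 below, random placeholders here) would be meta-learned at an outer nesting level.

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical meta-parameters of the optimizer's deeper read-out (untrained here)
hidden = 8
W1 = rng.normal(scale=0.5, size=(2, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, 1))

def deep_momentum_step(theta, grad, mem, lr=0.05, beta=0.9):
    mem = beta * mem + grad                        # linear write: memory of past gradients
    feats = np.stack([grad, mem], axis=-1)         # (n_params, 2)
    correction = (np.tanh(feats @ W1) @ W2)[:, 0]  # deeper, nonlinear read-out of the memory
    update = mem + correction                      # momentum plus the learned refinement
    return theta - lr * update, mem

# toy quadratic: minimize 0.5 * ||theta - target||^2
target = rng.normal(size=5)
theta = np.zeros(5)
mem = np.zeros(5)
for _ in range(300):
    theta, mem = deep_momentum_step(theta, theta - target, mem)
print(np.linalg.norm(theta - target))  # small residual; the untrained read-out
                                       # only adds a bounded perturbation
```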
The continuum memory system is particularly interesting: it generalizes the traditional short-term/long-term memory distinction into a continuous spectrum. Memory is distributed throughout all parameters, stored at different timescales, without isolated blocks. This mirrors brain organization — distributed interconnected memory without clear independent components for different time horizons.
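A toy way to picture the continuum (again an illustrative assumption, not the paper's system): a single parameter vector whose groups all follow the same learning rule but at different update periods, so timescale becomes a spectrum across parameters rather than a boundary between a short-term module and a long-term module.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32
w_true = rng.normal(size=d)

# One parameter vector, split into groups that share the same update rule but
# run at different periods: a spectrum of timescales, not isolated memory blocks.
periods = [1, 4, 16, 64]                         # update every 1, 4, 16, 64 steps
groups = np.array_split(np.arange(d), len(periods))
w = np.zeros(d)
lr = 0.02

for step in range(1, 1025):
    x = rng.normal(size=d)
    y = x @ w_true
    grad = (x @ w - y) * x
    for period, idx in zip(periods, groups):
        if step % period == 0:
            w[idx] -= lr * grad[idx]             # faster groups track recent context,
                                                 # slower groups change only gradually

for period, idx in zip(periods, groups):
    gap = np.linalg.norm(w[idx] - w_true[idx])
    print(f"update period {period:3d}: remaining error {gap:.2f}")
```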
Source: Novel Architectures
Related concepts in this collection
- Can neural memory modules scale language models beyond attention limits?
  Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
  Relation: Titans' memory distinction is a special case of NL's continuum memory system.
- When should AI systems do their thinking?
  Most AI inference happens when users ask questions, but what if models could think during idle time instead? This explores whether shifting inference to before queries arrive could fundamentally change system design.
  Relation: NL suggests the timing question extends to memory: when to consolidate, and at what timescale.
- What happens inside models when they suddenly generalize?
  Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
  Relation: Grokking phases may correspond to transitions between nesting levels.
- Why does reasoning training help math but hurt medical tasks?
  Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
  Relation: NL questions whether this localization is fundamental or an artifact of single-level training.
- Can text-trained models compress images better than specialized tools?
  Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.
  Relation: NL operationalizes the compression = generalization principle at the component level: if every NN component is an associative memory module compressing its own context flow, then the compression-as-generalization equivalence applies not just to the whole model but to each optimizer, layer, and memory system independently.
Original note title: all neural network components including optimizers are associative memory modules compressing their own context flow