LLM Reasoning and Architecture

Are neural network optimizers actually memory systems?

Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning.

Note · 2026-02-23 · sourced from Novel Architectures

Nested Learning (NL) proposes that every component of a neural network — including optimizers — is an associative memory system that compresses its own context flow. This is not a metaphor but a formal claim: given keys K and values V, associative memory is an operator M: K → V, and the optimization of M (minimizing a loss over the mapping) is the learning process.
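
As a minimal sketch of that claim (the dimensions, data, and learning rate below are invented for illustration, not taken from the paper), a linear associative memory is just a matrix W standing in for the operator M, and learning is nothing more than gradient descent on a loss over the K → V mapping:

```python
import numpy as np

# Minimal sketch: an associative memory is an operator M: K -> V, and
# learning M means minimizing a loss over that mapping. Here M is a
# linear map W trained by plain gradient descent on squared error.
rng = np.random.default_rng(0)
d_k, d_v, n = 8, 4, 64

K = rng.normal(size=(n, d_k))      # keys: the context to be compressed
V = rng.normal(size=(n, d_v))      # values to associate with those keys
W = np.zeros((d_k, d_v))           # the memory M, initially empty

lr = 0.05
for _ in range(500):
    pred = K @ W                   # M(K): recall from memory
    grad = K.T @ (pred - V) / n    # gradient of 0.5 * mean ||M(K) - V||^2
    W -= lr * grad                 # each update writes the K -> V association into W

print("recall error:", float(np.mean((K @ W - V) ** 2)))
```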

The key insight: gradient-based optimizers like Adam and SGD with Momentum are themselves associative memory modules that aim to compress the gradient context. When you view Adam's running averages as a memory system compressing the history of gradients, the optimizer is doing the same thing as the neural network layers — learning a useful representation of its input stream. This is self-referential: the optimizer that trains the network is itself a memory system being trained by the data.
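
To make the optimizer-as-memory reading concrete, here is a stripped-down Adam step (a sketch using the textbook update, not any particular library's implementation) with its state annotated as a compressed memory of the gradient stream:

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, read as an associative-memory write.

    `m` and `v` are exponential moving averages: lossy, fixed-size
    compressions of the entire history of gradients seen so far. The
    optimizer learns a representation of its input stream (the gradients)
    just as a layer learns a representation of the data.
    """
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad       # memory of gradient direction
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2  # memory of gradient magnitude
    m_hat = state["m"] / (1 - beta1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

target = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
state = {"m": np.zeros(3), "v": np.zeros(3), "t": 0}
for _ in range(5000):
    grad = w - target              # gradient of 0.5 * ||w - target||^2
    w = adam_step(w, grad, state)
print(w)                           # w is driven toward target
```

Read this way, beta1 and beta2 are simply the forgetting rates of the optimizer's memory.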

From the neuropsychology literature: "Memory is a neural update caused by an input, and learning is the process for acquiring effective and useful memory." Under this definition, any parameter update from gradient descent (at any level of the system) is a memory operation. This dissolves the artificial boundary between "the model" and "the training process" — both are memory systems at different nesting levels.

The practical implication is a new architectural dimension. Stacking more layers (depth) has diminishing returns for several reasons: adding layers beyond a point does not increase effective computational depth, capacity gains are marginal, training can converge to suboptimal solutions, and the model's ability to adapt or learn continually does not improve. NL suggests adding more nesting levels, i.e. nested optimization problems, as a dimension orthogonal to depth.
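
A toy way to see nesting levels as an axis separate from depth (the names, sizes, and two-level structure below are illustrative assumptions, not the paper's architecture): an inner fast-weight update that runs at every step of a sequence, nested inside an outer slow-weight update that runs once per sequence. The two blocks differ in how often they are optimized, not in how deep they sit.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
W_slow = np.zeros((d, d))                    # outer level: updated once per sequence
target = rng.normal(scale=0.5, size=(d, d))  # fixed mapping the system should learn

def run_sequence(x_seq, y_seq, W_slow, inner_lr=0.1):
    """Inner nesting level: fast weights rewritten at every step.

    W_fast is a short-lived memory of this sequence's context; W_slow is a
    longer-lived memory shaped across sequences. They differ in how often
    they are optimized (nesting level), not in how deep they sit (depth).
    """
    W_fast = np.zeros_like(W_slow)
    grad_slow = np.zeros_like(W_slow)
    for x, y in zip(x_seq, y_seq):
        err = x @ (W_slow + W_fast) - y
        grad = np.outer(x, err)              # gradient of 0.5 * ||err||^2 w.r.t. the weights
        W_fast -= inner_lr * grad            # inner level: write immediately
        grad_slow += grad                    # outer level: accumulate, consolidate later
    return grad_slow                         # first-order: W_fast's dependence on W_slow is ignored

outer_lr = 0.01
for _ in range(200):                         # outer level: one slow update per sequence
    x_seq = rng.normal(size=(16, d))
    W_slow -= outer_lr * run_sequence(x_seq, x_seq @ target, W_slow)
```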

This yields concrete architectures; the most interesting is the continuum memory system.

The continuum memory system generalizes the traditional short-term/long-term memory distinction into a continuous spectrum: memory is distributed throughout all parameters and stored at different timescales, with no isolated short-term or long-term blocks. This mirrors how the brain is organized, with distributed, interconnected memory and no clearly independent components for different time horizons.
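
A minimal sketch of that spectrum (the level count, update frequencies, and update rule are illustrative assumptions, not the paper's design): one parameter block per timescale, where level k consolidates its accumulated updates only every freqs[k] steps.

```python
import numpy as np

class ContinuumMemory:
    """Toy continuum memory: one parameter block per timescale.

    Level k consolidates its accumulated updates only every freqs[k] steps,
    so fast levels track recent context while slow levels absorb long-range
    statistics. There is no separate "short-term" or "long-term" module,
    just a spectrum of update frequencies over distributed parameters.
    """

    def __init__(self, d, freqs=(1, 8, 64), lr=0.05):
        self.freqs = freqs
        self.blocks = [np.zeros((d, d)) for _ in freqs]    # one memory block per level
        self.pending = [np.zeros((d, d)) for _ in freqs]   # updates awaiting consolidation
        self.lr = lr
        self.t = 0

    def read(self, x):
        return sum(x @ W for W in self.blocks)             # recall sums all timescales

    def write(self, x, y):
        self.t += 1
        err = self.read(x) - y
        grad = np.outer(x, err)                            # gradient of 0.5 * ||err||^2
        for k, f in enumerate(self.freqs):
            self.pending[k] += grad
            if self.t % f == 0:                            # level k updates every f steps
                self.blocks[k] -= self.lr * self.pending[k] / f
                self.pending[k][:] = 0.0

mem = ContinuumMemory(d=4)
rng = np.random.default_rng(2)
target = rng.normal(size=(4, 4))
for _ in range(256):
    x = rng.normal(size=4)
    mem.write(x, x @ target)                               # stream of associations to store
```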


