Can lookup memory and computation work together better than either alone?
Mixture-of-Experts handles dynamic logic, but static knowledge might need a different mechanism. Can a hybrid approach combining conditional computation with fast lookup outperform pure sparse models?
Sparse Transformers currently rely on a single sparsity primitive: conditional computation via Mixture-of-Experts, where dynamic logic routes through sparsely activated parameters. Engram (2601.07372) argues this is incomplete, because knowledge has a different shape from logic. Static facts ("Jacobi was born in 1804") are not dynamic logic; they are key-value lookups. Forcing them through computation spends model capacity on simulating retrieval.
Engram introduces conditional memory as the missing sparsity axis. The instantiation is a modernized N-gram embedding table — local context as key, indexed via constant-time O(1) lookups into a massive embedding store. The modernizations matter: tokenizer compression, multi-head hashing, contextualized gating, multi-branch integration. Classical N-grams failed because they could not compose; these adaptations make them composable with the surrounding transformer.
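The lookup path described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class `HashedNgramMemory`, the function `gated_merge`, the rolling hash, the table sizes, and the sigmoid dot-product gate are all assumptions chosen for clarity. Tokenizer compression and multi-branch integration are left out.

```python
import math
import random


class HashedNgramMemory:
    """Toy conditional memory: hashed N-gram keys index fixed-size tables.

    Hypothetical sketch; all names and sizes are illustrative assumptions.
    """

    def __init__(self, n=2, num_heads=4, table_size=1024, dim=8, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.num_heads = num_heads
        self.table_size = table_size
        self.dim = dim
        # One table per hash head; each row is a slice of the final embedding.
        self.tables = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(table_size)]
            for _ in range(num_heads)
        ]
        # A distinct odd multiplier per head gives each head its own hash.
        self.multipliers = [rng.randrange(1, 2**31, 2) for _ in range(num_heads)]

    def _slot(self, ngram, head):
        # Polynomial rolling hash over token ids, seeded by the head's
        # multiplier so that heads differ even for a single-token key.
        mult = self.multipliers[head]
        h = mult
        for tok in ngram:
            h = (h * mult + tok + 1) % (2**61 - 1)
        return h % self.table_size

    def lookup(self, token_ids, pos):
        """Return the concatenated rows for the N-gram ending at `pos`.

        Cost is one hash and one row read per head, independent of how
        many rows the tables hold.
        """
        start = max(0, pos - self.n + 1)
        ngram = tuple(token_ids[start:pos + 1])
        out = []
        for head in range(self.num_heads):
            out.extend(self.tables[head][self._slot(ngram, head)])
        return out


def gated_merge(hidden, memory):
    """Context-dependent gate: the hidden state decides how much of the
    retrieved vector to add (assumed form: sigmoid of a scaled dot product)."""
    score = sum(h * m for h, m in zip(hidden, memory)) / math.sqrt(len(hidden))
    gate = 1.0 / (1.0 + math.exp(-score))
    return [h + gate * m for h, m in zip(hidden, memory)]


mem = HashedNgramMemory(n=2, num_heads=4, table_size=64, dim=8)
tokens = [5, 9, 5, 9]
first = mem.lookup(tokens, 1)   # bigram (5, 9)
second = mem.lookup(tokens, 3)  # the same bigram later in the sequence
hidden = [0.1] * (mem.num_heads * mem.dim)
merged = gated_merge(hidden, second)
```

Using several hash heads means two different N-grams that collide in one table are unlikely to collide in all of them, so the concatenated vector still tends to distinguish them. The gate lets the surrounding context suppress a retrieved vector that does not fit.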
The surprising empirical result is a U-shaped scaling law in sparsity allocation. At iso-parameter and iso-FLOPs budgets, pure MoE underperforms hybrid MoE+Engram allocations, and pure Engram also underperforms. There is an optimum: some capacity should go to conditional computation (logic), some to conditional memory (lookup). The curve has a single minimum loss; sliding too far in either direction degrades performance.
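To make the allocation sweep concrete, here is a small sketch of how a fixed sparse-parameter budget could be divided between experts and memory rows. The function `split_sparse_budget` and every number in it are hypothetical. It assumes the count of experts activated per token stays fixed and that lookups cost roughly constant time, so per-token compute stays approximately level across the sweep. It shows only the bookkeeping and does not reproduce any loss values.

```python
def split_sparse_budget(total_sparse_params, moe_fraction, params_per_expert, embed_dim):
    """Divide a fixed sparse-parameter budget between MoE experts and memory rows.

    Hypothetical helper: `moe_fraction` is the share given to conditional
    computation; the remainder goes to the lookup table.
    """
    if not 0.0 <= moe_fraction <= 1.0:
        raise ValueError("moe_fraction must lie in [0, 1]")
    moe_params = int(total_sparse_params * moe_fraction)
    num_experts = moe_params // params_per_expert
    # Whatever the experts do not use is spent on embedding rows.
    memory_params = total_sparse_params - num_experts * params_per_expert
    table_rows = memory_params // embed_dim
    return {"num_experts": num_experts, "table_rows": table_rows}


# Sweep from pure memory (0.0) to pure MoE (1.0) at a constant total budget.
sweep = {
    frac: split_sparse_budget(1_000_000, frac, 10_000, 100)
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0)
}
```

Each point in `sweep` has the same total parameter count, so differences in measured loss across the sweep would reflect the allocation and not the model size.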
More surprising: the gains are not confined to knowledge retrieval (MMLU +3.4, CMMLU +4.0). They are comparable or larger in general reasoning (BBH +5.0, ARC-Challenge +3.7) and also appear in code and math (HumanEval +3.0, MATH +2.4), with BBH showing the largest of the reported gains. The mechanistic interpretation: Engram relieves the backbone's early layers of "static reconstruction", the labor of approximating N-gram statistics through attention and MLPs. With that labor offloaded, early layers can be repurposed for deeper composition. Effectively, Engram deepens the network without adding layers, by freeing early-layer parameters from local work.
The long-context implication is striking. By delegating local dependencies to lookups, attention capacity is freed for global context. Multi-Query NIAH retrieval rises from 84.2 to 97.0. This suggests the long-context bottleneck is not pure context length but attention's dual burden: it must simultaneously do local approximation and global integration. Separating those labors helps both.
The architectural framing — sparsity has multiple axes, computation and memory are complementary — sets up "memory primitives" as first-class design objects for next-generation sparse models. Most prior memory-augmented work treated external memory as a workaround for parametric limits; Engram positions conditional memory as a co-equal primitive.
Related concepts in this collection
- Can retrieval knowledge compress into a tiny parametric model?
  Can the information stored in large non-parametric retrieval datastores be compressed into a small trainable module? This matters because it could combine retrieval's knowledge benefits with the speed of pure parametric methods.
  Memory Decoder compresses non-parametric retrieval into a parametric module; Engram goes in the inverse direction, adding a lookup primitive to parametric models.
- Can neural memory modules scale language models beyond attention limits?
  Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
  Titans/Miras add neural memory as an architectural component; Engram is the static-lookup analog, complementing rather than competing.
- Can recursive subtask trees overcome context window limits?
  Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.
  TIMRUN frees attention from the long-history burden via pruning; Engram frees attention from the local-statistics burden via lookup; both reframe attention's job.
Original note title
conditional memory is a complementary sparsity axis to conditional computation — hybrid lookup plus MoE beats pure MoE at iso-parameter and iso-FLOPs