Reasoning and Learning Architectures

Can lookup memory and computation work together better than either alone?

Mixture-of-Experts handles dynamic logic, but static knowledge might need a different mechanism. Can a hybrid approach combining conditional computation with fast lookup outperform pure sparse models?

Note · 2026-05-18 · sourced from Memory

Transformers have one sparsity primitive: conditional computation via Mixture-of-Experts, where dynamic logic routes through sparsely activated parameters. Engram (2601.07372) argues this is incomplete. Knowledge has a different shape from logic. Static facts ("Jacobi was born in 1804") are not dynamic logic; they are key-value lookups. Forcing them through computation wastes capacity on simulating retrieval.

Engram introduces conditional memory as the missing sparsity axis. The instantiation is a modernized N-gram embedding table: local context serves as the key, and each lookup into a massive embedding store takes constant time, O(1), regardless of table size. The modernizations matter: tokenizer compression, multi-head hashing, contextualized gating, and multi-branch integration. Classical N-grams failed because they could not compose; these adaptations make them composable with the surrounding transformer.
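A minimal sketch of the lookup idea, not Engram's implementation: trailing N-grams are hashed by several independent heads into fixed-size tables, the retrieved rows are summed, and a context-dependent gate decides how much of the result to add to the hidden state. The class and function names, the BLAKE2b hash, the summation, and the sigmoid dot-product gate are all assumptions chosen for illustration; tokenizer compression and multi-branch integration are left out.

```python
import hashlib
import math
import random


class HashedNgramMemory:
    """Toy conditional-memory table: trailing N-grams map to hashed embedding rows."""

    def __init__(self, num_slots=1024, dim=8, num_heads=2, max_n=3, seed=0):
        rng = random.Random(seed)
        self.num_slots = num_slots
        self.dim = dim
        self.num_heads = num_heads
        self.max_n = max_n
        # One table per hash head. Random rows stand in for learned embeddings.
        self.tables = [
            [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_slots)]
            for _ in range(num_heads)
        ]

    def _slot(self, ngram, head):
        # Deterministic hash of (head, ngram) to a slot index (hypothetical scheme).
        key = f"{head}:{','.join(map(str, ngram))}".encode()
        digest = hashlib.blake2b(key, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.num_slots

    def lookup(self, token_ids, position):
        """Sum the rows for every 2..max_n-gram that ends at `position`."""
        out = [0.0] * self.dim
        for n in range(2, self.max_n + 1):
            start = position - n + 1
            if start < 0:
                continue  # not enough left context for this N
            ngram = tuple(token_ids[start:position + 1])
            for head in range(self.num_heads):
                row = self.tables[head][self._slot(ngram, head)]
                out = [o + r for o, r in zip(out, row)]
        return out


def gated_merge(hidden, memory):
    """Add memory to the hidden state, scaled by a sigmoid gate on their dot product."""
    score = sum(h * m for h, m in zip(hidden, memory)) / math.sqrt(len(hidden))
    gate = 1.0 / (1.0 + math.exp(-score))
    return [h + gate * m for h, m in zip(hidden, memory)]
```

Each lookup costs a fixed number of hashes and row reads, (max_n - 1) × num_heads, no matter how many slots the tables hold. Two sequences that share the same trailing tokens retrieve the same vector, which is why the gate has to look at the hidden state: it is the only place where wider context can suppress a misleading local match or a hash collision.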

The surprising empirical result is a U-shaped scaling law in sparsity allocation. At iso-parameter and iso-FLOPs budgets, pure MoE underperforms hybrid MoE+Engram allocations, and pure Engram also underperforms. There is an optimum: some capacity should go to conditional computation (logic), some to conditional memory (lookup). The loss curve has a single minimum; sliding too far in either direction degrades performance.
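To make the allocation axis concrete, here is a small accounting sketch of the iso-parameter side only: a fixed sparse-parameter budget is divided between MoE experts and lookup-table rows by a single fraction. The function name, the budget, and the per-expert and per-row sizes are hypothetical, not figures from the paper, and the sketch says nothing about where the loss minimum falls.

```python
def split_sparse_budget(total_params, moe_fraction, params_per_expert, embed_dim):
    """Split a fixed sparse-parameter budget between experts and table rows.

    All arguments are illustrative; none come from the paper.
    """
    if not 0.0 <= moe_fraction <= 1.0:
        raise ValueError("moe_fraction must be in [0, 1]")
    moe_params = int(total_params * moe_fraction)
    num_experts = moe_params // params_per_expert
    # Whatever the experts do not use goes to the lookup table.
    memory_params = total_params - num_experts * params_per_expert
    table_rows = memory_params // embed_dim
    return {"num_experts": num_experts, "table_rows": table_rows}


# Sweep the allocation from pure memory (0.0) to pure MoE (1.0).
for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(fraction, split_sparse_budget(1_000_000, fraction, 50_000, 100))
```

The two ends of the sweep are the pure-Engram and pure-MoE configurations that the note says underperform; the hybrid allocations lie in between.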

More surprising: the gains are not concentrated in knowledge retrieval (MMLU +3.4, CMMLU +4.0). They extend to general reasoning (BBH +5.0, the largest gain listed; ARC-Challenge +3.7) and to code and math (HumanEval +3.0, MATH +2.4). The mechanistic interpretation: Engram relieves the backbone's early layers of "static reconstruction", the labor of approximating N-gram statistics through attention and MLPs. With that labor offloaded, early layers can be repurposed for deeper composition. Effectively, Engram deepens the network without adding layers, by freeing early-layer capacity from local work.

The long-context implication is striking. Delegating local dependencies to lookups frees attention capacity for global context. Multi-Query NIAH retrieval rises from 84.2 to 97.0. This suggests the long-context bottleneck is not context length alone but attention's dual burden: it must do local approximation and global integration at the same time. Separating those labors helps both.

The architectural framing — sparsity has multiple axes, computation and memory are complementary — sets up "memory primitives" as first-class design objects for next-generation sparse models. Most prior memory-augmented work treated external memory as a workaround for parametric limits; Engram positions conditional memory as a co-equal primitive.


Paper: Conditional Memory via Scalable Lookup

Original note title

conditional memory is a complementary sparsity axis to conditional computation — hybrid lookup plus MoE beats pure MoE at iso-parameter and iso-FLOPs