Why does in-weight memorization fail compared to tool-based fact access?
This explores why storing facts inside a model's weights runs into hard limits that giving the model a tool to look things up does not — and what the corpus says is actually going wrong with in-weight memory.
The cleanest answer in the corpus is a capacity argument: in-weight factual recall is provably bounded by model size, while tool use lets a model reach an unbounded set of facts through a surprisingly simple internal circuit. The same work shows the hidden cost of trying to cram more in: fine-tuning new facts into the weights overwrites prior knowledge and degrades general capability (see "Can models store unlimited facts without growing larger?"). So the problem is not just that weights run out of room; pushing facts in actively damages what was already there.
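The capacity argument above can be illustrated with a simple counting sketch. This is not the formal proof from the cited work; it only shows the shape of the claim. The names `max_storable_facts` and `LookupTool`, and the parameters `bits_per_param` and `bits_per_fact`, are hypothetical and chosen for illustration.

```python
from typing import Dict, Optional


def max_storable_facts(n_params: int, bits_per_param: float, bits_per_fact: float) -> int:
    """Counting-style upper bound on facts a fixed parameter budget can encode.

    Assumption: each parameter carries at most `bits_per_param` bits and each
    fact needs at least `bits_per_fact` bits. Both numbers are illustrative.
    """
    if n_params < 0 or bits_per_param < 0 or bits_per_fact <= 0:
        raise ValueError("sizes must be non-negative and bits_per_fact positive")
    return int((n_params * bits_per_param) // bits_per_fact)


class LookupTool:
    """Stand-in for an external fact store: capacity grows with the store,
    not with the model, and writes never touch model parameters."""

    def __init__(self) -> None:
        self._facts: Dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self._facts[key] = value

    def read(self, key: str) -> Optional[str]:
        return self._facts.get(key)


if __name__ == "__main__":
    # The in-weight bound scales linearly with parameter count.
    print(max_storable_facts(n_params=1_000_000, bits_per_param=2.0, bits_per_fact=64.0))

    # The tool keeps accepting facts without any change to the "model".
    tool = LookupTool()
    for i in range(10_000):
        tool.write(f"fact-{i}", f"value-{i}")
    print(tool.read("fact-9999"))
```

The point of the contrast is that the first quantity can only grow by growing the model, while the second grows by adding entries to storage the model merely queries.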
That damage has a known location. In GPT-Neo, memorized content leaves a fingerprint in the lowest layers: larger gradients there and an attention head that attends to rare tokens. This is the same machinery that direct fine-tuning disturbs (see "Where does a model store memorized paragraphs?"). It also explains why decoding-time approaches that leave base weights untouched preserve knowledge so much better: proxy-tuning closes 88-91% of the alignment gap while beating direct fine-tuning on knowledge tasks, because direct fine-tuning corrupts storage in those lower layers (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). The lesson repeats: the part of the network that holds facts is fragile, and editing it is destructive.
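A minimal sketch of the decoding-time idea follows. It assumes the distributional shift is the logit difference between a small tuned model (the "expert") and its untuned counterpart (the "anti-expert"), added to the base model's logits before the softmax. The text above does not spell out this formula, so treat it as an assumption; the function names and the toy three-token vocabulary are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(
    base_logits: Sequence[float],
    expert_logits: Sequence[float],
    antiexpert_logits: Sequence[float],
) -> List[float]:
    """Next-token distribution with a decoding-time shift applied.

    The base model's weights are never modified; only its output logits
    are offset by (expert - antiexpert) at each decoding step.
    """
    if not (len(base_logits) == len(expert_logits) == len(antiexpert_logits)):
        raise ValueError("all logit vectors must share one vocabulary size")
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)


if __name__ == "__main__":
    base = [2.0, 1.0, 0.0]        # large base model prefers token 0
    expert = [0.0, 3.0, 0.0]      # small tuned model prefers token 1
    antiexpert = [0.0, 0.0, 0.0]  # small untuned model is indifferent
    print(proxy_tuned_probs(base, expert, antiexpert))
```

When the expert and anti-expert agree, the offset vanishes and the base distribution is returned unchanged, which is one way to see why whatever is stored in the base weights is left alone.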
There's also a quality problem with memorized facts beyond their quantity. Models that lean on what they memorized show attestation bias: they predict that a claim is entailed when the claim itself looks familiar from training, not because the stated premise supports it (see "Do LLMs predict entailment based on what they memorized?"). Memorized knowledge is also frozen in time. Search agents trained on live retrieval beat static memorized models on hard questions not by reasoning better but by avoiding the temporal staleness and lossy compression that come from storing everything in weights (see "Why do search agents beat memorized retrieval on hard questions?").
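The attestation-bias check can be sketched as a random-premise probe: swap each premise for an unrelated one and see whether entailment predictions stay high for hypotheses that are attested in training data. This is an illustrative harness, not the cited study's code. `predict_entailment` is a hypothetical callable standing in for a model, and the `attested` flag is assumed to be supplied with the data.

```python
import random
from typing import Callable, Dict, List, Tuple

# (premise, hypothesis, hypothesis_is_attested_in_training_data)
Example = Tuple[str, str, bool]


def random_premise_entailment_rates(
    examples: List[Example],
    predict_entailment: Callable[[str, str], bool],
    seed: int = 0,
) -> Dict[str, float]:
    """Rate of 'entailed' predictions after premises are shuffled.

    Shuffling breaks the premise-hypothesis link, so a model that reads
    the premise should rarely predict entailment. A plain shuffle may leave
    a few premises in place; a stricter probe would force every one to move.
    """
    rng = random.Random(seed)
    premises = [premise for premise, _, _ in examples]
    rng.shuffle(premises)

    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for random_premise, (_, hypothesis, attested) in zip(premises, examples):
        totals[attested] += 1
        if predict_entailment(random_premise, hypothesis):
            hits[attested] += 1

    return {
        "attested": hits[True] / totals[True] if totals[True] else 0.0,
        "unattested": hits[False] / totals[False] if totals[False] else 0.0,
    }
```

A large gap between the two rates under shuffled premises is the signature described above: the predictions track whether the hypothesis was memorized rather than what the premise says.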
The deeper reframing is that weights may simply be the wrong substrate for facts in the first place. One large pretraining study finds that reasoning rides on broad, transferable *procedural* knowledge, while factual recall depends on narrow, document-specific memorization of the exact target. These are two different things the network does, and only one of them generalizes (see "Does procedural knowledge drive reasoning more than factual retrieval?"). If facts are inherently lookup-shaped rather than skill-shaped, externalizing them is the natural fit. That is also why routing a query to the right external structure (a table, a graph, a catalogue) outperforms uniform retrieval and, by extension, undifferentiated in-weight storage (see "Can routing queries to task-matched structures improve RAG reasoning?").
The interesting wrinkle: in-weight storage isn't a dead end; it just can't be done by brute-force fine-tuning. A 'sleep phase' approach consolidates in-context knowledge into weights through distillation and rehearsal *without* the catastrophic forgetting that plagues direct training (see "Can models consolidate memories during offline sleep phases?"). This suggests the real failure is not weights-as-memory but the crude way we write to them.
Sources (8 notes)
A formal proof and experiments show in-weight memorization is bounded by model size, while tool-use enables unbounded factual recall through a simple circuit. In-weight finetuning also degrades general capability by overwriting prior knowledge.
Memorized paragraphs leave a distinctive fingerprint in GPT-Neo: larger gradients in lower layers, concentration in a specific low-layer attention head attending to rare tokens, and dependence on a few early-prefix tokens. This localization makes memorization targetable for unlearning.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.
DeepResearcher agents trained on live web search beat static knowledge models on knowledge-intensive tasks. The mechanism is not better reasoning but retrieval: real-time search avoids temporal bounds and probabilistic compression that plague training-data memorization.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router that chooses among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.
The Sleep paradigm uses Knowledge Seeding (distilling smaller networks into larger ones) and Dreaming (RL-generated rehearsal) to consolidate in-context knowledge into weights without forgetting. Gains appear in long-context understanding, few-shot reasoning, and continual learning.