How many document exposures does procedural knowledge versus factual information require?

This explores a finding about how models learn differently from their training data: factual recall leans on seeing the specific target fact (narrow, memorized), while reasoning skills get assembled from many documents that never state the answer.

The short version is that the two kinds of knowledge sit at opposite ends of a spectrum. When researchers traced 5 million pretraining documents to see what a model actually leaned on for a given output, factual retrieval turned out to be narrow and document-specific: to answer a factual question reliably, the model needs to have memorized that exact fact from a small number of sources that state it directly. Reasoning was the opposite: it drew on a broad, diffuse spread of documents demonstrating *procedures* (how to work through a problem), none of which contained the target answer (see "Does procedural knowledge drive reasoning more than factual retrieval?"). So 'how many exposures' isn't one number: a fact wants a few direct hits, a procedure wants many indirect ones.

What makes this counterintuitive is that procedural knowledge generalizes *because* it's spread thin across many examples. The model isn't copying one worked solution; it's averaging over a style of working that recurs in different guises, which is why reasoning transfers to problems it never saw. Factual recall can't transfer the same way — if the fact wasn't in the training data, no amount of reasoning style conjures it.

The corpus has a striking sibling to this. A separate line of work shows models can reconstruct knowledge that was *never stated in any single document*, piecing it together from implicit hints scattered across the training set: inferring a censored city's identity from fragments of distance relationships, for instance (see "Can LLMs reconstruct censored knowledge from scattered training hints?"). That's the procedural-style, distribution-wide learning at work on facts: when direct exposure is denied, the model falls back on connecting many weak signals. It's the same mechanism the reasoning paper describes, just pointed at a fact instead of a method.
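To make the city example concrete, here is a minimal sketch of why scattered distance fragments can pin down an identity that no single fragment states. It is not how a language model does this internally. It only shows that the hints jointly carry the answer. The city coordinates are approximate, the three distance hints are rounded illustrative values (each imagined as a separate document), and the names `HINTS_KM`, `mismatch`, and `infer_hidden_city` are hypothetical.

```python
import math

# Approximate (latitude, longitude) in degrees; values are illustrative.
CITIES = {
    "Paris": (48.8566, 2.3522),
    "London": (51.5074, -0.1278),
    "Vienna": (48.2082, 16.3738),
    "Lisbon": (38.7223, -9.1393),
    "Madrid": (40.4168, -3.7038),
    "Berlin": (52.5200, 13.4050),
    "Rome": (41.9028, 12.4964),
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


# Each hint stands in for one document: "the hidden city is about d km
# from <known city>". None of them names the hidden city.
HINTS_KM = {"Madrid": 1050, "Berlin": 880, "Rome": 1105}


def mismatch(candidate):
    """Sum of squared errors between a candidate's distances and the hints."""
    return sum(
        (haversine_km(CITIES[candidate], CITIES[ref]) - d) ** 2
        for ref, d in HINTS_KM.items()
    )


def infer_hidden_city(candidates):
    """Pick the candidate whose distances best agree with all hints at once."""
    return min(candidates, key=mismatch)


guess = infer_hidden_city(["Paris", "London", "Vienna", "Lisbon"])
print(guess)
```

Any one hint is compatible with a whole ring of locations. Only the combination singles out a city, which is the sense in which the knowledge lives in the distribution of documents and not in any one of them.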

There's a practical seam here too. If factual knowledge is the brittle, exposure-hungry part, then knowing *when to look it up* rather than recall it becomes the real skill. One approach frames each reasoning step as a decision about whether to retrieve external knowledge or trust the model's internal store, and gains accuracy precisely by switching to retrieval for the narrow facts while leaning on parametric procedure for the reasoning (see "When should language models retrieve external knowledge versus use internal knowledge?"). The division of labor mirrors the pretraining finding: retrieve the facts, internalize the procedures.
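The per-step decision can be sketched as a simple router. This is a toy under stated assumptions, not the method from the note: the approach described there learns when to retrieve, whereas this sketch swaps in a fixed confidence threshold. The `Step` class, the `internal_confidence` scores, and the threshold of 0.7 are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    question: str
    # Hypothetical estimate (0 to 1) that the model's internal answer is right.
    internal_confidence: float


def plan_actions(steps: List[Step], threshold: float = 0.7) -> List[Tuple[str, str]]:
    """Label each step 'internal' or 'retrieve' using a fixed threshold.

    A learned policy would replace this comparison; the threshold is only
    a stand-in for illustration.
    """
    plan = []
    for step in steps:
        action = "internal" if step.internal_confidence >= threshold else "retrieve"
        plan.append((step.question, action))
    return plan


# Made-up example: procedural steps score high, the narrow fact scores low.
steps = [
    Step("Break the question into sub-questions", 0.95),
    Step("What year was the bridge completed?", 0.30),
    Step("Subtract the two years", 0.90),
]
plan = plan_actions(steps)
for question, action in plan:
    print(f"{action:>8}: {question}")
```

The design point is that the router never has to know the fact itself, only how much to trust the internal store for that step, so retrieval is spent on the narrow facts and skipped for the procedure.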

The thing worth carrying away: the question of 'how many exposures' quietly reframes what we think a language model is doing. It isn't one memory system with a single learning curve — it's two. One bank fills up by direct repetition and stays local; the other distills a way of operating from a whole distribution and travels.

Sources (3 notes)

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows that reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall, which depends on narrow, document-specific memorization of target facts.

Can LLMs reconstruct censored knowledge from scattered training hints?

Language models perform out-of-context reasoning across the full training distribution, reconstructing information never explicitly stated in any single document. Experiments show models can infer city identities from scattered distance relationships and apply them downstream without in-context learning.

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models retrieval-augmented reasoning as a Markov Decision Process in which, at each step, the model learns whether to retrieve or rely on parametric knowledge. The reported 21.99% improvement in answer accuracy comes from better-targeted retrieval and the elimination of noise from unnecessary external knowledge.
