What makes procedural knowledge in documents generalize better than facts?
This explores why a model trained on documents that *show how to reason through problems* transfers more broadly than one trained on documents that *state facts* — and what that difference reveals about what 'generalization' actually is.
Procedural knowledge (the worked steps, methods, and ways of getting to an answer) generalizes more than raw facts, and the corpus has a surprisingly clear answer as to why: the two are stored and retrieved in fundamentally different ways. An analysis of five million pretraining documents found that when a model reasons, it draws on a *broad, diffuse* set of documents that share a method: many sources, none of them containing the specific answer. When it recalls a fact, it leans on a *narrow* set of documents that contain that exact fact (see: Does procedural knowledge drive reasoning more than factual retrieval?). Facts are point-lookups: you either memorized the target or you didn't. Procedures are patterns spread across contexts, so they survive being lifted out of any single one. Generalization, in other words, isn't a property of the knowledge; it's a property of how widely the knowledge was *demonstrated*.
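One way to make "broad and diffuse" versus "narrow" concrete is to ask how much of the total influence on an answer sits in the single most influential document. The sketch below is only an illustration of that idea: the `top_k_share` helper and the score lists are hypothetical, toy numbers, not the metric or data used in the five-million-document analysis.

```python
def top_k_share(scores, k):
    """Fraction of total absolute influence held by the k highest-scoring documents."""
    magnitudes = sorted((abs(s) for s in scores), reverse=True)
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(magnitudes[:k]) / total


# Made-up per-document influence scores for two kinds of query (assumption).
# Factual recall: one document carries almost all of the influence.
factual_scores = [9.0, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
# Reasoning: influence is spread thinly across many documents.
reasoning_scores = [1.2, 1.1, 1.0, 1.0, 0.9, 0.9, 1.0, 1.1, 0.9, 0.9]

factual_share = top_k_share(factual_scores, k=1)
reasoning_share = top_k_share(reasoning_scores, k=1)

print(f"top-1 share, factual:   {factual_share:.2f}")
print(f"top-1 share, reasoning: {reasoning_share:.2f}")
```

In this toy setup the factual query is dominated by one document while the reasoning query is not, which is the shape of the contrast the analysis describes. A real study would need an actual influence estimate per document, which this sketch does not attempt.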
That framing pays off when you look at what happens when models try to reason without genuine procedural grounding. Chain-of-thought reasoning turns out to be distribution-bounded: push a model outside the task shapes, lengths, and formats it saw in training and the reasoning stays fluent but quietly goes logically invalid, imitating the *form* of a procedure without the transferable substance (see: Does chain-of-thought reasoning actually generalize beyond training data?). The same hollowness shows up in entailment: models predict that a premise supports a conclusion based on whether they've *seen the conclusion before*, not on whether the logic holds. That is fact-attestation masquerading as inference (see: Do LLMs predict entailment based on what they memorized?). Both are cases where memorized material is doing the work that a procedure should be doing, and the seams show the moment you move off the beaten path.
There's a deeper hint about *where* procedural knowledge lives. When researchers traced which tokens actually carry the learning signal in reasoning training, they found it concentrated in a small minority: the high-entropy 'forking' tokens where the model decides which way to go. Training on just that ~20% of tokens matches or exceeds full training (see: Do high-entropy tokens drive reasoning model improvements?). Procedure, it seems, is encoded at the decision points, not in the filler, which is exactly why it can be reused: a decision rule applies across many situations, a fact applies to one.
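To see what "high-entropy forking tokens" means mechanically, here is a minimal sketch that scores each position by the Shannon entropy of its next-token distribution and keeps the top fraction. The `forking_mask` helper, the per-sequence ranking, and the toy distributions are all assumptions for illustration; the work summarized here may select tokens differently (for example, with a threshold computed over a whole batch).

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_mask(distributions, fraction=0.2):
    """Mark the highest-entropy positions; True means 'keep this token's loss'."""
    entropies = [entropy(p) for p in distributions]
    keep = max(1, round(fraction * len(entropies)))
    # Rank positions by entropy, highest first, and keep the top `keep`.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    chosen = set(ranked[:keep])
    return [i in chosen for i in range(len(entropies))]


# Toy next-token distributions for a five-token sequence (made-up numbers).
distributions = [
    [0.97, 0.01, 0.01, 0.01],     # near-certain filler
    [0.25, 0.25, 0.25, 0.25],     # maximally uncertain: a "fork"
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]

mask = forking_mask(distributions, fraction=0.2)
print(mask)
```

In a training loop, a mask like this would multiply the per-token loss so that only the selected positions contribute gradient. That step is left out here because it depends on the training framework.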
The practical lesson the corpus keeps circling is that the most transferable representations are *partial and structural* rather than complete and literal. Partial symbolic abstraction beats both plain language and full formalization, because it keeps the reusable structure while preserving meaning (see: Why does partial formalization outperform full symbolic logic?). Symbolic rules pulled from a knowledge graph's *shape* outperform retrieval that relies on semantic similarity alone, because a navigational rule generalizes where a matched fact doesn't (see: Can symbolic rules from knowledge graphs guide complex reasoning?). And reading agents that compress a document into a 'gist' before knowing the question outperform fact-by-fact retrieval: the gist is a procedure for *where to look*, not a stored answer (see: Can LLMs read long documents like humans do?).
So the thing you didn't know you wanted to know: 'generalizes better' isn't really about procedures being smarter than facts. It's that a procedure is, by construction, a thing that was shown to work in many places — and anything demonstrated across many contexts is portable to a new one, while anything stored in exactly one place can only ever be retrieved, never transferred.
Sources (7 notes)
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.
SymAgent derives symbolic rules from KG structure using LLM reasoning to create navigational plans that align natural language with graph topology. This approach captures structural reasoning patterns explicitly, outperforming retrieval methods that rely on semantic similarity alone.
ReadAgent compresses documents into gist memories before knowing the task, then retrieves details only when needed, extending effective context 3–20× and outperforming retrieval baselines on long-document QA.