Does reasoning rely on procedural knowledge or factual memorization?
Explores whether LLMs learn reasoning through general procedural patterns across documents or through memorizing specific facts. Understanding this distinction matters for training data strategy.
The "Procedural Knowledge in Pretraining Drives Reasoning" paper analyzes which pretraining documents most influence LLM reasoning, ranking 5 million documents by their influence on model completions. The key finding: models do not reason by retrieval. For reasoning tasks, the positively influential documents contain procedural knowledge — descriptions of how to reach a solution — rather than the specific facts needed for the answer.
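The document-ranking idea can be sketched with a simplified gradient-based influence score (a TracIn-style stand-in; the paper itself uses EK-FAC influence functions). The tiny linear model and toy "documents" below are purely illustrative assumptions:

```python
# Minimal sketch of gradient-based influence scoring: a document's influence
# on a query is approximated by the dot product of their loss gradients.
# The linear model and toy data are illustrative assumptions, not the
# paper's actual setup (which uses EK-FAC influence functions on an LLM).

def grad(w, x, y):
    """Gradient of squared loss (w.x - y)^2 for a linear model with weights w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [2 * err * xi for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

w = [0.5, -0.3, 0.8]                          # stand-in for model parameters
docs = [([1, 0, 2], 1.0), ([0, 1, 1], 0.5),
        ([2, 1, 0], 2.0), ([1, 1, 1], 0.0)]   # "pretraining documents"
query = ([1, 2, 0], 1.5)                      # a "reasoning completion"

# Positive score: training on that document would lower the query loss.
q_grad = grad(w, *query)
scores = [dot(grad(w, x, y), q_grad) for x, y in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Ranking millions of documents this way surfaces which ones most help (or hurt) a given completion; the paper's contribution is analyzing what kinds of documents land at the top for reasoning versus factual queries.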
Three contrasts with factual recall:
Generality: models rely on a broader, more general set of documents when reasoning than when answering factual questions. Factual recall draws on a narrow set of documents containing the target fact. Reasoning draws on a diffuse set of documents performing similar procedures.
Transferability: documents have similar influence on reasoning queries that require applying the same procedure to different numbers. The procedural knowledge transfers across specific instances — it's the method, not the content, that the model has learned.
Reliance distribution: the model needs to see factual information more often (across more documents) to memorize it, while procedural patterns can be learned from fewer but more diverse demonstrations.
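The transferability contrast suggests a concrete check: per-document influence scores for two queries that apply the same procedure to different numbers should correlate strongly. A minimal sketch, where the influence scores are hypothetical stand-ins:

```python
# Sketch of the transferability check: if two queries apply the same
# procedure to different numbers, the same documents should influence both.
# The influence values below are hypothetical, not taken from the paper.
import statistics

def pearson(a, b):
    """Pearson correlation of two equal-length score lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return cov / var

# Hypothetical influence of six documents on two same-procedure queries
# (e.g. computing a slope from two points, with different coordinates):
inf_query_a = [0.9, 0.7, 0.1, -0.2, 0.5, 0.0]
inf_query_b = [0.8, 0.6, 0.2, -0.1, 0.6, 0.1]

r = pearson(inf_query_a, inf_query_b)
# A high r supports transferability: the same procedural documents
# help both instances, independent of the specific numbers involved.
```

For factual queries the analogous correlation would be expected only between queries about the same fact, since influence there concentrates in the few documents containing it.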
This connects to the knowledge/reasoning layer separation. As explored in "Why does reasoning training help math but hurt medical tasks?", lower layers store memorized facts (requiring document-specific exposure) while higher layers encode procedural strategies (learnable from general demonstrations); the procedural knowledge finding supplies the data-level explanation for that architectural split.
The implication for training data curation: reasoning capability benefits more from diverse demonstrations of procedures than from exhaustive factual coverage. Quality and diversity of reasoning demonstrations may matter more than volume for building reasoning capability — consistent with "Can models improve themselves on tasks without verifiable answers?".
Source: Training Fine Tuning
Related concepts in this collection
- Why does reasoning training help math but hurt medical tasks?
  Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
  architectural explanation for the data-level finding: procedural knowledge lives in higher layers
- Can models improve themselves on tasks without verifiable answers?
  Most self-improvement methods require objective correctness signals, limiting them to math and code. Can models self-improve on open-ended instruction tasks where answers can't be automatically verified?
  consistent: small amounts of diverse procedural demonstration catalyze reasoning
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  procedural knowledge from pretraining IS the latent capability that minimal signals unlock
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  tension: procedural knowledge may be a form of heuristic rather than genuine reasoning
- Can text-trained models compress images better than specialized tools?
  Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.
  procedural knowledge compresses better than factual knowledge (one procedure covers many instances), directly explaining why compression = generalization is more powerful for reasoning than for factual recall
Original note title: procedural knowledge in pretraining documents drives reasoning generalization unlike factual retrieval which requires document-specific memorization