Why does reasoning training help math but hurt medical tasks?
Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
The Decoupling Knowledge and Reasoning paper proposes a testable two-phase model of LLM inference by contrasting fast thinking (no chain-of-thought) with slow thinking (CoT-enabled). Fast thinking engages Phase 1 only: knowledge retrieval from lower network layers. Slow thinking adds Phase 2: reasoning adjustment in higher layers. Comparing the two isolates each phase's contribution.
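The contrast is cheap to reproduce in outline. A minimal sketch, assuming an OpenAI-compatible chat endpoint — the model name, system prompts, and scoring note are illustrative, not the paper's protocol:

```python
# Sketch of the fast- vs slow-thinking contrast; assumes an OpenAI-compatible
# endpoint. Model name, prompts, and items are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FAST = "Answer with the final answer only. Do not explain."   # Phase 1 only
SLOW = "Think step by step, then state the final answer."     # Phase 1 + Phase 2

def ask(question: str, mode: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": FAST if mode == "fast" else SLOW},
            {"role": "user", "content": question},
        ],
        temperature=0.0,
    )
    return resp.choices[0].message.content

# Per domain, acc(slow) - acc(fast) estimates the Phase 2 contribution:
# positive on math/physics/chemistry, near zero or negative on medical items
# if reasoning adjustment overrides correctly retrieved knowledge.
```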
Across 15 LLMs on 3 datasets, three findings:
Domain-specificity of reasoning benefit: Phase 2 (reasoning adjustment) helps math, physics, and chemistry but can impair performance on knowledge-intensive domains. In medical tasks, the Phase 1 knowledge retrieved may be more reliable than the Phase 2 reasoning applied on top of it — reasoning adjustment introduces error rather than correcting it.
Scaling asymmetry: parameter scaling improves both phases, but knowledge improvement (Phase 1) dominates. Larger models know more, and this knowledge advantage outpaces the reasoning advantage. Scaling makes models more "prudent" (better at not making errors) across all domains, but only "more intelligent" (better at novel inference) in reasoning-intensive ones.
Layer localization: knowledge retrieval is primarily a lower-layer phenomenon; reasoning adjustment operates in higher layers. This is a functional architectural separation — not just a behavioral one. A layer-wise probing sketch follows this list.
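One way to probe the localization claim, sketched below under the assumption of a HuggingFace causal LM (the model name is a small stand-in, and the probing step itself is only described in comments):

```python
# Layer-wise readout for localizing knowledge vs. reasoning items.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; the paper studies far larger models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

@torch.no_grad()
def layerwise_states(prompt: str) -> list[torch.Tensor]:
    """Hidden state of the final token at every layer (n_layers + 1 vectors,
    including the embedding layer at index 0)."""
    out = model(**tok(prompt, return_tensors="pt"))
    return [h[0, -1] for h in out.hidden_states]

# Fit one linear probe per layer to predict the answer from these vectors.
# Under the two-phase model, probe accuracy on knowledge items should
# saturate in lower layers; on reasoning items, only in higher layers.
```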
The layer localization provides the mechanistic explanation for the SFT knowledge gap. CoT fine-tuning and RLVR modify higher-layer behavior. They cannot improve the lower-layer knowledge encoding that knowledge-intensive tasks depend on. Adding reasoning training to a model that lacks medical knowledge won't close the knowledge gap — it modifies a layer that isn't the bottleneck.
Architectural evidence for layer redundancy: "The Unreasonable Ineffectiveness of the Deeper Layers" (2403.17887) provides striking corroboration. Up to half of an LLM's layers can be pruned with minimal degradation on question-answering benchmarks, using a simple strategy: identify the optimal block of layers to prune via cross-layer similarity, then heal the model with QLoRA finetuning on a single A100 GPU. This implies either that current pretraining methods do not fully exploit the parameters in deeper layers, or that shallow layers play a disproportionately critical role in storing knowledge. Both interpretations reinforce the functional separation: if knowledge resides in lower layers, the deeper layers' contribution may be primarily redundant refinement rather than essential computation.
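The pruning heuristic is simple enough to sketch. The snippet below computes the paper's angular-distance measure between the residual-stream input to layer ℓ and to layer ℓ+n, under simplifying assumptions: a single pre-collected calibration batch, and the QLoRA healing step omitted.

```python
# Similarity-based block pruning: find the n-layer block whose removal
# perturbs the residual stream least. Calibration activations assumed given.
import math
import torch
import torch.nn.functional as F

def best_block_to_prune(states: list[torch.Tensor], n: int) -> int:
    """states[l]: input to layer l for all calibration tokens, shape [tokens, d].
    Returns the start l* of the n-layer block whose input and output are most
    similar, i.e. the block whose computation changes the stream the least."""
    num_layers = len(states) - 1
    best_l, best_dist = 0, float("inf")
    for l in range(num_layers - n + 1):
        cos = F.cosine_similarity(states[l], states[l + n], dim=-1)
        # angular distance d = arccos(cos) / pi, averaged over tokens
        dist = torch.arccos(cos.clamp(-1.0, 1.0)).mean().item() / math.pi
        if dist < best_dist:
            best_l, best_dist = l, dist
    return best_l
```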
Retrieval heads as mechanistic evidence: The "Retrieval Head" paper provides direct causal evidence for layer specialization. A sparse set of attention heads (fewer than 5%) is responsible for retrieving relevant information from long context. These retrieval heads are: (1) universal across model families, (2) intrinsic — they exist in short-context models and persist through context-length extension, (3) dynamically activated — some always attend to the required information while others activate contextually, and (4) causal — pruning them induces hallucination, while pruning non-retrieval heads leaves retrieval ability intact. Retrieval heads strongly influence CoT reasoning (which requires referring back to prior context) but minimally affect tasks where the model generates from intrinsic knowledge. This is a specific mechanistic instantiation of the lower-layer knowledge-retrieval function described above. See What mechanism enables models to retrieve from long context?.
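The detection recipe can be approximated in a few lines. A hedged sketch with hypothetical inputs: the paper credits a head per decoding step where the model copies a needle token and the head's top attention lands on that token; the version below simplifies the match to any needle position.

```python
# Approximate per-head retrieval scores from cached attention weights.
import torch

def retrieval_scores(attn: torch.Tensor, needle_pos: torch.Tensor,
                     copy_steps: list[int]) -> torch.Tensor:
    """attn: [steps, n_heads, ctx_len] attention of one layer during decoding;
    needle_pos: context indices holding the needle; copy_steps: decode steps
    whose emitted token copies a needle token. Returns a score per head: the
    fraction of copy steps on which the head's top attention hit the needle."""
    hits = torch.zeros(attn.shape[1])
    for t in copy_steps:
        top = attn[t].argmax(dim=-1)                           # [n_heads]
        hits += (top.unsqueeze(-1) == needle_pos).any(-1).float()
    return hits / max(len(copy_steps), 1)

# Heads that score high consistently across needle-in-a-haystack examples
# are retrieval heads; the paper finds they concentrate in a sparse <5% set.
```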
Latent concept hierarchy: The "Discovering Latent Concepts Learned in BERT" paper (2205.07237) confirms the layer hierarchy from a representation perspective. Lower layers dominate in learning shallow lexical concepts, while higher layers learn semantic relations. Critically, BERT learns novel concepts (e.g., animal categories, demographic groups) that do not adhere to predefined categorizations — the model discovers its own organizational structure. Several latent concepts draw on multiple properties spanning semantics, syntax, and morphology simultaneously, suggesting the layer separation is not clean but follows a general gradient.
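The discovery recipe itself is plain clustering. A minimal sketch, assuming per-layer contextual token embeddings have already been extracted; the paper uses agglomerative hierarchical clustering, and the cluster count here is a placeholder:

```python
# Per-layer concept discovery via clustering of contextual token embeddings.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def latent_concepts(token_vecs: np.ndarray, n_clusters: int = 50) -> np.ndarray:
    """token_vecs: [n_tokens, d] embeddings from one layer. Each returned
    cluster label marks a candidate latent concept; annotating clusters against
    lexical, syntactic, and semantic properties shows which dominate per layer."""
    return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(token_vecs)
```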
The "Procedural Knowledge in Pretraining Drives Reasoning" paper provides the data-level explanation that complements this architectural finding. By ranking 5 million pretraining documents by their influence on model completions, they show that reasoning draws on a diffuse set of documents containing procedural knowledge (descriptions of how to solve), while factual recall draws on narrow document sets containing the target fact. This maps directly onto the layer separation: lower layers store memorized facts (requiring document-specific exposure), while higher layers encode procedural strategies (learnable from general demonstrations of method). See Does reasoning rely on procedural knowledge or factual memorization?.
Source: Reasoning by Reflection; enriched from Training Fine Tuning, LLM Architecture
Related concepts in this collection
- Does medical AI need knowledge or reasoning more? Medical and mathematical domains may require fundamentally different AI training priorities. If medical accuracy depends primarily on factual knowledge while math depends on reasoning quality, should we build and evaluate these systems differently? Relation: layer localization is the mechanistic explanation for the behavioral pattern this note documents.
- Why doesn't mathematical reasoning transfer to medicine? Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements. Relation: transfer fails because SFT modifies higher-layer reasoning while the bottleneck is lower-layer knowledge; this paper makes that precise.
- Do language models actually use their encoded knowledge? Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing. Relation: layer localization explains the encoding-generation gap: knowledge in lower layers may be overridden by higher-layer reasoning adjustments that introduce error, producing the failure mode where the model "knows" the answer but generates an incorrect one.
- Can text-trained models compress images better than specialized tools? Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization. Relation: the compression framing maps onto the layer separation: lower layers compress facts (document-specific memorization), higher layers compress procedures (generalizable reasoning); the scaling caveat on adjusted compression may reflect redundancy in deeper layers.
Original note title: knowledge resides in lower network layers and reasoning in higher layers — this functional separation explains why reasoning training helps math but can impair knowledge-intensive domains