Why do higher network layers capture procedural knowledge but lower layers store facts?

This explores the finding that LLMs seem to split labor by depth — facts retrieved in lower layers, reasoning and procedure assembled in higher ones — and asks why that division shows up.

The cleanest statement of the pattern comes from a two-phase inference model: knowledge retrieval operates in lower network layers, while reasoning adjustment happens in higher layers (see "Why does reasoning training help math but hurt medical tasks?"). This is more than a curiosity because it has practical bite: it explains why training a model harder on reasoning improves math but can quietly *degrade* knowledge-heavy domains like medicine. If the two functions live in different parts of the network, training that reshapes one can degrade the other.

Why would depth organize itself this way? A useful clue is that procedural and factual knowledge are sourced differently during pretraining in the first place. Reasoning leans on broad, transferable procedures pulled from many diverse documents, while factual recall depends on narrow, document-specific memorization of a single target fact (see "Does procedural knowledge drive reasoning more than factual retrieval?"). Facts are point lookups; procedures are patterns abstracted across thousands of examples. It makes sense that a lookup resolves early (you either have the entry or you don't) and that the slower work of combining and transforming those entries stacks up afterward: higher layers operate on what lower layers have already surfaced.

The "reasoning happens higher, and late" story gets sharper with interpretability work showing transformers doing real computation early and then *rewriting* it. In models trained with hidden chain-of-thought, the correct answer is computed in layers 1–3 and then actively suppressed in the final layers to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). That complicates any tidy "facts low, reasoning high" map: the same vertical axis is being used for retrieval, transformation, *and* output shaping, and what a layer is 'for' depends on how the model was trained. A related caution: identical performance can hide radically different internal structures, so a layer-function map that holds for one model may not transfer to another (see "What actually happens inside the minds of language models?" and "What actually happens inside a language model?").
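
The layer-by-layer readout behind the hidden chain-of-thought result is a logit lens: each layer's intermediate state is projected through the output (unembedding) matrix to see which token it would predict if the model stopped there. The following is a minimal sketch on hand-built toy vectors, not the cited study's code. The names `logit_lens`, `layer_norm`, `hidden_states`, and `unembed` are illustrative assumptions, the layer norm has no learned parameters, and the numbers are chosen only to mimic the "computed early, suppressed late" shape.

```python
import math

def layer_norm(h, eps=1e-5):
    # Parameter-free layer norm over the hidden dimension (assumption: no learned scale or bias).
    mu = sum(h) / len(h)
    var = sum((v - mu) ** 2 for v in h) / len(h)
    return [(v - mu) / math.sqrt(var + eps) for v in h]

def logit_lens(hidden_states, unembed, target_id):
    """Project each layer's hidden state through the unembedding matrix.

    hidden_states: list of per-layer vectors (length d_model) at one token position.
    unembed: d_model x vocab matrix as a list of rows.
    Returns one (top_token_id, rank_of_target) pair per layer; rank 0 means top prediction.
    """
    vocab = len(unembed[0])
    results = []
    for h in hidden_states:
        normed = layer_norm(h)
        logits = [sum(normed[i] * unembed[i][t] for i in range(len(normed)))
                  for t in range(vocab)]
        order = sorted(range(vocab), key=lambda t: -logits[t])
        results.append((order[0], order.index(target_id)))
    return results

# Toy setup (all values hypothetical): token 2 is the "answer", token 4 is "filler".
D_MODEL, VOCAB = 6, 5
unembed = [[1.0 if i == t else 0.0 for t in range(VOCAB)] for i in range(D_MODEL)]
hidden_states = [
    [0.0, 0.0, 3.0, 0.0, 1.0, 0.0],  # early layer: answer direction dominates
    [0.0, 0.0, 2.0, 0.0, 1.5, 0.0],  # middle layer: answer still on top
    [0.0, 0.0, 1.0, 0.0, 3.0, 0.0],  # final layer: filler direction overwrites it
]
results = logit_lens(hidden_states, unembed, target_id=2)
for layer, (top, rank) in enumerate(results):
    print(f"layer {layer}: top token = {top}, rank of answer = {rank}")
```

In this toy run the answer token is the top prediction in the first two layers and drops to second place in the last one, which is the sense in which a suppressed answer stays recoverable from lower-ranked predictions. On a real model the hidden states would come from the model's own intermediate activations and its actual final normalization and unembedding weights.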

The deeper 'why' may be that this layering is the network discovering modularity on its own. Pruning experiments show neural nets naturally decompose tasks into isolated subnetworks, and pretraining makes that modular structure more consistent and reliable (see "Do neural networks naturally learn modular compositional structure?"). Separating storage from manipulation is exactly the kind of reusable structure that a compositional system would converge on: a fact retriever you can call from many different procedures is more efficient than re-deriving facts inside every reasoning path.
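
The ablation logic behind that modularity claim can be sketched on a toy network: zero out a group of hidden units and measure how much each output function changes. The example below is a hand-wired illustration under strong assumptions, not the pruning procedure from the cited work. The weights are deliberately built so that hidden units 0 and 1 feed only output A and units 2 and 3 feed only output B, and the names `forward` and `ablation_effect` are hypothetical.

```python
def forward(x, w_in, w_out, ablated=()):
    # One hidden ReLU layer; w_in is input_dim x hidden, w_out is hidden x output_dim.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in zip(*w_in)]
    # Ablation: force the selected hidden units to zero.
    hidden = [0.0 if j in ablated else h for j, h in enumerate(hidden)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w_out)]

def ablation_effect(inputs, w_in, w_out, units):
    """Mean absolute change in each output when `units` are zeroed."""
    units = set(units)
    n_out = len(w_out[0])
    totals = [0.0] * n_out
    for x in inputs:
        base = forward(x, w_in, w_out)
        cut = forward(x, w_in, w_out, units)
        for k in range(n_out):
            totals[k] += abs(base[k] - cut[k])
    return [t / len(inputs) for t in totals]

# Hand-built modular weights (hypothetical): two disjoint subnetworks.
w_in = [[1, 1, 0, 0],
        [0, 0, 1, 1]]
w_out = [[1, 0],
         [1, 0],
         [0, 1],
         [0, 1]]
inputs = [[1, 2], [3, 1]]

effect_a = ablation_effect(inputs, w_in, w_out, units=[0, 1])
effect_b = ablation_effect(inputs, w_in, w_out, units=[2, 3])
print("ablate units 0-1:", effect_a)
print("ablate units 2-3:", effect_b)
```

Each ablation moves only its own output and leaves the other untouched, which is the signature of an isolated subnetwork. In a trained model the subnetworks are not known in advance, so the masks would have to be discovered (for example by pruning) rather than written by hand.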

If you want to go one step laterally: this internal split echoes how brains and hybrid AI systems are organized. One framing maps transformer weights to a 'neocortex' of consolidated knowledge, retrieval systems to hippocampal indexing, and agentic state to prefrontal control (see "Can brain memory systems explain how LLMs should store knowledge?"). The recurring lesson across all of these is the same one that makes the original finding matter: knowing *where* a model keeps a capability tells you what you'll break when you train on top of it.

Sources (7 notes)

Why does reasoning training help math but hurt medical tasks?

Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

What actually happens inside the minds of language models?

LLMs can achieve identical accuracy while maintaining radically different internal representations, and mechanisms that appear interpretable may not causally drive outputs. This decoupling means performance metrics alone mask crucial differences in how models actually work.

What actually happens inside a language model?

Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can brain memory systems explain how LLMs should store knowledge?

Research shows transformer weights function as a distributed neocortex for consolidated knowledge, RAG stores as hippocampal indexing for rapid encoding, and agentic state as prefrontal executive control. The CLS framework predicts why hybrid systems outperform single-tier approaches and identifies missing consolidation mechanisms that prevent memory integration.
