Why does reasoning training improve math but hurt knowledge tasks?

This explores why training a model to reason (math, logic, step-by-step problem solving) tends to lift math benchmarks while degrading performance on knowledge-heavy tasks like medicine or factual recall — and what in the model's architecture makes that a tradeoff rather than a free lunch.

The cleanest explanation in the corpus is architectural: knowledge and reasoning don't live in the same place inside the model. One analysis finds that factual retrieval happens in the *lower* layers of the network while reasoning adjustments happen in the *higher* layers (see "Why does reasoning training help math but hurt medical tasks?"). When you train hard on reasoning, you're reshaping the upper-layer machinery, which is exactly what math benefits from, but that same pressure can disturb the lower-layer retrieval that knowledge-intensive domains like medicine depend on. Improving one region doesn't automatically preserve the other.
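A minimal sketch of how one might check where fine-tuning moved a model, assuming you can read out per-layer weights before and after training. This is not the analysis from the cited note: the function name `layer_drift`, the flat-list weight format, and all numbers are hypothetical, chosen only to illustrate the idea of comparing change in lower versus upper layers.

```python
import math

def layer_drift(base, tuned):
    """Relative L2 change per layer: ||tuned - base|| / ||base||."""
    drift = []
    for w0, w1 in zip(base, tuned):
        diff = math.sqrt(sum((b - a) ** 2 for a, b in zip(w0, w1)))
        norm = math.sqrt(sum(a ** 2 for a in w0))
        drift.append(diff / norm)
    return drift

# Toy 4-layer "model": each layer is a flat list of weights (made-up values).
# Index 0 is the lowest layer, index 3 the highest.
base = [[1.0, 2.0], [2.0, 1.0], [1.0, 1.0], [3.0, 4.0]]
tuned = [[1.0, 2.1], [2.0, 1.0], [1.5, 0.5], [0.0, 4.0]]

drift = layer_drift(base, tuned)
lower = sum(drift[:2]) / 2   # mean drift in the lower half
upper = sum(drift[2:]) / 2   # mean drift in the upper half
print(f"lower-half drift: {lower:.3f}, upper-half drift: {upper:.3f}")
```

In this toy data the upper layers change far more than the lower ones, but the lowest layer still moves a little. That small non-zero drift is the kind of side effect the paragraph above describes: training aimed at the upper layers does not guarantee the lower layers stay put.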

The difference between the two goes deeper than layers. Reasoning and factual recall draw on different kinds of learned material. A study of five million pretraining documents found that reasoning generalizes from *broad, transferable procedural knowledge* (how-to patterns spread across many sources), whereas factual recall depends on *narrow, document-specific memorization* of the exact target fact (see "Does procedural knowledge drive reasoning more than factual retrieval?"). So reasoning training reinforces a wide, reusable skill, while knowledge tasks hinge on fragile, pinpoint memories that the same training can overwrite or crowd out. Math improves because it rides the transferable substrate; knowledge suffers because its substrate is exactly the kind that doesn't transfer and is easily perturbed.

There's also a subtler trap worth knowing about: the math gains themselves may be partly illusory in how they're measured. Supervised fine-tuning can raise final-answer accuracy on benchmarks while *cutting Information Gain, the step-quality measure used in that analysis, by 38.9%*. The model learns to produce correct answers through post-hoc rationalization rather than genuine inference, and standard metrics miss it because they only score the final answer (see "Does supervised fine-tuning improve reasoning or just answers?"). So part of what looks like 'reasoning improved, knowledge declined' is a measurement artifact: you're rewarding answer-shaped outputs on math while the underlying competence shifts in ways the benchmark can't see.
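The gap between answer-level and step-level scoring can be sketched with toy data. This is an illustration only: the `Trace` structure, the per-step validity labels, and the numbers are all hypothetical, and `step_validity_rate` is a simple stand-in, not the Information Gain metric from the cited note.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trace:
    steps_valid: List[bool]  # hypothetical label: is each reasoning step sound?
    answer_correct: bool     # what a standard benchmark scores

def final_answer_accuracy(traces):
    """Fraction of traces whose final answer is correct."""
    return sum(t.answer_correct for t in traces) / len(traces)

def step_validity_rate(traces):
    """Fraction of all reasoning steps labeled as valid."""
    total = sum(len(t.steps_valid) for t in traces)
    valid = sum(sum(t.steps_valid) for t in traces)
    return valid / total

# Made-up traces from a model before and after fine-tuning.
before = [
    Trace([True, True, True], True),
    Trace([True, True, True], True),
    Trace([True, True, False], False),
    Trace([True, False, True], False),
]
after = [
    Trace([True, False, False], True),   # right answer, weak reasoning
    Trace([False, True, False], True),
    Trace([True, True, True], True),
    Trace([False, False, True], False),
]

for name, traces in [("before", before), ("after", after)]:
    acc = final_answer_accuracy(traces)
    steps = step_validity_rate(traces)
    print(f"{name}: answer accuracy={acc:.2f}, step validity={steps:.2f}")
```

In this toy set, answer accuracy rises while step validity falls, so a leaderboard that reports only the first number would show an improvement and hide the decline in the second.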

A reframe that makes the tradeoff feel less mysterious: post-training mostly *selects* reasoning that the base model already had rather than installing new capability. Multiple independent methods all elicit latent reasoning already present in base activations (see "Do base models already contain hidden reasoning ability?"), and even a single RLVR training example can unlock a jump from 36% to 73.6% on math (see "Can a single training example unlock mathematical reasoning?"). If training is largely steering the model toward one mode of operating, it makes sense that it would amplify the reasoning pathway at the expense of the retrieval pathway: you're tilting the model, not enlarging it.

Finally, the win is narrower than it looks. Chain-of-thought reasoning is distribution-bounded: it degrades predictably once you leave the training distribution, producing fluent but logically inconsistent steps (see "Does chain-of-thought reasoning actually generalize beyond training data?"). And the gains don't even transfer across *kinds* of reasoning: models tuned for formal reasoning fail to improve on social/theory-of-mind tasks, which seem to need a different cognitive architecture entirely (see "Why do reasoning models struggle with theory of mind tasks?"). So 'reasoning training helps math but hurts knowledge' is one instance of a broader pattern: optimizing one competence reliably comes at the cost of others the same model used to hold.

Sources (7 notes)

Why does reasoning training help math but hurt medical tasks?

Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Why do reasoning models struggle with theory of mind tasks?

Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.
