Why doesn't mathematical reasoning transfer to medicine?
Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This note explores whether reasoning ability is truly domain-agnostic or is constrained by domain-specific knowledge requirements.
The assumption behind porting reasoning-capable models to specialized domains is that reasoning ability transfers: a model trained to reason well about mathematics can be steered toward medical reasoning through fine-tuning. The Knowledge or Reasoning paper falsifies this assumption and identifies the specific mechanism behind the failure.
R1-distilled models — fine-tuned variants of strong base models specifically trained to produce long reasoning chains — do not outperform base models on medical benchmarks when evaluated with domain-specific metrics (KI/InfoGain). The general reasoning capabilities that make R1-distilled models effective on mathematical tasks do not transfer to the medical domain via either SFT or RL. The limiting factor is domain knowledge, not reasoning architecture.
The mechanism is clarified by the KI/InfoGain framework. In medical tasks, knowledge accuracy (KI) correlates more strongly with final accuracy than reasoning step informativeness (InfoGain) across four of five benchmarks. Mathematical reasoning has the inverse pattern: reasoning quality matters more than factual knowledge retrieval. These are different competency regimes. A model optimized for one regime cannot import its advantages to the other.
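A minimal sketch of the kind of per-benchmark analysis this framework implies, assuming per-example KI, InfoGain, and final-accuracy scores are already available. The variable names, the benchmark label, and all numbers below are illustrative placeholders, not the paper's definitions or data.

```python
# Illustrative sketch only: the paper's exact KI / InfoGain definitions are not
# reproduced here. This shows the kind of correlation comparison the framework
# implies, given hypothetical per-example scores for one benchmark.
from statistics import correlation  # Pearson correlation, Python 3.10+

# Hypothetical per-example scores (placeholder values, not real data):
#   ki        -> correctness of the domain facts used in the trace (0..1)
#   info_gain -> how informative each reasoning step is toward the answer (0..1)
#   accuracy  -> final-answer correctness (0 or 1)
BENCHMARKS = {
    "medical_qa_example": {
        "ki":        [0.9, 0.4, 0.8, 0.3, 0.7],
        "info_gain": [0.6, 0.5, 0.4, 0.6, 0.5],
        "accuracy":  [1,   0,   1,   0,   1],
    },
}

for name, scores in BENCHMARKS.items():
    acc = [float(a) for a in scores["accuracy"]]
    r_ki = correlation(scores["ki"], acc)
    r_ig = correlation(scores["info_gain"], acc)
    # Knowledge-bound domains: expect corr(KI, acc) > corr(InfoGain, acc).
    # Reasoning-bound domains (math): expect the ordering to flip.
    print(f"{name}: corr(KI, acc)={r_ki:.2f}  corr(InfoGain, acc)={r_ig:.2f}")
```

Under the paper's finding, medical benchmarks would mostly show the KI correlation dominating, while mathematical benchmarks would show InfoGain dominating.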
This is distinct from "Can non-reasoning models catch up with more compute?", which is about inference-compute regime differences within the same training framework. That finding says you can't close the gap by adding more inference-time compute. This finding says you can't close the gap by fine-tuning either: the gap is in the underlying domain knowledge, which fine-tuning on the wrong type of reasoning traces cannot supply.
The practical implication for domain AI deployment: a strong general reasoning model is not a substitute for domain-specific training data. In knowledge-intensive domains, the ceiling is what the model knows, not how it reasons. Systems that assume general reasoning strength translates to domain-specific reliability will be overconfident about their actual performance. "Does supervised fine-tuning actually improve reasoning quality?" adds that even when SFT improves accuracy on domain tasks, the quality of the reasoning may degrade, compounding the problem.
Source: Domain Specialization
Related concepts in this collection
- Can non-reasoning models catch up with more compute?
  Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
  Relation: that note describes a compute-regime gap; this note describes a knowledge-regime gap, driven by a different mechanism.
- Does medical AI need knowledge or reasoning more?
  Medical and mathematical domains may require fundamentally different AI training priorities. If medical accuracy depends primarily on factual knowledge while math depends on reasoning quality, should we build and evaluate these systems differently?
  Relation: why transfer fails; the two domains require different model investments.
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  Relation: SFT improves accuracy but doesn't solve the underlying knowledge problem.
- Does RL improve domain reasoning by adding knowledge or removing it?
  When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
  Relation: RL corrects reasoning paths but can't substitute for missing domain knowledge.
- Can models learn reasoning from predicting text alone?
  Can language models bootstrap general reasoning abilities by generating explanations at every token position during pretraining, without task-specific supervision? This explores whether reasoning emerges naturally from optimizing predictive accuracy.
  Relation: extends this note; even Quiet-STaR's token-level general reasoning is bounded by training corpus diversity, and this note explains the harder ceiling once deployment reaches knowledge-intensive domains.
Original note title: general reasoning does not transfer to knowledge-intensive domains via SFT due to domain knowledge gaps