Why do medical and mathematical tasks require fundamentally different model capabilities?

This explores why medicine and math pull on different machinery inside a model — one rewards stored knowledge, the other rewards reasoning — and what that split means for how you'd train each.

The corpus has a surprisingly clean answer: the two domains fail for opposite reasons. Math is *reasoning-dominant*: performance rises when the model gets better at working through steps. Medicine is *knowledge-dominant*: performance hinges on whether the model actually knows the fact, not on how cleverly it reasons. The KI/InfoGain framework makes this explicit: medical accuracy correlates more strongly with knowledge correctness than with reasoning quality, while math shows the inverse pattern (see "Does medical AI need knowledge or reasoning more?"). The blunt consequence is that pouring reasoning training into a medical model doesn't help. R1-distilled reasoning models fail to beat their base versions on medical tasks, because no amount of step-by-step thinking conjures a fact the model never absorbed (see "Why doesn't mathematical reasoning transfer to medicine?").

What makes this more than a benchmarking quirk is that the split appears to be *physically located* in the network. One line of work argues knowledge retrieval happens in the lower layers while reasoning adjustment happens in the higher layers: two phases, two regions (see "Why does reasoning training help math but hurt medical tasks?"). That layering is the proposed mechanistic reason reasoning-focused training can *improve* math while it can *degrade* medicine: you're tuning the upper layers and leaving the knowledge floor untouched, or even disturbing it. So 'different capabilities' isn't a metaphor. It's closer to different tenants in different parts of the building.

The asymmetry shows up vividly from the math side too. In RLVR, a single training example can lift a model's math accuracy from 36% to 73.6%, and test accuracy keeps improving for 1,400 steps after training accuracy reaches 100% (see "Can a single training example unlock mathematical reasoning?"). That only works because the reasoning capability was *latent*: already present, waiting to be activated. Knowledge has no equivalent trick. You can't 'activate' a clinical fact the model was never exposed to, which is why models stay confidently wrong in specialized domains: low accuracy paired with high confidence, and prompting techniques that improve general performance fail to reduce that overconfidence (see "Why do language models fail confidently in specialized domains?").
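To make "low accuracy paired with high confidence" concrete, here is a minimal sketch of one way to put a number on it: mean stated confidence minus accuracy. The function name `overconfidence_gap` and the sample values are hypothetical illustrations, not taken from the cited note, and this is not necessarily the metric that work used.

```python
from statistics import mean


def overconfidence_gap(confidences, correct):
    """Return mean confidence minus accuracy.

    A positive value means the model claims more certainty than its
    answers earn. Hypothetical helper for illustration only.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    accuracy = mean(1.0 if is_right else 0.0 for is_right in correct)
    return mean(confidences) - accuracy


# Made-up example: four clinical answers, each stated with high confidence,
# only one of which is actually right.
confidences = [0.9, 0.95, 0.85, 0.9]
correct = [True, False, False, False]
print(round(overconfidence_gap(confidences, correct), 2))
```

A gap near zero would indicate a model whose confidence tracks its hit rate. The pattern described above corresponds to a large positive gap that prompting alone doesn't shrink.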

The deeper lesson, the thing you might not have come looking for, is that 'capability' isn't one dial. The field's instinct has been to treat reasoning as a general-purpose engine that lifts every domain, but the corpus keeps finding that reasoning is a *separable, transferable* skill that rides on top of domain knowledge rather than substituting for it. Work separating a 'decomposer' from a 'solver' finds the decomposition ability transfers across domains while the solving ability does not (see "Does separating planning from execution improve reasoning accuracy?"), and other work shows reasoning models 'wander' unsystematically when problems get deep (see "Why do reasoning LLMs fail at deeper problem solving?"). Put together: math exposes whether your reasoning is *systematic*, medicine exposes whether your knowledge is *correct*, and a model can be excellent at one while hollow at the other. Which is really an argument that there's no such thing as a single, domain-agnostic 'smartness', only a stack of distinct competencies that you have to train, and locate, separately.
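The note on deeper problem solving (summarized in the sources) says success probability drops exponentially with problem depth. A rough way to see why that is so punishing is the sketch below. It assumes each reasoning step succeeds independently with the same probability, which is a simplification introduced here for illustration and not a claim from that work. The name `success_probability` is hypothetical.

```python
def success_probability(step_success: float, depth: int) -> float:
    """Chance of getting every one of `depth` steps right.

    Assumes independent steps with identical per-step success rate,
    an illustrative simplification rather than a measured model.
    """
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return step_success ** depth


# With a 90% per-step success rate, watch the whole-chain odds collapse.
for depth in (1, 5, 10, 20, 40):
    print(depth, round(success_probability(0.9, depth), 3))
```

Under these assumptions a model that is right 90% of the time per step solves a 10-step problem about 35% of the time and a 20-step problem about 12% of the time, which is the shape of "medium problems solvable, deep problems catastrophically harder".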

Sources 7 notes

Does medical AI need knowledge or reasoning more?

The KI/InfoGain framework reveals that medical domain accuracy correlates more strongly with knowledge correctness than reasoning quality, while mathematical domains show the inverse pattern. This distinction has direct implications for which training strategies to prioritize in each domain.

Why doesn't mathematical reasoning transfer to medicine?

R1-distilled reasoning models fail to outperform base models on medical tasks because knowledge accuracy matters more than reasoning quality in medicine—the opposite of math. Fine-tuning cannot close this gap without domain-specific training data.

Why does reasoning training help math but hurt medical tasks?

A two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Why do language models fail confidently in specialized domains?

LLMs trained on general text lack sufficient exposure to domain-specific examples, leading to low accuracy paired with high confidence in clinical NLI tasks. Prompting techniques that improved general performance fail to reduce overconfidence in specialized domains.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
