INQUIRING LINE

Why does contextual judgment matter more in law and medicine than in mathematics?

This explores why some domains lean on contextual knowledge and human judgment while math leans on portable reasoning — and what the corpus reveals about that split inside AI systems.


This reads the question as asking about a real divide the corpus keeps circling: math is a reasoning-dominant domain, while law and medicine are knowledge-dominant ones where the right answer depends on facts, context, and authority that can't be derived from first principles. The clearest evidence comes from work showing that medical accuracy correlates far more with whether a model *knows* the right thing than with how well it reasons, while mathematical performance shows the inverse: better reasoning, better answers (Does medical AI need knowledge or reasoning more?). This is why training a model to reason harder helps it on math but can actively *degrade* it on medicine: pushing the higher network layers that handle reasoning can disturb the lower layers where factual knowledge lives (Why does reasoning training help math but hurt medical tasks?).

The deeper reason contextual judgment matters in law and medicine is that reasoning skill doesn't transfer the way you'd hope. A model distilled to be a strong mathematical reasoner fails to beat a plain base model on medical tasks, because no amount of clean inference closes a gap that is really about missing domain-specific knowledge (Why doesn't mathematical reasoning transfer to medicine?). Math is self-contained: the chain of steps validates itself. Medicine and law are not: the correct move depends on particulars the reasoner has to already hold, and on judgment about which facts apply here, now, to this case.

What makes this more than a training-data story is *how* contested-domain expertise actually gets settled. In human practice, law and medicine resolve hard questions through argument quality, social authority, cultural context, and interpersonal trust, not through probability. AI systems instead settle disagreements by ranking chain-of-thought likelihoods, and in exactly the contested domains where human judgment matters most, that mismatch amplifies errors rather than correcting them (How do LLM debates differ from human expert consensus?). The thing math doesn't need (a social, contextual arbiter) is the thing law and medicine run on.

There's a further layer worth knowing: even where models look competent socially, they master the statistics of norms while missing actual participation and culturally-resonant interpretation (Why do AI systems fail at social and cultural interpretation?). Contextual judgment isn't just "more facts"; it's situated meaning-making, the capacity to read what a situation calls for. And reasoning itself may be more about *form* than genuine inference: illogical chain-of-thought exemplars perform nearly as well as valid ones, suggesting models learn the shape of reasoning rather than the substance (Does logical validity actually drive chain-of-thought gains?). In math, that imitation of form is often enough to land the answer. In law and medicine, the form without the situated knowledge and the authority to judge is precisely where it breaks.

The quietly surprising takeaway: the math-vs-medicine gap isn't about difficulty. Medicine isn't "harder reasoning" — it's a different *kind* of competence, one where knowing and judging-in-context outrank deriving, and where the engine that makes AI good at math is the same engine that can make it worse at the things humans most want judgment for.


Sources (6 notes)

Does medical AI need knowledge or reasoning more?

The KI/InfoGain framework reveals that medical domain accuracy correlates more strongly with knowledge correctness than reasoning quality, while mathematical domains show the inverse pattern. This distinction has direct implications for which training strategies to prioritize in each domain.

Why does reasoning training help math but hurt medical tasks?

Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Why doesn't mathematical reasoning transfer to medicine?

R1-distilled reasoning models fail to outperform base models on medical tasks because knowledge accuracy matters more than reasoning quality in medicine—the opposite of math. Fine-tuning cannot close this gap without domain-specific training data.

How do LLM debates differ from human expert consensus?

Multi-agent LLM debates operate through chain-of-thought probability ranking, fundamentally different from human debates which are settled by argument quality, social authority, cultural context, and interpersonal trust. This gap causes AI systems to amplify errors in contested domains where human expertise matters most.

Why do AI systems fail at social and cultural interpretation?

LLMs achieve 100th-percentile performance on norm prediction yet regress on theory-of-mind tasks and cannot generate culturally-resonant interpretations. The pattern shows that statistical competence coexists with absence of actual social understanding and participation.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about domain-specific LLM competence. The precise question: *Why does contextual judgment matter more in law and medicine than in mathematics?* A curated library (2023–2025) has proposed a tension between knowledge-dominant and reasoning-dominant domains. Treat those findings as dated constraints; your job is to surface whether newer work has dissolved or confirmed them.

What a curated library found — and when (dated claims, not current truth):
• Medical accuracy correlates far more with whether a model *knows* the right answer than with reasoning skill; pushing reasoning can degrade medical performance by disturbing lower knowledge layers (2025-06, 2025-07).
• General reasoning does not transfer to knowledge-intensive domains via supervised fine-tuning because reasoning is a form-learning signal disconnected from situated domain knowledge (2024–2025).
• Logically invalid chain-of-thought prompts perform nearly as well as valid ones, suggesting models learn the *shape* of reasoning rather than genuine inference (2023-07).
• AI debate systems rank disagreements by chain-of-thought likelihood, whereas human experts in contested domains rely on social authority, cultural context, and situated judgment — a mismatch that amplifies error in law and medicine (2025-07).
• Models master statistics of social norms while missing actual participation and culturally-resonant interpretation; contextual judgment is situated meaning-making, not just "more facts" (2025-08).

Anchor papers (verify; mind their dates):
• arXiv:2506.02126 — Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains (2025-06)
• arXiv:2507.18178 — Decoupling Knowledge and Reasoning in LLMs: An Exploration Using Cognitive Dual-System Theory (2025-07)
• arXiv:2307.10573 — Invalid Logic, Equivalent Gains (2023-07)
• arXiv:2508.19004 — AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms (2025-08)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, ask: have newer training methods (RL from human feedback, process reward models, multi-step verification), orchestration (multi-agent debate, fact-checking harnesses, retrieval-augmented generation), or evaluations (long-context medical benchmarks, adversarial law cases) since relaxed or inverted these limits? Separate the durable question (likely still open: *what is contextual judgment?*) from the perishable claim (e.g., "reasoning doesn't transfer": does it, with better RL?). Cite what resolved it, or say plainly where the constraint still holds.

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. The library's tension (knowledge vs. reasoning) is itself contested; which papers undercut or reframe it?

(3) **Propose 2 research questions that ASSUME the regime may have moved.** If RL, process rewards, or multi-agent orchestration have begun to *couple* reasoning and domain knowledge, what becomes the next frontier? If models *do* learn situated judgment, how?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
