Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does supervised fine-tuning improve reasoning or just answers?

Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.

Note · 2026-02-21 · sourced from Domain Specialization
Related questions: How do you build domain expertise into general AI models? · How should we allocate compute budget at inference time? · How should researchers navigate LLM reasoning research?

Post angle for Medium / LinkedIn

Hook: "Every AI benchmark measures accuracy. What if accuracy is exactly the wrong thing to measure when deploying AI in high-stakes domains?"

The finding: The Knowledge or Reasoning paper introduces two new metrics: Knowledge Index (KI), the factual correctness of each reasoning step, and Information Gain (InfoGain), how much each step reduces uncertainty about the final answer. Applied to SFT-trained models on medical and mathematical tasks, the metrics show that SFT raises final-answer accuracy while cutting InfoGain by 38.9%. Models get more answers right while each reasoning step carries less information about how they got there.
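To make the definition concrete, here is one plausible way to operationalize InfoGain in code. This is a sketch under assumptions, not the paper's exact estimator: score_answer is a hypothetical helper that returns the model's log-probability of the ground-truth answer given a reasoning prefix.

```python
# Sketch: per-step InfoGain as the reduction in uncertainty about the
# final answer, assuming a hypothetical score_answer(prefix) helper that
# returns log p(answer | prefix) under the model being evaluated.
from typing import Callable, List

def info_gain_per_step(
    steps: List[str],
    score_answer: Callable[[str], float],  # log p(answer | reasoning prefix)
) -> List[float]:
    gains = []
    prev = score_answer("")  # model's confidence before any reasoning
    for t in range(len(steps)):
        prefix = "\n".join(steps[: t + 1])
        cur = score_answer(prefix)
        gains.append(cur - prev)  # how much step t moved the model toward the answer
        prev = cur
    return gains
```

On this reading, the reported 38.9% drop means the per-step gains shrink after SFT: the chain still ends at the right answer, but the steps along the way do less of the work.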

The mechanism: SFT rewards answers, not reasoning paths. The training data consists of question-answer pairs, and the loss function anchors on the correct final output. Models learn the most efficient path to the right answer in the training distribution: often domain-specific shortcuts, pattern matches, and frequency-weighted heuristics that produce the correct answer without the inferential chain that would justify it. The reasoning in the output becomes post-hoc rationalization.
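To see why, a minimal sketch of the standard SFT objective helps (assuming an ordinary causal-LM setup with teacher forcing; the paper's exact training recipe may differ). The loss is token-level cross-entropy against the reference text, so a shortcut that emits the right tokens is indistinguishable from a valid inferential chain:

```python
# Minimal sketch of the SFT objective under a standard causal-LM setup.
# The loss only measures next-token agreement with the reference output;
# nothing in it scores whether an intermediate reasoning step is factually
# correct or informative.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab). labels: (batch, seq), with prompt
    positions set to -100 so only the response tokens are supervised."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predict token t+1 from prefix
        labels[:, 1:].reshape(-1),                    # shifted targets
        ignore_index=-100,                            # mask unsupervised positions
    )
```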

Why this matters for deployment: High-stakes domains don't just need correct answers — they need auditable reasoning. Medical decision support must show clinical logic. Legal AI must demonstrate how conclusions follow from statute and precedent. Financial AI must show how recommendations connect to market data and regulatory context. SFT improves the answer, but may make the reasoning path less meaningful — more verbose decoration around the correct output than the pathway that produced it.

The measurement problem: Standard benchmarks measure what's easy to measure: whether the final answer matches the ground truth. InfoGain and KI require decomposing reasoning chains and evaluating each step against external ground truth — expensive and difficult to automate at scale. So the measurement gap persists, and every organization that deploys based on benchmark accuracy is systematically blind to the reasoning quality regression.
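A sketch of the KI definition shows where the expense lives. verify_fact below is a hypothetical oracle (retrieval plus an expert rubric, or a judge model); building and running it for every step of every chain is the part that resists automation at scale:

```python
# Sketch: Knowledge Index as the fraction of reasoning steps that check
# out against external ground truth. verify_fact is a hypothetical,
# expensive oracle; the loop itself is trivial, the oracle is not.
from typing import Callable, List

def knowledge_index(
    steps: List[str],
    verify_fact: Callable[[str], bool],  # True if the step is factually correct
) -> float:
    if not steps:
        return 0.0
    return sum(verify_fact(step) for step in steps) / len(steps)
```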

The connection: This extends the existing cluster of overthinking findings into the training dimension. "Does extended thinking actually improve reasoning or just increase variance?" covers the inference-time cost; "Does reasoning fine-tuning make models worse at declining to answer?" covers a training-time cost to calibration. The SFT accuracy trap is the third entry: a training-time cost to reasoning quality.

Platform: Medium (1000–1400 words). Could lead with the FALM / medical AI deployment angle, then introduce the measurement framework.


Original note title: "The SFT accuracy trap: training raises benchmark scores while degrading reasoning quality"