Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
Large language models (LLMs) achieve strong final-answer accuracy on reasoning benchmarks; however, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing thinking trajectories into two parts: knowledge and reasoning. Specifically, we introduce a fine-grained evaluation framework that judges (1) the correctness of the knowledge used, measured by a Knowledge Index (KI), and (2) the quality of the reasoning, measured by Information Gain (InfoGain). Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains. Three intriguing findings emerge: (1) the general reasoning abilities of R1-distilled models do not transfer effectively to the medical domain through either SFT or RL; (2) SFT raises final-answer accuracy in both domains, but often at the cost of reasoning quality, with InfoGain dropping by 38.9% on average relative to untrained models; in the medical domain, however, SFT remains crucial because domain knowledge is indispensable; (3) RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, thereby improving both reasoning accuracy and knowledge correctness.
Evaluations that consider only final-answer accuracy obscure the step-by-step process by which models reason and offer little insight into the interplay between factual knowledge and logical inference that underlies these capabilities.
Earlier work [14] evaluates reasoning based on its embedding similarity to the original question, assuming that higher similarity implies greater informativeness and faithfulness. However, LLMs often rely on internal knowledge or prior deductions, making alignment with the question an unreliable measure of knowledge accuracy or reasoning quality. As shown in Tab. 2 in the Appendix, existing reasoning metrics yield similar scores across models with clearly different capabilities, suggesting they are unreliable.
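For reference, this question-alignment baseline can be sketched in a few lines. The encoder checkpoint and step segmentation below are illustrative assumptions, not the exact setup of [14]:

```python
# Hedged sketch of the embedding-similarity metric attributed to [14]:
# score each reasoning step by its cosine similarity to the question.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

def question_alignment(question: str, steps: list[str]) -> list[float]:
    """Cosine similarity of each reasoning step to the original question."""
    q = encoder.encode(question, convert_to_tensor=True)
    s = encoder.encode(steps, convert_to_tensor=True)
    return util.cos_sim(s, q).squeeze(-1).tolist()
```

As the critique above suggests, a step can be highly informative while bearing little surface similarity to the question, so high alignment scores need not indicate correct knowledge or sound reasoning.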
Reasoning demands also differ across domains: mathematical problems often emphasize symbolic manipulation and internal consistency [5], whereas medical tasks typically require integrating domain-specific knowledge grounded in external facts [13]. Both domains involve multi-step reasoning, but they differ in how heavily they depend on stored knowledge versus the reasoning steps performed during generation. Understanding these differences is critical not only for building domain-adaptive models but also for advancing interpretability and reliability in high-stakes applications.
In this work, we pose a fundamental question: What are the respective roles of knowledge and reasoning in the thinking process of LLMs, and how do they interact across different domains? To answer this, we introduce an evaluation framework (see Fig. 2) that decomposes each reasoning step into two components: the factual knowledge it invokes and the logical reasoning operation it performs. We define two novel metrics to quantify reasoning and knowledge. (1) Information Gain (InfoGain) measures how much a reasoning step reduces uncertainty about the final answer, calculated as the gap in answer probability between adjacent response steps; a higher InfoGain indicates a more informative reasoning path toward the final answer. (2) The Knowledge Index (KI), in contrast, evaluates the factual correctness of each step: we identify the knowledge point invoked in each step and verify it against external ground-truth sources, so models with stronger knowledge grounding yield higher KI scores. A running example in Fig. 1 illustrates our motivation intuitively. This fine-grained evaluation allows us to characterize not just the model's final performance, but also the trajectory it takes to get there.
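Concretely, the two metrics admit a simple operationalization. The sketch below follows the informal definitions above; the Qwen checkpoint name and the `verify_knowledge` callable are illustrative assumptions rather than the authors' released implementation:

```python
# Minimal sketch of the two metrics, assuming the definitions above:
# InfoGain_t = P(answer | q, s_1..t) - P(answer | q, s_1..t-1), and
# KI = fraction of steps whose extracted knowledge a verifier confirms.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def answer_logprob(prefix: str, answer: str) -> float:
    """Log-probability the model assigns to `answer` given `prefix`."""
    n_prefix = tok(prefix, return_tensors="pt").input_ids.shape[1]
    full = tok(prefix + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logp = lm(full).logits.log_softmax(-1)
    # Sum next-token log-probs over the answer tokens only.
    return sum(logp[0, t - 1, full[0, t]].item()
               for t in range(n_prefix, full.shape[1]))

def info_gain(question: str, steps: list[str], answer: str) -> list[float]:
    """Change in P(answer | question, steps so far) at each step."""
    p = [math.exp(answer_logprob(question + "\n" + "\n".join(steps[:t]) + "\n",
                                 answer))
         for t in range(len(steps) + 1)]
    return [p[t + 1] - p[t] for t in range(len(steps))]

def knowledge_index(steps: list[str], verify_knowledge) -> float:
    """Fraction of steps whose knowledge point is confirmed against
    external ground truth (`verify_knowledge`: str -> bool, supplied
    by the caller, e.g., a retrieval-backed judge)."""
    return sum(verify_knowledge(s) for s in steps) / max(len(steps), 1)
```

In this sketch, InfoGain is computed per step from the model's own answer probability, while KI delegates factual verification to an external checker, mirroring the division between reasoning quality and knowledge correctness described above.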
Building upon this framework, we analyze models trained via supervised fine-tuning (SFT) and reinforcement learning (RL) across both the mathematical and medical domains. Our findings reveal several key insights: (1) mathematical reasoning does not naturally transfer to the medical domain via SFT, largely due to domain-specific knowledge gaps, as evidenced by the consistently lower performance of the DeepSeek-distilled model; (2) tasks across domains demand distinct model competencies: medical problems require richer domain knowledge, with knowledge–accuracy correlations exceeding reasoning–accuracy correlations on four of five benchmarks; (3) while SFT improves final accuracy and raises knowledge levels (e.g., a 6.2% average KI increase on medical tasks), it often introduces verbose or suboptimal reasoning, reducing InfoGain by 38.9% on average; and (4) RL mitigates such inefficiencies by reinforcing correct knowledge trajectories, boosting medical knowledge with an average KI gain of 12.4.