Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications

Paper · arXiv 2508.00669 · Published August 1, 2025
Domain Specialization · Discourses · Work Application Use Cases · Deep Research · Reinforcement Learning

We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engineering, multi-agent systems). We analyze how these techniques are applied across different data modalities (text, image, code) and in key clinical applications such as diagnosis, education, and treatment planning.

A key breakthrough was Chain-of-Thought (CoT) prompting, which demonstrated that by instructing a model to generate step-by-step reasoning, its latent inferential capabilities could be elicited, significantly improving performance on logical tasks (Wei et al., 2023; Nachane et al., 2024).
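
As an illustration, a CoT prompt can be as simple as appending a step-by-step instruction to the question. The template and `build_cot_prompt` helper below are a generic sketch, not the exact prompts used in the cited studies.

```python
# Minimal sketch of CoT prompting; the template and helper are illustrative,
# not the exact prompts used in the cited studies.
COT_TEMPLATE = """You are a careful clinical assistant.
Question: {question}
Let's think step by step, stating each inference explicitly,
then give the final answer on a line starting with "Answer:"."""

def build_cot_prompt(question: str) -> str:
    """Wrap a question in a step-by-step reasoning instruction."""
    return COT_TEMPLATE.format(question=question)

print(build_cot_prompt(
    "A 54-year-old presents with crushing chest pain radiating to the "
    "left arm. What is the most likely diagnosis?"
))
```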

Building on this insight, the focus has shifted from prompting techniques to architecting models where reasoning is a primary design objective. State-of-the-art models, such as OpenAI’s o1, now integrate supervised fine-tuning on explicit reasoning traces and reinforcement learning from human feedback (RLHF) to reward logically sound processes (OpenAI et al., 2024; Pan et al., 2025b).

Recent advances have demonstrated that structured clinical reasoning approaches significantly enhance diagnostic accuracy (Sonoda et al., 2025). Therefore, Reasoning LLMs are defined not just by their performance but by their designed capacity for transparent inference—a critical feature for their safe application in high-stakes domains (Wu et al., 2025).

3.1 Reasoning over Text

Textual data, found in clinical notes, dialogues, and medical literature, is information-dense but lacks inherent logical structure. The primary challenge is to guide the model’s generative process along a factually correct and clinically valid inferential path. Research has coalesced around three main strategies. First, to impose structure, models are trained to make their reasoning explicit. Techniques like explicit path generation guide models to produce step-by-step rationales, either by aligning with clinical inference patterns (Xu et al., 2024) or by grounding each step in a structured knowledge graph (Wu et al., 2025). Second, to ensure the validity of these paths, researchers focus on enforcing logical consistency. This is achieved by incorporating formal methods like first-order logic (FOL) to verify claims (Zafar et al., 2025) or by using reinforcement learning to reward factual correctness and penalize hallucinations (Zhang et al., 2025a; Liu et al., 2025b). Third, to move beyond single, linear paths, other work explores deepening and broadening inference. This includes test-time scaling (TTS) to allocate more computation for deeper reasoning on a single problem (Huang et al., 2025a; Yu et al., 2025; Shi et al., 2024), and multi-agent systems that simulate collaborative debate or dialogue to explore diverse perspectives and build a more robust, explainable consensus (Hong et al., 2024; Tang et al., 2024; Zhu and Wu, 2025).
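
To make the logical-consistency idea (the second strategy above) concrete, the toy sketch below approximates FOL-style verification with propositional Horn-clause forward chaining: a claim from a generated reasoning chain is accepted only if it can be derived from known facts and rules. The rule base and facts are illustrative assumptions, not the method of the cited work.

```python
# Toy consistency checker: propositional Horn-clause forward chaining as a
# stand-in for FOL verification. Rules and facts are illustrative only.
RULES = [
    # (premises, conclusion): if all premises are derived, so is the conclusion
    ({"fever", "productive_cough", "focal_consolidation"}, "pneumonia"),
    ({"pneumonia"}, "antibiotics_indicated"),
]

def entailed(facts: set, claim: str) -> bool:
    """Forward-chain to a fixpoint, then test whether the claim was derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return claim in derived

# A reasoning step claiming "antibiotics_indicated" checks out against the rules.
print(entailed({"fever", "productive_cough", "focal_consolidation"},
               "antibiotics_indicated"))  # True
```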

4.1 Training-time Techniques: Building the Foundation

Training-time methods are high-cost, high-impact interventions that aim to bake clinical logic directly into the model’s parameters. They represent the "heavy lifting" of creating a domain-specialized reasoner but face significant challenges related to data scalability.

4.1.1 Supervised Fine-tuning (SFT)

SFT marks a crucial epistemological shift from learning mere correlations to learning clinical processes. By training on data containing explicit reasoning chains, the model is forced to learn the "how" and "why" of a diagnosis. The innovation lies in the design of this data and the training strategy.

Multi-stage Fine-tuning: This approach applies the principle of curriculum learning, recognizing that complex clinical reasoning cannot be learned monolithically. The core idea is to break the skill into a sequence of manageable stages. We identify two primary strategies for this. The first, staging by task abstraction, is common for conceptual reasoning. It builds a hierarchy from concrete knowledge to abstract inference. FineMedLM-o1 (Yu et al., 2025) exemplifies this by first training on factual medical knowledge, then on interactive dialogues, and finally on complex causal reasoning. The second strategy, staging by modality integration, is crucial for multimodal tasks. It builds skills from perception to interpretation. AOR (Li et al., 2025c) is a canonical example, sequentially training the model to first recognize anatomical structures, then ground them to linguistic terms, and finally synthesize this information into a diagnostic conclusion.
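
In code, a staged curriculum reduces to iterating fine-tuning over ordered datasets while carrying the weights forward. The sketch below uses hypothetical `load_stage_data` and `fine_tune` stubs standing in for a real data pipeline and trainer; the stage names mirror the FineMedLM-o1 progression described above, while the file paths and epoch counts are assumptions.

```python
# Illustrative multi-stage SFT curriculum mirroring the knowledge ->
# dialogue -> reasoning progression; loaders and trainer are hypothetical stubs.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data_path: str
    epochs: int

CURRICULUM = [
    Stage("factual_knowledge", "data/medical_qa.jsonl", epochs=2),
    Stage("interactive_dialogue", "data/clinical_dialogues.jsonl", epochs=1),
    Stage("causal_reasoning", "data/reasoning_chains.jsonl", epochs=1),
]

def load_stage_data(path: str) -> list:
    # Placeholder: in practice, load (prompt, target) pairs from the file.
    return []

def fine_tune(model, dataset, epochs: int):
    # Placeholder: in practice, run an SFT loop (e.g., a Hugging Face Trainer).
    return model

def run_curriculum(model):
    """Fine-tune sequentially, carrying the weights from stage to stage."""
    for stage in CURRICULUM:
        model = fine_tune(model, load_stage_data(stage.data_path), stage.epochs)
    return model
```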

Chain-Aware Fine-tuning: This paradigm’s challenge is the ’supervision bottleneck’—the cost and difficulty of obtaining high-quality reasoning chains. Researchers have developed four distinct strategies to address this. The gold standard is to use human expert annotation, where clinicians provide the consensus and rationale for reasoning paths, as done for Med-PaLM 2 (Singhal et al., 2025). While authoritative, this is not scalable. To overcome this, the most common strategy is using AI-generated chains. Frameworks like HuatuoGPT-o1 (Chen et al., 2024a) leverage powerful teacher models (e.g., GPT-4) in a sophisticated cycle of generation, verification, and self-correction to create vast datasets at scale, though this risks inheriting the teacher’s biases. A third, more verifiable approach is to impose external structure. MedReason (Wu et al., 2025), for example, constrains generation by forcing the reasoning path to be a valid traversal of a medical knowledge graph, making each step auditable. Finally, some methods focus on refining existing data. This includes data-centric approaches like BioMed-R1 (Thapa et al., 2025), which filters multiple benchmarks to curate a dataset of only the most reasoning-intensive samples, and stylistic approaches like EMULATION (Xu et al., 2024), which fine-tunes the model to ensure its reasoning style authentically mimics the abductive and deductive thought processes of clinicians.
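
The AI-generated-chains strategy can be sketched as a generate-verify-correct loop around a teacher model. The `teacher` callable and the suffix-match `verify_answer` check below are simplifying assumptions, not the HuatuoGPT-o1 pipeline.

```python
# Generate-verify-correct loop for chain-aware data creation; `teacher` is
# any callable LLM, and the verifier is a simplistic stand-in.
def verify_answer(chain: str, gold: str) -> bool:
    # Placeholder check: does the chain end with the gold answer?
    return chain.strip().lower().endswith(gold.strip().lower())

def generate_chain(teacher, question: str, gold_answer: str, max_retries: int = 3):
    chain = teacher(f"Reason step by step, then answer: {question}")
    for _ in range(max_retries):
        if verify_answer(chain, gold_answer):
            return chain  # keep only chains that reach the verified answer
        chain = teacher(
            f"Your previous reasoning was wrong:\n{chain}\n"
            f"Question: {question}\nFind the error and try again."
        )
    return None  # discard samples that never verify
```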

Position-Aware Fine-tuning: For multimodal reasoning to be clinically useful, a diagnostic claim must be grounded to a specific visual location. This technique directly tackles this critical "grounding problem" by training models on data that enforces spatial correspondence. The strategies vary by the granularity of the spatial information provided. The most foundational approach uses coarse-grained grounding with bounding boxes, which serve as explicit intermediate reasoning steps in models like MedGround-R1 (Xu et al., 2025a). For higher clinical precision, fine-grained grounding with pixel-level segmentation masks is employed. PRSMED (Trinh et al., 2025), for instance, trains on precise masks and requires the model to answer questions about these specific regions. The most sophisticated strategy is semantically-rich grounding, where visual regions are linked to a formal medical vocabulary. AOR (Li et al., 2025c) does this by aligning image areas with concepts from an anatomical ontology, allowing the model to reason not just about "where" a finding is, but also "what" it is in a structured, clinically meaningful way.
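
The three granularities can be captured in a single training-sample schema, sketched below; the field names and the ontology identifier are illustrative assumptions rather than any cited dataset's actual format.

```python
# Illustrative schema for position-aware training data at three spatial
# granularities; field names and the ontology code are assumptions.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GroundedSample:
    image_path: str
    question: str
    answer: str
    bbox: Optional[Tuple[int, int, int, int]] = None  # coarse: (x, y, w, h)
    mask_path: Optional[str] = None                    # fine: pixel-level mask
    ontology_id: Optional[str] = None                  # semantic: anatomy concept

sample = GroundedSample(
    image_path="cxr_0412.png",
    question="Is there a consolidation in the right lower lobe?",
    answer="Yes, a focal consolidation is visible.",
    bbox=(312, 540, 128, 96),
    ontology_id="anat:right_lower_lobe",  # hypothetical ontology identifier
)
```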

4.1.2 Reinforcement Learning (RL)

If SFT provides raw capability, RL is the alignment phase that sculpts it to fit the nuanced goals of clinical practice: safety, accuracy, and efficiency. The core challenge in applying RL is defining "good" clinical reasoning, which has led to a spectrum of feedback strategies, from holistic human judgment to granular, automated metrics. At one end of this spectrum lies alignment with subjective, qualitative feedback. RL with Human Feedback (RLHF) directly captures complex clinical values by training on physician preferences. For instance, Med-PaLM 2 (Singhal et al., 2025) was optimized using preference rankings from a diverse panel of physicians, allowing it to learn intangible qualities like diagnostic prudence and safety. To address the significant cost and scalability limitations of RLHF, RL with AI Feedback (RLAIF) has emerged as a pragmatic alternative. HuatuoGPT-o1 (Chen et al., 2024a), for example, uses GPT-4o to provide scalable, binary reward signals on answer correctness, using a powerful AI as a proxy for human judgment.
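
In its simplest form, an RLAIF reward reduces to asking a judge model for a verdict and mapping it to {0, 1}. The sketch below assumes a hypothetical `judge` callable wrapping a strong general model; the prompt wording is illustrative.

```python
# RLAIF-style binary reward: a judge model's verdict mapped to {0, 1};
# `judge` is a hypothetical callable wrapping a strong general model.
def rlaif_reward(judge, question: str, response: str, reference: str) -> float:
    verdict = judge(
        f"Question: {question}\nReference answer: {reference}\n"
        f"Model answer: {response}\nReply with exactly CORRECT or INCORRECT."
    )
    return 1.0 if verdict.strip().upper().startswith("CORRECT") else 0.0
```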

At the other end of the spectrum lies optimization against objective, quantitative metrics using Structured Rewards. This engineering-driven approach offers scalability and reproducibility by defining explicit, measurable goals. This has become a powerful trend, with the policy optimization algorithm GRPO being widely used to train models on specific criteria. These include multifaceted rationale quality (e.g., accuracy, coherence, and knowledge coverage in ClinRaGen (Niu et al., 2025)), precise multimodal grounding (e.g., spatial and semantic consistency in MedGround-R1 (Xu et al., 2025a)), and task-specific performance across diverse modalities and question types (e.g., Med-R1 (Lai et al., 2025), GMAI-VL-R1 (Su et al., 2025)).
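
Such structured rewards are typically weighted sums of measurable components. The sketch below combines answer accuracy, a crude rationale-length proxy for coherence, and bounding-box IoU for spatial consistency; the components and weights are illustrative assumptions, not the exact criteria of the cited systems.

```python
# Composite structured reward: a weighted sum of accuracy, a crude
# coherence proxy, and bounding-box IoU; all components and weights are
# illustrative, not the cited systems' criteria.
def iou(a, b) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def structured_reward(sample: dict, weights=(0.5, 0.3, 0.2)) -> float:
    w_acc, w_coh, w_grd = weights
    r_acc = float(sample["answer"] == sample["gold"])       # correctness
    r_coh = min(len(sample["rationale_steps"]) / 5.0, 1.0)  # coherence proxy
    r_grd = iou(sample["pred_box"], sample["gold_box"])     # grounding
    return w_acc * r_acc + w_coh * r_coh + w_grd * r_grd
```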

Perhaps the most profound insight from this line of work is that RL can act as an "emergence engine" for complex reasoning. Studies like AlphaMed (Liu et al., 2025b) and BioMed-R1 (Thapa et al., 2025) demonstrate that by using simple, objective rewards (like multiple-choice accuracy) and focusing on a curated set of difficult problems, sophisticated reasoning capabilities can emerge without being explicitly taught via CoT distillation. This crucial finding challenges the "bigger is better" paradigm, suggesting a viable path toward creating smaller, more efficient, yet highly capable medical reasoning models.
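
The recipe itself is strikingly simple, as the sketch below shows: a bare multiple-choice accuracy reward plus a filter that keeps only questions a baseline model frequently misses. The 0.4 difficulty threshold is an assumption for illustration.

```python
# "Simple reward, hard problems": bare multiple-choice accuracy as the RL
# signal, plus a difficulty filter; the 0.4 threshold is an assumption.
def mc_reward(pred: str, gold: str) -> float:
    return 1.0 if pred.strip().upper() == gold.strip().upper() else 0.0

def curate_hard(questions: list, baseline_accuracy: dict, max_acc: float = 0.4):
    """Keep only questions a baseline model solves less than `max_acc` of the time."""
    return [q for q in questions if baseline_accuracy[q["id"]] < max_acc]
```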

4.2 Test-time Techniques: Achieving Agility and Verifiability

In contrast to costly retraining, test-time techniques offer a flexible, low-cost way to steer the reasoning of pre-trained models. These on-the-fly mechanisms represent a conceptual shift from viewing the LLM as a static oracle to treating it as a dynamic reasoning component. The strategies show a clear progression in sophistication, from simple input shaping to complex, multi-agent orchestration.

4.2.1 Prompt-based Reasoning Elicitation

This foundational technique uses structured prompts for "cognitive steering," compelling the model to externalize its latent thought process into an explicit, step-by-step format. The approach has evolved from generic Chain-of-Thought (CoT) prompting (Nachane et al., 2024) to domain-specific variants that emulate expert workflows. These include Clinical CoT (Kwon et al., 2023), Diagnostic Reasoning CoT (DR-CoT) (Wu et al., 2024a), and the formalized five-step Chain of Diagnosis (CoD) (Chen et al., 2024b), which breaks down diagnosis into explicit steps like symptom analysis and diagnostic testing. For more complex problems, techniques like least-to-most prompting decompose tasks into simpler sub-problems (He et al., 2024), while other methods use iterative questioning to verify claims (Vladika et al., 2025) or extend these concepts to orchestrate multimodal analysis (Wei et al., 2024b; Zhou et al., 2025).
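
As one concrete pattern, least-to-most prompting can be sketched as two passes: decompose the case into sub-questions, then answer them in order while accumulating context. The `llm` callable below is a placeholder, and the prompt wording is illustrative rather than taken from the cited work.

```python
# Least-to-most prompting sketch: decompose, answer sub-questions in order
# while accumulating context, then answer the original question. `llm` is a
# placeholder callable; the prompt wording is illustrative.
def least_to_most(llm, case: str) -> str:
    sub_questions = llm(
        f"Break this clinical question into simpler sub-questions, "
        f"one per line:\n{case}"
    ).splitlines()
    context = ""
    for q in sub_questions:
        answer = llm(f"{context}\nCase: {case}\nSub-question: {q}\nAnswer:")
        context += f"\nQ: {q}\nA: {answer}"
    return llm(f"{context}\nNow answer the original question: {case}")
```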

4.2.2 Reasoning Selection and Aggregation

To mitigate the inherent stochasticity of LLM outputs, this pillar improves robustness by generating and evaluating multiple reasoning paths. The methods represent different points on a spectrum of computational cost versus performance gain. At the higher-cost end, self-consistency (Wang et al., 2023) and ensemble reasoning (Lucas et al., 2024) generate multiple candidate responses by introducing randomness during decoding and then select the most frequent or highest-quality answer via majority vote. Other methods invest more computation into a single, more exhaustive path through test-time scaling (Huang et al., 2025a,b). Alternatively, test-time adaptation uses a small, lightweight model like MedAdapter (Shi et al., 2024) as a post-hoc ranker to score and select the most clinically plausible solution from a pool of candidates generated by a much larger base model, achieving significant gains with minimal overhead.
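
Self-consistency, the simplest of these patterns, amounts to sampling several stochastic completions and majority-voting the extracted answers, as sketched below. The `llm` callable and the `extract_answer` parser are placeholder assumptions.

```python
# Self-consistency sketch: sample several reasoning paths at non-zero
# temperature and majority-vote the extracted answers. `llm` and the
# answer parser are placeholder assumptions.
from collections import Counter

def extract_answer(completion: str) -> str:
    # Placeholder parser: take the text after the last "Answer:" marker.
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistency(llm, prompt: str, n_samples: int = 5) -> str:
    answers = [extract_answer(llm(prompt, temperature=0.8))  # stochastic decoding
               for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]  # majority vote
```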

4.2.3 Knowledge-Enhanced Reasoning

These techniques address the critical issues of hallucination and outdated knowledge by grounding the model’s parametric memory in verifiable, external facts. The strategies fall into two main categories. The first is "just-in-time" contextualization via Retrieval-Augmented Generation (RAG). Before answering a question, the model first queries a medical database or text corpus for relevant information, then integrates this retrieved text into its context to generate a factually grounded answer (Hammane, 2024; Jeong et al., 2024; Zhan et al., 2024). The second strategy is "just-in-place" guidance, which uses structured knowledge to constrain the generation process. In-Context Padding (ICP) (Wu et al., 2024b), for instance, injects structured "knowledge seeds" (e.g., ‘(headache, is_symptom_of, migraine)’) from a knowledge graph directly into the LLM’s context, guiding its generation along a logically sound and verifiable path.
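
The "just-in-place" pattern can be sketched as serializing retrieved knowledge-graph triples into the prompt before generation. The function below is written in the spirit of In-Context Padding rather than as its exact implementation; the prompt wording is an assumption.

```python
# "Just-in-place" knowledge injection: serialize retrieved knowledge-graph
# triples into the prompt before generation; in the spirit of In-Context
# Padding, not its exact implementation.
def pad_with_knowledge(question: str, triples) -> str:
    seeds = "\n".join(f"({h}, {r}, {t})" for h, r, t in triples)
    return (
        "Use the following verified medical facts as anchors for your "
        f"reasoning:\n{seeds}\n\nQuestion: {question}\nReason step by step."
    )

prompt = pad_with_knowledge(
    "A patient reports recurrent unilateral headaches with photophobia. "
    "What is the likely diagnosis?",
    [("headache", "is_symptom_of", "migraine"),
     ("photophobia", "is_symptom_of", "migraine")],
)
```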

4.2.4 Multi-agent Reasoning Systems

This frontier represents a paradigm shift from a monolithic intelligence to a distributed, specialized cognitive architecture where the LLM acts as an "orchestrator." This approach decomposes complex problems into tasks solved by multiple collaborating agents. We see two primary forms of collaboration. Collaborative deliberation frameworks simulate peer review; for example, one agent might act as a ’proposer’ suggesting a diagnosis, while another acts as a ’critic,’ challenging the evidence to force a more robust conclusion (Tang et al., 2024; Hong et al., 2024). Functional decomposition frameworks assign tasks to agents with specialized tools. This allows a central orchestrator to delegate sub-tasks to an ’imaging agent’ that can call a segmentation model (Fallahpour et al., 2025), a ’data agent’ that can execute database queries, or a ’trial agent’ that can parse clinical trial documents (Yue et al., 2024). Supported by dedicated training environments (Xu et al., 2025b), this modular approach makes the entire reasoning process transparent and auditable by design (Gu et al., 2024; Zhu and Wu, 2025).
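
The proposer/critic pattern of collaborative deliberation can be sketched as a short loop in which the critic either concedes or returns an objection the proposer must address. Both callables and the AGREE stopping convention below are illustrative assumptions, not a cited framework's protocol.

```python
# Proposer/critic deliberation loop: the critic either concedes (AGREE) or
# returns an objection the proposer must address. Both callables and the
# stopping convention are illustrative assumptions.
def deliberate(proposer, critic, case: str, rounds: int = 3) -> str:
    diagnosis = proposer(f"Propose a diagnosis with supporting evidence:\n{case}")
    for _ in range(rounds):
        critique = critic(
            f"Case:\n{case}\nProposed diagnosis:\n{diagnosis}\n"
            "Challenge the evidence. Reply AGREE if you find no flaw."
        )
        if critique.strip().upper().startswith("AGREE"):
            break  # consensus reached
        diagnosis = proposer(
            f"Case:\n{case}\nYour diagnosis:\n{diagnosis}\n"
            f"A colleague objects:\n{critique}\nRevise or defend it."
        )
    return diagnosis
```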
