HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

Paper · arXiv 2412.18925 · Published December 25, 2024

The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLMs. Yet most research on reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike verification in mathematics. To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. This verifiable nature enables advancements in medical reasoning through a two-stage approach: (1) using the verifier to guide the search for a complex reasoning trajectory for fine-tuning LLMs, and (2) applying reinforcement learning (RL) with verifier-based rewards to further enhance complex reasoning. Finally, we introduce HuatuoGPT-o1, a medical LLM capable of complex reasoning, which outperforms general and medical-specific baselines using only 40K verifiable problems.
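The core idea of a verifiable medical problem is that a model's final answer can be checked against a known ground truth. The sketch below illustrates this with a simple exact-match verifier; the actual paper uses an LLM-based verifier, and the problem text and `normalize`/`verify` helpers here are illustrative assumptions, not the paper's implementation.

```python
def normalize(answer: str) -> str:
    """Lowercase and strip punctuation so trivial formatting differences still match."""
    return "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace()).strip()

def verify(model_answer: str, ground_truth: str) -> bool:
    """Return True iff the model's final answer matches the known correct answer."""
    return normalize(model_answer) == normalize(ground_truth)

# An illustrative verifiable problem: a question paired with a checkable answer.
problem = {
    "question": "Which drug is first-line for this presentation?",  # illustrative
    "ground_truth": "Metformin",
}

print(verify("metformin.", problem["ground_truth"]))  # True
print(verify("Insulin", problem["ground_truth"]))     # False
```

Because the check is binary (True/False), the same verifier can serve both as a search signal in Stage 1 and as a reward signal in Stage 2.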

Medical tasks often involve complex reasoning [10–12]. In real-world medical diagnoses or decisions, doctors often deliberate carefully. Such life-critical field necessitates meticulous thinking to ensure more reliable answers [13, 14]. Additionally, the medical domain offers unique advantages: compared to general domains, the medical domain is generally narrower in scope and easier to verify. Furthermore, medical reasoning closely resembles real-world applications in fields like finance, law, education, and security, making advancements in this area readily transferable [15, 16].

Stage 1: Learning Complex Reasoning We construct complex reasoning trajectories through strategy-based searches guided by verifier feedback (True or False). The LLM first initializes a CoT. If the verifier rejects the current CoT, the model extends the CoT by applying a strategy sampled from Backtracking, Exploring New Paths, Verification, and Correction, until a correct answer is produced. Successful reasoning trajectories are then used to fine-tune the LLM, enabling it to develop complex reasoning skills that embody iterative reflection.
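The Stage 1 search loop can be sketched as follows. `llm_extend` is a hypothetical stand-in for the LLM call that produces a (CoT, answer) pair, and the toy model and exact-match verifier exist only so the sketch runs; the retry budget and strategy sampling are simplified assumptions about the paper's procedure.

```python
import random

STRATEGIES = ["Backtracking", "Exploring New Paths", "Verification", "Correction"]

def search_trajectory(problem, llm_extend, verifier, max_tries=4, seed=0):
    """Verifier-guided search for a correct reasoning trajectory (sketch).

    llm_extend(problem, cot, strategy) returns (new_cot, answer);
    strategy=None requests the initial CoT.
    """
    rng = random.Random(seed)
    cot, answer = llm_extend(problem, cot=None, strategy=None)  # initial CoT
    for _ in range(max_tries):
        if verifier(answer, problem["ground_truth"]):
            return cot  # successful trajectory, kept for fine-tuning
        strategy = rng.choice(STRATEGIES)  # extend with a sampled strategy
        cot, answer = llm_extend(problem, cot=cot, strategy=strategy)
    return None  # discarded: no correct answer within the budget

# Toy stand-ins so the sketch runs: a "model" that finds the right answer
# after two extension steps, and an exact-match verifier.
def toy_llm_extend(problem, cot, strategy):
    new_cot = (cot or "") + f"[{strategy or 'Init'}]"
    n_extensions = new_cot.count("[") - 1
    answer = problem["ground_truth"] if n_extensions >= 2 else "wrong"
    return new_cot, answer

def exact_match(answer, truth):
    return answer == truth

traj = search_trajectory({"ground_truth": "Metformin"}, toy_llm_extend, exact_match)
print(traj is not None)  # True: a successful trajectory was found
```

Only trajectories that end in a verified-correct answer survive, so the fine-tuning data consists entirely of reasoning chains that demonstrate productive backtracking, exploration, verification, or correction.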

Stage 2: Enhancing Complex Reasoning with RL After the model acquires complex reasoning skills, reinforcement learning (RL) further refines this ability. Specifically, sparse rewards provided by the verifier guide self-improvement using the Proximal Policy Optimization (PPO) algorithm.
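In the sparse-reward setup, the verifier's binary judgment is typically assigned to the final token of the generated response, with all earlier tokens receiving zero, and PPO propagates that signal back through the policy. The sketch below shows that reward shaping; the reward values and function signature are illustrative assumptions, not the paper's exact scheme.

```python
def verifier_reward(response_tokens, answer, ground_truth, verify):
    """Sparse per-token reward for PPO: zero everywhere except the final
    token, which receives the verifier's judgment (illustrative values)."""
    final = 1.0 if verify(answer, ground_truth) else 0.0
    return [0.0] * (len(response_tokens) - 1) + [final]

rewards = verifier_reward(
    ["The", "answer", "is", "Metformin"],  # generated response tokens
    "Metformin",                           # extracted final answer
    "Metformin",                           # ground truth
    lambda a, t: a == t,                   # stand-in verifier
)
print(rewards)  # [0.0, 0.0, 0.0, 1.0]
```

In practice this reward is usually combined with a per-token KL penalty against the reference policy (as in standard RLHF-style PPO) to keep the policy from drifting, though that term is omitted here for brevity.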