Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning
Large language models (LLMs) often fail to ask effective questions under uncertainty, making them unreliable in domains where proactive information-gathering is essential for decision-making. We present ALFA, a framework that improves LLM question-asking by (i) decomposing the notion of a “good” question into a set of theory-grounded attributes (e.g., clarity, relevance), (ii) controllably synthesizing attribute-specific question variations, and (iii) aligning models via preference-based optimization to explicitly learn to ask better questions along these fine-grained attributes. Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs dataset, composed of 17k real-world clinical interactions augmented with 80k attribute-specific preference pairs of follow-up questions, as well as a novel expert-annotated interactive healthcare QA task to evaluate question-asking abilities.
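To make the data construction concrete, an attribute-specific preference pair can be viewed as a tuple of a shared clinical context, a target attribute, and a preferred/rejected question variant. The sketch below is a hypothetical minimal representation for illustration only; the field names and example are assumptions, not the actual MediQ-AskDocs schema:

```python
from dataclasses import dataclass

@dataclass
class QuestionPreferencePair:
    """One attribute-specific preference pair: given the same clinical
    context, `chosen` improves on `rejected` along a single attribute."""
    context: str    # patient-provided information so far
    attribute: str  # e.g., "clarity" or "relevance"
    chosen: str     # follow-up question exhibiting the attribute
    rejected: str   # counterfactual variant lacking the attribute

# Illustrative example (invented, not drawn from the dataset):
pair = QuestionPreferencePair(
    context="Patient reports intermittent chest pain for two weeks.",
    attribute="relevance",
    chosen="Does the pain worsen with exertion or deep breathing?",
    rejected="Have you traveled abroad recently?",
)
```

Pairs of this form can be fed directly to standard preference-based optimization methods, which train the model to prefer `chosen` over `rejected` conditioned on the same context.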
Models trained on medical data encode rich medical knowledge (Singhal et al., 2025; Lewis et al., 2020; Chen et al., 2023; Labrak et al., 2024; Singhal et al., 2023; Brin et al., 2023), but remain limited in instruction-following and multi-hop reasoning (Hager et al., 2024; Arroyo et al., 2024; Nov et al., 2023; Zhang et al., 2014). These models excel on static medical QA benchmarks (Jin et al., 2020; Pal et al., 2022), but recent work has moved away from the static single-turn paradigm and highlights proactive question-asking as a key capability for reliable and effective clinical reasoning (Li et al., 2024; Hu et al., 2024b). Our work furthers this direction by providing an evaluation benchmark built on real data and the first method for training specialized question-asking models.