From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?
While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast, active reasoning—where an LLM must interact with external systems to acquire missing evidence or data—has received little systematic attention. To address this shortfall, we present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM’s active reasoning skills. AR-Bench comprises three task families—detective cases, situation puzzles, and guessing numbers—that together simulate real-world, agentic scenarios and measure performance across commonsense, logical, and symbolic reasoning challenges. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning: they frequently fail to acquire or leverage the information needed to solve tasks. This gap highlights a stark divergence between their passive and active reasoning abilities. Moreover, ablation studies indicate that even advanced strategies, such as tree-based search or post-training approaches, yield only modest gains and fall short of the levels required for real-world deployment. Collectively, these findings highlight the critical need to advance methodology for active reasoning, e.g., by incorporating interactive learning, real-time feedback loops, and environment-aware training objectives. The benchmark is publicly available at: https://github.com/tmlr-group/AR-Bench.
However, mastering passive reasoning alone is insufficient to address many real-world challenges that involve incomplete or partial information. For instance, to turn a rough outline into a suitable itinerary, a travel agent must inquire about the client’s budget, preferences, and available time. Likewise, a doctor must ask a patient about their symptoms and review follow-up examination results to diagnose a disease accurately. We term this paradigm active reasoning (AR): a model is given only partial information and must actively seek the essential missing information by interacting with external information sources to derive the correct solution (see Fig. 1). AR is a broader, dynamic, and interactive framework that integrates questioning, retrieval, and iterative reasoning to address complex problems under incomplete information. The two areas most relevant to AR are proactive questioning (PQ) (Du et al., 2017; Wang et al., 2018; Aliannejadi et al., 2019) and retrieval-augmented generation (RAG) (Lewis et al., 2020; Karpukhin et al., 2020; Guu et al., 2020). Unlike PQ, which focuses exclusively on generating clarifying or exploratory questions, AR incorporates the returned answers and refines its reasoning over multiple steps. In contrast to RAG, which retrieves statically and generates in a single pass, AR adapts dynamically through multi-turn interactions, drawing on diverse information sources to solve problems comprehensively. By integrating questioning (retrieval) and reasoning, AR offers a uniquely holistic task for problem-solving.
Both passive and active reasoning are vital for achieving artificial general intelligence. Passive reasoning excels with complete information, but only active reasoning’s iterative questioning, retrieval, and refinement can solve real-world tasks, e.g., personalized trip planning, medical diagnosis, code debugging, adaptive tutoring, or negotiation coaching, where key details are initially missing. Nonetheless, only a few studies have paid attention to AR (Abdulhai et al., 2023; Deng et al., 2023a;b; Hu et al., 2024; Liu et al., 2024). So far, the AR capabilities of LLMs remain largely underexplored, limiting their potential in numerous agentic applications.
Hence, given that only a few AR datasets exist, it is both necessary and urgent to conduct a systematic evaluation with a new benchmark tailored to active reasoning. In this work, we construct AR-Bench (Active Reasoning Benchmark) to provide a holistic evaluation of LLMs’ capability in active reasoning. AR-Bench comprises three tasks, i.e., detective cases (DC), situation puzzles (SP), and guessing numbers (GN), corresponding to commonsense, logical, and symbolic reasoning, respectively. For evaluation, AR-Bench presents problems that contain only partial information to a questioner model (i.e., the LLM under evaluation). This model is required to actively seek informative clues through a multi-round interaction with answerer agent(s), which correspond to the external information sources in Fig. 1. The quality of 1) the asked questions and 2) the final solution is numerically quantified to assess the model’s AR capability.
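To make this protocol concrete, the following Python sketch outlines one evaluation episode under the setup described above; all class and method names (run_episode, questioner.ask, answerer.respond, judge.score_questions, etc.) are hypothetical placeholders, not AR-Bench’s actual API.

```python
# Illustrative sketch of one AR-Bench-style evaluation episode. All names are
# hypothetical placeholders rather than AR-Bench's actual interface.

def run_episode(questioner, answerer, judge, case, max_rounds=25):
    """Let the questioner gather missing clues over multiple rounds, then
    score both the questioning process and the final solution."""
    history = []  # (question, answer) pairs accumulated so far
    for _ in range(max_rounds):
        # The questioner sees only the partial problem plus the dialogue history.
        question = questioner.ask(case.partial_info, history)
        # The answerer agent plays the external information source and replies
        # based on the full (hidden) case information.
        answer = answerer.respond(case.full_info, question)
        history.append((question, answer))

    solution = questioner.solve(case.partial_info, history)
    return {
        "process_score": judge.score_questions(case, history),  # quality of the asked questions
        "outcome_score": judge.score_solution(case, solution),  # quality of the final solution
    }
```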
Using the constructed AR-Bench, we conduct extensive experiments and reveal several empirical findings:
• Performance Gap: State-of-the-art LLMs (e.g., GPT-4o) and advanced reasoning methods underperform dramatically on AR-Bench, achieving an exact match rate as low as 35% on GN, while human evaluators far exceed the models.
• Early Gains, Late Plateaus: Models make rapid progress in the early interaction rounds (+7.7% process score across rounds 5–10), but gains diminish in later rounds (+2.5% across rounds 20–25) and when question-asking is scaled further.
• Component Shortcomings: Unreliable verifiers and low-quality question generation severely limit search-based strategies, with verifier effectiveness varying by task.
• Scaling Limits: Larger models outperform smaller ones, and more interaction rounds yield measurable gains, but both still fail to fully solve active reasoning tasks.
• Method and Instruction Failures: Common approaches like SFT, DPO, Tree-of-Thought, human-crafted instructions, Proactive CoT, and UoT offer little to no benefit.
• Task-Specific Error Patterns: Models frequently ask vague or repetitive questions and make recurring mistakes, e.g., timeline misinterpretations in DC, unsupported assumptions in SP, and feedback misunderstandings in GN (illustrated in the sketch below).
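As a concrete illustration of the kind of feedback that gets misread in GN, the sketch below assumes a Bulls-and-Cows-style setup in which the answerer reports how many guessed digits are correct and in position versus correct but misplaced; the exact feedback format used by AR-Bench may differ, and gn_feedback is a hypothetical helper rather than part of the benchmark.

```python
# Hypothetical GN-style feedback (Bulls-and-Cows convention); AR-Bench's actual
# feedback format may differ.

def gn_feedback(secret: str, guess: str) -> tuple[int, int]:
    """Return (#digits correct and in position, #digits correct but misplaced)."""
    in_position = sum(s == g for s, g in zip(secret, guess))
    shared = sum(min(secret.count(d), guess.count(d)) for d in set(guess))
    return in_position, shared - in_position

# For secret "1234", the guess "1243" yields (2, 2): two digits are placed
# correctly and two are present but swapped. A questioner that reads the first
# "2" as "only two digits of my guess appear in the secret" has misunderstood
# the feedback and will prune the wrong candidates.
print(gn_feedback("1234", "1243"))  # -> (2, 2)
```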