QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?

Paper · arXiv 2503.22674 · Published March 28, 2025

Large language models (LLMs) have shown impressive performance on reasoning benchmarks like math and logic. While many works have largely assumed well-defined tasks, real-world queries are often underspecified and only solvable by acquiring missing information. We formalize this information gathering problem as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case where only one necessary variable assignment is missing, we can evaluate an LLM’s ability to identify the minimal necessary question to ask. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with partially-observed initial states, (3) GSM-Q: human-annotated grade school math problems with one unknown variable, and (4) GSME-Q: equation-based version of GSM-Q. The LLM must select the correct clarification question from multiple options. While current models excel at GSM-Q and GSME-Q, they achieve only 40-50% accuracy on Logic-Q and Planning-Q. Analysis shows that the ability to solve well-specified reasoning problems is not sufficient for success on our benchmark: models struggle to identify the right question even when they can solve the fully specified version. This highlights the need for specifically optimizing models’ information acquisition capabilities. Code and dataset are available at https://github.com/google-deepmind/questbench.

QuestBench consists of the following multiple-choice problems, along with the correct solution for each (a sketch of one possible item format follows the list):

• Logic-Q: Logical reasoning tasks where the truth value of a missing proposition is needed to determine the correctness of a claim.

• Planning-Q: Blocksworld planning problems in the Planning Domain Definition Language (PDDL) (Ghallab et al., 1998), with partially observed initial states, where one additional observation is needed to disambiguate the shortest path to a goal.

• GSM-Q: Human-annotated grade school math problems in which the value of one variable is unknown.

• GSME-Q: An equation-based variant of GSM-Q in which the word problems are translated into systems of equations.
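
As a concrete illustration, the sketch below shows one way such a multiple-choice item could be represented and scored. The field names and the `pick_question` model stub are hypothetical and are not taken from the released dataset; the sketch is only meant to make the evaluation setup concrete.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QuestBenchItem:
    """Hypothetical representation of one underspecified reasoning task."""
    context: str                     # the underspecified problem statement
    candidate_questions: List[str]   # multiple-choice clarification options
    correct_index: int               # index of the minimal necessary question


def accuracy(items: List[QuestBenchItem],
             pick_question: Callable[[str, List[str]], int]) -> float:
    """Fraction of items where the model picks the right clarifying question."""
    correct = sum(
        pick_question(item.context, item.candidate_questions) == item.correct_index
        for item in items
    )
    return correct / len(items)


if __name__ == "__main__":
    demo = QuestBenchItem(
        context="Janet had some eggs and ate one. How many eggs does she have now?",
        candidate_questions=[
            "How many eggs did Janet eat?",
            "How many eggs did Janet have initially?",
            "What colour were the eggs?",
        ],
        correct_index=1,
    )
    # A trivial stand-in "model" that always picks the first option.
    print(accuracy([demo], lambda ctx, qs: 0))  # -> 0.0
```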

Ambiguity in user requests. Natural language queries often contain ambiguity for a variety of reasons. Prior work has examined ambiguity in the context of semantics (Kuhn et al., 2023b), factual question-answering (Min et al., 2020), task-oriented dialogue intents (Budzianowski et al., 2018; Rastogi et al., 2020; Zhang et al., 2024b), personalized human preferences (Chen et al., 2024a; Handa et al., 2024; Li et al., 2023), and text-to-image generation (Hahn et al., 2025). Chandu et al. (2024) presents a visual question answering benchmark for identifying epistemic and aleatoric uncertainty, though the distinction between the two types of uncertainty can often be unclear. Zhang et al. (2024a) introduces a taxonomy of ambiguity, categorizing it into issues like unfamiliarity and different semantic question types (e.g., “who,” “what,” “where”).

In this paper, we focus on underspecification rather than ambiguity: the user has not provided enough information for the LLM to fulfill the request. This situation can arise because users may not know what information the model lacks, or what information is necessary to complete the task. We evaluate LLMs’ ability to address underspecification in structured reasoning tasks.

Information gathering benchmarks for LLMs. Most existing benchmarks focus on subjective or ambiguous tasks where there may be multiple valid clarifying questions, depending on context and user preference (Aroyo and Welty, 2015; Basile et al., 2021; Davani et al., 2022; Sandri et al., 2023; Wan et al., 2023). Task-oriented dialogue benchmarks (Budzianowski et al., 2018; Rastogi et al., 2020; Zhang et al., 2024b) and preference elicitation tasks (Li et al., 2023) involve inherently subjective problems where no universal “right” question exists. This makes objective evaluation of information-gathering abilities difficult in these settings. In contrast, our work focuses on reasoning tasks with a clearly defined ground truth. For each task, the model needs to ask exactly one question, allowing for reliable evaluation of LLMs’ information-gathering capabilities.

Question-asking methods for LLMs. Several methods have been proposed to enhance LLMs’ ability to ask clarifying questions. These methods primarily address ambiguous or knowledge-based tasks, such as identifying a good recipe (Andukuri et al., 2024) or asking who won a sports event (Pang et al., 2024; Zhang and Choi, 2023). Some approaches directly prompt LLMs to ask clarifying questions (Kuhn et al., 2023a; Li et al., 2023), while others compute information gain to prioritize informative questions (Grand et al., 2024; Handa et al., 2024; Hu et al., 2024; Piriyakulkij et al., 2024). Zhang and Choi (2023) breaks down question-asking into three stages: detecting when clarification is needed, identifying the appropriate question, and responding based on new information. While these methods are promising, they primarily focus on subjective tasks or require substantial user simulation. Our work introduces a new setting that emphasizes generating accurate clarifying questions for underspecified reasoning tasks, where the correct question is objectively determinable.
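
For intuition, the following is a minimal sketch of the information-gain idea used by such methods; it is not the implementation from any of the cited papers. Given a prior over candidate hypotheses and a (here, assumed deterministic) mapping from each hypothesis to the answer a question would elicit, it scores the question by the expected reduction in entropy over hypotheses, and a clarifying question can then be chosen by taking the argmax over candidates.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List


def entropy(probs: List[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(
    hypothesis_probs: Dict[Hashable, float],
    answer_given_hypothesis: Dict[Hashable, Hashable],
) -> float:
    """Expected entropy reduction over hypotheses after hearing the answer to one
    question, assuming each hypothesis deterministically fixes that answer."""
    prior_entropy = entropy(list(hypothesis_probs.values()))

    # Marginal probability of each possible answer.
    answer_probs = defaultdict(float)
    for h, p in hypothesis_probs.items():
        answer_probs[answer_given_hypothesis[h]] += p

    # Expected posterior entropy, one term per possible answer.
    expected_posterior = 0.0
    for answer, p_answer in answer_probs.items():
        posterior = [p / p_answer for h, p in hypothesis_probs.items()
                     if answer_given_hypothesis[h] == answer]
        expected_posterior += p_answer * entropy(posterior)

    return prior_entropy - expected_posterior


if __name__ == "__main__":
    prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
    # Question A separates h1 from {h2, h3}; question B separates nothing.
    question_a = {"h1": "yes", "h2": "no", "h3": "no"}
    question_b = {"h1": "yes", "h2": "yes", "h3": "yes"}
    print(expected_information_gain(prior, question_a))  # 1.0 bit
    print(expected_information_gain(prior, question_b))  # 0.0 bits
```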

3. Problem formulation

Consider the following user request:

Example 3.1. Please solve the math problem: Janet had some eggs (variable 𝑥₀) and ate one (variable 𝑥₁). How many eggs does she have now (target variable 𝑦)?

The word problem can be parsed into the equations 𝑦 = 𝑥₀ − 𝑥₁ and 𝑥₁ = 1. The LLM cannot compute the target variable 𝑦 without knowing the value of 𝑥₀. Other examples can be found in Figures 1 to 3. In these cases, the desired behavior is for the LLM to ask the minimal set of questions that enables it to respond to the user query.
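
To make this concrete, here is a minimal sketch of how the parsed equations from Example 3.1 can be checked for underspecification. It assumes sympy is available and is not drawn from the paper's released code; the ASCII names x0, x1, y correspond to the variables above.

```python
import sympy as sp

x0, x1, y = sp.symbols("x0 x1 y")
equations = [sp.Eq(y, x0 - x1), sp.Eq(x1, 1)]

# Without a value for x0, the system has no unique numeric value for y.
print(sp.solve(equations, [y, x1]))
# -> {y: x0 - 1, x1: 1}: y still depends on the unassigned variable x0.

# After asking for x0 ("How many eggs did Janet have?") and learning x0 = 5:
print(sp.solve(equations + [sp.Eq(x0, 5)], [y, x1, x0]))
# -> {x0: 5, x1: 1, y: 4}: the target is now fully determined.
```

In this view, the minimal clarifying question corresponds to the single unassigned variable whose value makes the target solvable, which is the special case of the CSP formulation evaluated in QuestBench.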