Can models identify what information they actually need?
When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly, formalizing something that real-world deployment requires but most benchmarks ignore.
The benchmark presents reasoning tasks (logic, planning, math) in which exactly one piece of information is withheld; the model must select the correct clarifying question from a set of candidates (a hypothetical item in this format is sketched below). The key finding: while current models excel on the math variants (GSM-Q, GSME-Q), they reach only 40-50% accuracy on Logic-Q and Planning-Q.
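To make the format concrete, here is a hypothetical item in the QuestBench style. The problem, candidate questions, and answer index are invented for illustration; none of this is drawn from the benchmark itself.

```python
# Hypothetical QuestBench-style item (invented for illustration, not an
# actual benchmark example). One quantity is withheld from the problem,
# and exactly one candidate question recovers it.
item = {
    "problem": (
        "Ava buys 3 notebooks and 5 pens. Each notebook costs $4. "
        "How much does she spend in total?"
    ),
    "choices": [
        "How many notebooks does Ava buy?",  # already stated
        "What does a pen cost?",             # the withheld variable
        "What does a notebook cost?",        # already stated
        "Does Ava pay with cash or card?",   # irrelevant to the total
    ],
    "answer": 1,  # index of the single sufficient clarifying question
}
```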
The critical insight is the separability result: models that solve the fully specified version of a problem still fail to identify the right question when one variable is withheld. Problem-solving capability and information-gathering capability are distinct cognitive operations; the ability to execute reasoning when all inputs are present does not transfer to recognizing which input is absent.
This extends Why do reasoning models overthink ill-posed questions? from a complementary angle. That note documents the behavioral response to missing information (overthinking, redundant self-doubt); this one documents the diagnostic failure: models cannot even identify what is missing, let alone respond appropriately. Together they describe a two-part deficit:
- Cannot detect what information is needed (QuestBench)
- Cannot disengage when information is absent (missing premises overthinking)
The connection to Can language models recognize when text is deliberately ambiguous? is structural: both involve recognizing that the current input is insufficient for a definitive answer. Ambiguity recognition asks "is this input multiply interpretable?" while information gathering asks "is this input incomplete?" Both require meta-reasoning about the input rather than reasoning within it.
The formalization as a constraint satisfaction problem (CSP) with missing variable assignments is useful: it defines information gathering as identifying the minimal necessary question, a well-defined optimization target (sketched below). This separates the problem from subjective clarification tasks where multiple valid questions exist.
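A minimal sketch of what "minimal necessary question" means under this framing, assuming brute-force enumeration over finite domains. The function names and the toy instance are my own, not QuestBench's implementation: a variable is the right thing to ask about when any answer to it pins down the target.

```python
from itertools import product

def possible_targets(domains, constraints, known, target):
    """Values the target can take across all assignments consistent with
    the constraints and the known partial assignment."""
    free = [v for v in domains if v not in known]
    values = set()
    for combo in product(*(domains[v] for v in free)):
        assignment = {**known, **dict(zip(free, combo))}
        if all(c(assignment) for c in constraints):
            values.add(assignment[target])
    return values

def minimal_question(domains, constraints, known, target):
    """Return an unknown variable whose value, whatever it turns out to be,
    uniquely determines the target; None if no single question suffices."""
    if len(possible_targets(domains, constraints, known, target)) == 1:
        return None  # already fully determined; no question needed
    unknowns = [v for v in domains if v not in known and v != target]
    for q in unknowns:
        if all(len(possible_targets(domains, constraints,
                                    {**known, q: val}, target)) == 1
               for val in domains[q]):
            return q
    return None

# Toy instance: x = a + b, with a known, b withheld, c a distractor.
domains = {"a": [2], "b": [1, 2, 3], "c": [0, 1], "x": list(range(10))}
constraints = [lambda s: s["x"] == s["a"] + s["b"]]
print(minimal_question(domains, constraints, {"a": 2}, "x"))  # -> b
```

The brute-force search is exponential in the number of free variables, which is fine for an illustration; the point is that under the CSP framing, "the right question" has a checkable definition rather than a subjective one.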
Source: Reasoning Logic Internal Rules
Related concepts in this collection
- Why do reasoning models overthink ill-posed questions?
  Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions, and whether this represents a fixable training deficit or an inherent limitation.
  behavioral response to missing info; this note is the diagnostic failure
- Can language models recognize when text is deliberately ambiguous?
  Explores whether LLMs can identify and handle multiple valid interpretations of a single phrase, a core human language skill that appears largely absent in current models despite their fluency on standard tasks.
  shared structure: recognizing input insufficiency
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say "I don't know"? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  reasoning training suppresses both abstention and information gathering
- Why do LLMs struggle to connect unrelated entities speculatively?
  LLMs reliably organize and summarize evidence but fail when asked to speculate about connections between dissimilar entities. Understanding this failure could reveal fundamental limits in how models handle complex analytical reasoning.
  evidence organization (well-specified) vs. hypothesis generation (underspecified) is the same split
- Can models learn to ask clarifying questions instead of guessing?
  Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
  proactive critical thinking is the trainable solution to the information-gathering deficit: RL training raises missing-information detection from 0.15% to 73.98%, directly addressing the capability gap QuestBench identifies
- How do users actually form intent when prompting AI systems?
  Users face a "gulf of envisioning": they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.
  intent maturation requires recognizing what information is missing from underspecified user expressions, which is exactly the capability QuestBench shows models lack
- Why do language models lose performance in longer conversations?
  Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.
  the Mediator-Assistant architecture addresses the QuestBench deficit by separating intent understanding (where missing-information detection is needed) from task execution (where well-specified reasoning suffices)
- Does training objective determine which direction models fail at abstention?
  Calibration failures might not be universal; different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
  under-abstention compounds the underspecification problem: reasoning-trained models are both unable to identify missing information (this note) and trained to force answers regardless (that note), creating a compound failure on underspecified inputs
- Why do language models fail in gradually revealed conversations?
  Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
  the conversational manifestation of the information-gathering deficit: when instructions arrive gradually (the normal case), models that cannot identify what's missing make premature assumptions instead, producing the 39% multi-turn degradation
- Why do users drift away from their original information need?
  When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
  the user-side complement: QuestBench shows AI cannot identify what information is missing; ASK shows users cannot articulate what knowledge they lack; when both sides of the interaction have information-gathering deficits, neither can help the other resolve underspecification
- Why do AI agents misalign with what users actually want?
  UserBench explores how often AI models fully understand user intent across multi-turn interactions. The study reveals that human communication is underspecified, incremental, and indirect, traits that current models struggle with because they demand active goal clarification.
  UserBench quantifies the practical cost of the information-gathering deficit: models that cannot identify missing information from underspecified tasks achieve only 20% full intent alignment because three core traits of user communication (underspecification, incrementality, indirectness) demand exactly the capability QuestBench shows models lack
Original note title: solving well-specified reasoning problems is insufficient for identifying missing information in underspecified tasks