How does ambiguity detection connect to models' ability to ask clarifying questions?
This explores whether a model noticing that something is unclear (ambiguity detection) is the same skill as — or a prerequisite for — actually asking a good clarifying question, and where the two come apart.
This explores whether a model noticing that something is unclear (ambiguity detection) is the same skill as actually asking a good clarifying question, and where the two come apart. The short version from the corpus: detection and asking are separable abilities, and most models are weak at the first and untrained at the second. On detection alone, the picture is grim — GPT-4 correctly disambiguates only 32% of deliberately ambiguous cases against 90% for humans, because it can't hold several interpretations in mind at once Can language models recognize when text is deliberately ambiguous?. So before a model can ask about ambiguity, it often fails to register that ambiguity exists.
But detection turns out to be the easier half. One of the sharpest findings here is that being good at solving a problem does not transfer to noticing what's missing from it: models that ace complete reasoning tasks fall to 40–50% when one variable is withheld and they must figure out what to ask Can models identify what information they actually need?. Information-gathering and problem-execution are different cognitive operations Can models identify what information they actually need?. The same gap shows up as a behavioral failure mode — reasoning models will overthink an ill-posed question for paragraphs rather than say 'this can't be answered,' because training rewards producing reasoning steps but never teaches when to disengage Why do reasoning models overthink ill-posed questions?.
Why don't models just ask? The corpus points to the reward structure, not a missing capability. Standard RLHF optimizes for the immediately helpful next turn, which actively discourages clarifying questions — answering now scores better than asking now even when asking would lead to a better outcome later Why do language models respond passively instead of asking clarifying questions?. Change the objective and the behavior appears: training for proactive critical thinking lifted the rate of correctly flagging flawed problems from near-zero to 74% Can models learn to ask clarifying questions instead of guessing?, and reframing training so conversation is treated as an information source makes clarifying questions emerge even though the model was only ever trained on fully-specified problems Can models learn to ask clarifying questions without explicit training?.
The most interesting twist is that detecting ambiguity and asking a *good* question are themselves separate. It's not enough to fire off any question — quality has structure. The ALFA framework breaks question quality into theory-grounded attributes like clarity, relevance, and specificity, and training on those attributes beats training on a single quality score, especially in clinical reasoning where the right question changes the decision Can models learn to ask genuinely useful clarifying questions?. And small models can be pushed toward better detection through structure rather than scale: a leader-follower debate where one agent proposes interpretations and others challenge them got Mistral-7B to 76.7% on ambiguity detection Can structured debate roles help small models detect ambiguity?.
So the chain is really three links, each a distinct failure point: register that the input is ambiguous, decide it's worth asking instead of guessing, and frame a question that actually closes the gap. A related thread worth pulling is *why* models barrel ahead on vague input — 'context collapse,' where instead of asking, the model fills the silence with blended training-data priors Why do large language models produce generic responses to vague queries?. Calibration sits underneath all of it: models that know when they're uncertain enough to abstain can match models ten times their size Can models learn to abstain when uncertain about predictions? — and abstaining is the quiet cousin of asking.
Sources 10 notes
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
Models achieving high accuracy on complete reasoning tasks drop to 40-50% accuracy identifying what clarifying question to ask when one variable is withheld. Information gathering and problem execution are separable cognitive operations.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.
Models trained via SML on complete problems generalize to underspecified tasks by asking for needed information and delaying answers. The training paradigm instills a meta-strategy of using conversation as an information source, addressing the premature-answering failure mode.
The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.
Mistral-7B achieved 76.7% accuracy in ambiguity detection through a protocol where a leader proposes interpretations and two followers challenge them with rotating roles. Role rotation and consensus forcing prevent persuasive framing failures and create stronger verification than pairwise debate.
Unlike social-media context collapse, which flattens multiple audiences, LLM collapse occurs when users provide insufficient contextual scaffolding and models default to blended training-data priors. This distinction suggests remedies should focus on query verification and user-driven context specification rather than platform controls.
Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.