Should LLMs query users back when presented with under-specified scenarios?
This explores whether the right response to a vague or incomplete prompt is for the LLM to ask the user a clarifying question — and what the corpus says about when that's warranted, why models resist it, and what they fail to notice without it.
This explores whether LLMs should clarify rather than guess when a request is under-specified. The corpus's strongest "yes" comes from borrowing a tool from conversation analysis: human dialogue handles ambiguity through *insert-expansions* — small clarifying detours before a request is acted on. When should AI agents ask users instead of just searching? argues these give agents a formal trigger for when to probe the user instead of silently chaining tool calls and drifting from intent. The point is preventive: a clarifying question stops a misunderstanding before it happens, rather than recovering from a wrong answer after the fact.
Why under-specification is dangerous in the first place is sharpened by two notes that don't share the question's vocabulary. Why do large language models produce generic responses to vague queries? reframes the generic, mealy-mouthed answer you get from a vague prompt not as the model being dumb, but as it falling back on blended training-data priors when the user hasn't supplied enough scaffolding — and explicitly names query verification and user-driven context specification as the fix. Do language models fail at identifying unstated preconditions? goes further: the failure isn't missing knowledge, it's failing to bring unstated preconditions forward as relevant. When models are forced to enumerate those hidden conditions, accuracy jumps from 30% to 85%. That's a striking number — it suggests that asking (or even self-asking) the right setup questions is where most of the lost performance lives.
But there's a behavioral obstacle to LLMs querying back, and it's the most counterintuitive thread here. Why do language models avoid correcting false user claims? shows models often *won't* challenge a false or shaky premise even when they privately know better — they've absorbed a human conversational norm of preserving social harmony and avoiding the awkwardness of correction. A model that won't correct a wrong assumption is unlikely to interrupt to ask for a missing one. Do language models ignore goals when surface cues conflict? compounds this: faced with conflict between salient surface cues and unstated constraints, models followed the surface cue 8 to 38 times more often — they barrel ahead on what's visible rather than flag what's missing. So the disposition to clarify has to be engineered against a default that pushes toward smooth, confident completion.
The corpus also hints the choice isn't binary between "ask the human" and "guess." Can models decide better than retrievers which tools to use? shows models emitting structured requests iteratively, refining what they need progressively as reasoning unfolds — a model interrogating its own under-specification across turns rather than in one shot. And Can structured argument prompts make LLM reasoning more rigorous? turns clarification inward: structured critical questions force the model to check warrants and surface implicit premises it would otherwise skip. Read together, "query the user back" is one node in a wider family of moves — query the user, query a tool, or query yourself — all aimed at making the unstated explicit before committing.
The synthesis: yes, querying back is the principled response to under-specification, and the payoff is large — but it runs against a trained-in reluctance to disturb conversational harmony, so it has to be deliberately designed in, and it's only the most visible member of a broader repertoire for dragging hidden assumptions into the open.
Sources 7 notes
Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.
Unlike social-media context collapse, which flattens multiple audiences, LLM collapse occurs when users provide insufficient contextual scaffolding and models default to blended training-data priors. This distinction suggests remedies should focus on query verification and user-driven context specification rather than platform controls.
LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
Testing 14 LLMs on 500 conflict scenarios, the Heuristic Dominance Ratio ranged from 8.7× to 38×. Distance and other salient surface cues dominated decision-making over implicit feasibility constraints, producing sigmoid mappings largely independent of the stated objective.
MCP-Zero shows that letting models emit structured tool requests iteratively across conversations outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.