INQUIRING LINE

Can models distinguish between ambiguous and incomplete information inputs?

This explores whether models can tell apart two different problems with their inputs: ambiguity (one input that has several valid readings) versus incompleteness (an input missing something the model actually needs to answer).


The corpus treats ambiguity and incompleteness as separate cognitive operations, and the headline is that models handle them unevenly: they are surprisingly weak at noticing ambiguity, and capable but fragile at noticing incompleteness.

On ambiguity, the picture is stark. The AMBIENT benchmark finds that GPT-4 correctly disambiguates only 32% of cases against 90% for humans, across lexical, structural, and scope ambiguity (see "Can language models recognize when text is deliberately ambiguous?"). The diagnosis is that models cannot hold multiple interpretations in mind at once: they collapse to one reading and run with it. That is a different skill from detecting a gap. Work on incompleteness shows that models which ace fully specified reasoning problems drop to 40–50% accuracy when asked to identify the clarifying question to ask after one variable is withheld (see "Can models identify what information they actually need?"). Being good at solving does not transfer to noticing what is absent. So the two skills the question asks about are not just distinct from each other; each is also distinct from raw problem-solving ability.

The encouraging counterweight is that the incompleteness skill is trainable. Reinforcement learning lifted proactive identification of missing information from essentially zero (0.15%) to about 74% on deliberately flawed math problems. The same paper notes that inference-time scaling *degraded* the ability in untrained models and only helped after explicit RL, which marks the skill as learnable but brittle (see "Can models learn to ask clarifying questions instead of guessing?"). There is also an indirect route: a model's own partial answer can surface gaps the original query never expressed, turning generation itself into a gap detector (see "Can a model's partial response guide what to retrieve next?").
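The indirect route, where a draft answer is fed back in as the next retrieval query, can be sketched in a few lines. This is a minimal illustration of the general idea and not the ITER-RETGEN implementation: the function name, the `retrieve` and `generate` callables, the toy corpus, and the choice to append the draft to the question are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def iterative_retrieve_generate(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    rounds: int = 2,
) -> str:
    """Alternate retrieval and generation, reusing each draft as query context."""
    draft = ""
    for _ in range(rounds):
        # The previous draft is appended to the question, so anything it
        # mentions can pull in passages the bare question would not match.
        query = f"{question} {draft}".strip()
        passages = retrieve(query)
        draft = generate(question, passages)
    return draft


# Hypothetical stand-ins for a real retriever and a real language model.
TOY_CORPUS: Dict[str, str] = {
    "capital": "France's capital is Paris.",
    "paris": "Paris sits on the Seine.",
}


def toy_retrieve(query: str) -> List[str]:
    """Return every passage whose key appears in the query (keyword match)."""
    lowered = query.lower()
    return [text for key, text in TOY_CORPUS.items() if key in lowered]


def toy_generate(question: str, passages: List[str]) -> str:
    """Pretend generation: just restate the retrieved passages."""
    return " ".join(passages)
```

With the toy stubs, one round retrieves only the passage about the capital, because the question never mentions Paris. The second round's query includes the first draft, which does mention Paris, so the passage about the Seine becomes reachable.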

Here is the lateral connection worth knowing: ambiguity and incompleteness both reduce to a model's grip on its own uncertainty, but they show up in different signals. Semantic entropy measures uncertainty *over meanings* by clustering sampled answers according to what they mean rather than how they are worded, which is essentially a quantitative read on ambiguity: how many distinct interpretations the model is entertaining (see "Can we detect when language models confabulate?"). Incompleteness, by contrast, is better caught by calibrated token-probability uncertainty, which decides when to abstain or retrieve more (see "Can simple uncertainty estimates beat complex adaptive retrieval?" and "Can models learn to abstain when uncertain about predictions?"). So the corpus implicitly says: meaning-level uncertainty flags ambiguity; confidence-level uncertainty flags missing information. A model that conflates the two will mistake "I have several readings" for "I'm missing a fact," and respond wrongly to both.
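The two signals can be put side by side in a short sketch. It is illustrative only and rests on several assumptions: the function names are invented here, `toy_same_meaning` is a hypothetical placeholder for the bidirectional-entailment check described in the source notes, the entropy is a simple count-based estimate over sampled answers, and `sequence_confidence` is an uncalibrated geometric mean of token probabilities, not the estimator used in the cited work.

```python
import math
from typing import Callable, List, Sequence


def cluster_by_meaning(
    answers: Sequence[str],
    same_meaning: Callable[[str, str], bool],
) -> List[List[str]]:
    """Greedy clustering: an answer joins the first cluster whose first
    member it matches, otherwise it starts a new cluster."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(clusters: Sequence[Sequence[str]]) -> float:
    """Shannon entropy (nats) over the share of samples in each cluster."""
    total = sum(len(c) for c in clusters)
    return -sum((len(c) / total) * math.log(len(c) / total) for c in clusters)


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities for one non-empty answer."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def toy_same_meaning(a: str, b: str) -> bool:
    # Hypothetical placeholder: a real check would ask an entailment model
    # whether each answer entails the other.
    return a.strip().lower() == b.strip().lower()
```

A possible usage: sample several answers to the same prompt, then call `semantic_entropy(cluster_by_meaning(samples, toy_same_meaning))`. High entropy means the samples spread over several meanings, which is the ambiguity-style signal. Low `sequence_confidence` on a single answer is the confidence-style signal that could trigger abstention or retrieval; any threshold would need to be tuned and calibrated on held-out data.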

The quiet warning underneath all of this: models often do not even register the problem. They accommodate false presuppositions to be agreeable rather than flag them (see "Why do language models agree with false claims they know are wrong?"), and they override what is in front of them when training priors are strong (see "Why do language models ignore information in their context?"). Distinguishing ambiguous from incomplete inputs presupposes that the model first noticed something was off, and the corpus suggests that first step is the one most likely to be skipped.


Sources (9 notes)

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Can models identify what information they actually need?

Models achieving high accuracy on complete reasoning tasks drop to 40-50% accuracy identifying what clarifying question to ask when one variable is withheld. Information gathering and problem execution are separable cognitive operations.

Can models learn to ask clarifying questions instead of guessing?

Reinforcement learning increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.

Can a model's partial response guide what to retrieve next?

ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.

Can we detect when language models confabulate?

Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
