How can models select the optimal question to ask given multiple uncertainties?
This explores how a model decides which single question is worth asking when many things are unknown at once — not just whether to ask, but how to pick the most valuable question from many candidates.
The cleanest answer in the corpus is to make the model simulate the future: for each candidate question, imagine the possible answers a user might give, score how much each answer would shrink the model's uncertainty, and ask the question whose answers reduce uncertainty the most. That information-gain approach (see "How can models select the most informative question to ask?") turns 'ask a clarifying question' into an optimization problem with a principled objective, rather than a generic 'can you tell me more?' A close cousin appears in personalization, where active learning picks the questions that most sharpen an uncertain estimate of a user's preferences; remarkably, about ten well-chosen questions are enough to pin down someone's reward coefficients (see "Can user preferences be learned from just ten questions?"). Both treat question selection the same way: choose the query that most collapses what you don't yet know.
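The simulate-and-score loop described above can be sketched as an expected information gain calculation. This is a minimal illustration rather than the method of any cited note: the hypothesis set, the answer likelihoods, and every name here (`expected_information_gain`, `best_question`, the toy `questions` table) are assumptions made up for the example.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a dict {hypothesis: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def posterior(prior, likelihood, answer):
    """Bayes update: P(h | answer) is proportional to P(answer | h) * P(h)."""
    unnorm = {h: prior[h] * likelihood[h].get(answer, 0.0) for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()} if z > 0 else dict(prior)

def expected_information_gain(prior, likelihood):
    """Prior entropy minus expected posterior entropy over simulated answers."""
    answers = {a for h in likelihood for a in likelihood[h]}
    gain = entropy(prior)
    for a in answers:
        # Probability of hearing this answer, averaged over hypotheses.
        p_a = sum(prior[h] * likelihood[h].get(a, 0.0) for h in prior)
        if p_a > 0:
            gain -= p_a * entropy(posterior(prior, likelihood, a))
    return gain

def best_question(prior, questions):
    """Pick the question whose simulated answers shrink uncertainty the most."""
    return max(questions, key=lambda q: expected_information_gain(prior, questions[q]))

# Toy setup (hypothetical): four equally likely user intents.
prior = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

# questions[q][h] = {answer: P(answer | h)}; answers are deterministic here.
questions = {
    "split_evenly": {"A": {"yes": 1.0}, "B": {"yes": 1.0},
                     "C": {"no": 1.0}, "D": {"no": 1.0}},
    "single_out":   {"A": {"yes": 1.0}, "B": {"no": 1.0},
                     "C": {"no": 1.0}, "D": {"no": 1.0}},
    "always_yes":   {h: {"yes": 1.0} for h in prior},
}

chosen = best_question(prior, questions)
```

In this toy case the question that splits the hypotheses evenly is worth one full bit, the question that only singles out one hypothesis is worth less, and the question every hypothesis answers the same way is worth nothing. A real system would have to estimate the answer likelihoods, for example by sampling simulated user replies from the model itself.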
But 'most informative' isn't the same as 'best.' A question can maximize information gain and still be vague, off-topic, or impossible to answer. One line of work decomposes question quality into separate attributes (clarity, relevance, specificity) and trains on each rather than on a single blended score, which matters most in high-stakes settings like clinical reasoning where the right clarifying question directly changes the decision (see "Can models learn to ask genuinely useful clarifying questions?"). So the full recipe is two-layered: information gain tells you *which uncertainty to target*, and attribute-level quality tells you *how to phrase the probe* so the answer is actually usable.
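One way to picture the two layers is as a filter followed by a ranking. The sketch below is an inference-time illustration only, not the training procedure described in the cited work: the `Candidate` fields, the assumption that attribute scores lie in [0, 1], and the `min_attr` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    text: str
    info_gain: float    # assumed to come from some information-gain scorer
    clarity: float      # attribute scores in [0, 1], assumed to come from a rater
    relevance: float
    specificity: float

def select_question(candidates: List[Candidate],
                    min_attr: float = 0.6) -> Optional[Candidate]:
    """Layer 1: drop questions that fail any quality attribute.
    Layer 2: among the usable ones, take the highest information gain."""
    usable = [c for c in candidates
              if min(c.clarity, c.relevance, c.specificity) >= min_attr]
    if not usable:
        return None  # nothing worth asking as phrased
    return max(usable, key=lambda c: c.info_gain)

# Hypothetical candidates: the most informative one is too vague to answer.
candidates = [
    Candidate("Can you tell me more?", 0.9, 0.8, 0.7, 0.1),
    Candidate("Did the pain start before or after the new medication?",
              0.7, 0.9, 0.9, 0.9),
    Candidate("What is your favourite colour?", 0.05, 0.9, 0.1, 0.9),
]

picked = select_question(candidates)
```

Using the minimum over attributes, rather than their average, mirrors the point that a single blended score can hide a question that is informative but unusable.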
There's a prior question the corpus insists on: should the model ask at all? Standard RLHF quietly teaches models *not* to ask, because next-turn reward optimization rewards looking helpful right now over discovering what the user actually wants. Rewarding long-term interaction value instead flips this, letting models actively probe for intent (see "Why do language models respond passively instead of asking clarifying questions?"). The mirror-image failure is asking, or reasoning, when you shouldn't: models often grind out elaborate answers to questions with missing premises instead of flagging them as unanswerable, because training rewards producing reasoning steps but never teaches when to disengage (see "Why do reasoning models overthink ill-posed questions?"). Optimal question selection therefore sits between two cliffs: passively answering when it should clarify, and over-engaging when it should stop.
Underneath all of this is the model's sense of its own uncertainty, and the corpus is split on how to measure it. For a related decision, when to retrieve external information, calibrated token-probability uncertainty often beats elaborate multi-call heuristics at a fraction of the cost, suggesting a model's self-knowledge is a reliable signal (see "Can simple uncertainty estimates beat complex adaptive retrieval?"). Yet cheap *external* features of the question alone can rival uncertainty estimation, especially on hard questions (see "Can question features alone predict when to retrieve?"), and confidence itself is a usable signal: high confidence predicts robustness, low confidence predicts wild output swings (see "Does model confidence predict robustness to prompt changes?"). The catch: this whole machinery assumes models can represent uncertainty in the first place. Calibration ability exists but is undertrained. Small models taught uncertainty-aware objectives and given the option to abstain can match models ten times larger (see "Can models learn to abstain when uncertain about predictions?"), and making abstention an explicitly learnable, rewarded action rather than a failure substantially cuts confident-but-wrong answers (see "Can three-way rewards fix the accuracy versus abstention problem?").
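The effect of rewarding abstention can be seen with a little arithmetic. The sketch below uses the three-way scheme from the sources (correct +1, hallucination -1, abstention in between) and assumes, for illustration only, an abstention reward of 0 and a binary baseline that pays 1 for a correct answer and 0 for everything else. All function names are hypothetical.

```python
def three_way_reward(outcome: str, abstain_reward: float = 0.0) -> float:
    """Correct +1, hallucination -1, abstention an intermediate value
    (0.0 is an assumed default, not a figure from the sources)."""
    return {"correct": 1.0, "hallucination": -1.0, "abstain": abstain_reward}[outcome]

def expected_answer_reward(p_correct: float) -> float:
    """Expected three-way reward for answering: p*(+1) + (1-p)*(-1) = 2p - 1."""
    return p_correct * 1.0 + (1.0 - p_correct) * -1.0

def answer_threshold(abstain_reward: float = 0.0) -> float:
    """Answering beats abstaining when 2p - 1 > r, i.e. p > (1 + r) / 2."""
    return (1.0 + abstain_reward) / 2.0

def should_answer(p_correct: float, abstain_reward: float = 0.0) -> bool:
    return expected_answer_reward(p_correct) > abstain_reward

def should_answer_binary(p_correct: float) -> bool:
    """Assumed binary baseline: correct = 1, anything else = 0.
    Expected reward for answering is p, abstaining earns 0,
    so any nonzero chance of being right favours guessing."""
    return p_correct > 0.0

# A model that is 30% sure: guessing pays under binary reward,
# abstaining pays under the three-way reward.
guess_under_binary = should_answer_binary(0.3)
guess_under_three_way = should_answer(0.3)
```

Under these assumptions the three-way reward only favours answering when the model's calibrated chance of being right exceeds one half, which is why the scheme depends on the uncertainty estimate being trustworthy in the first place.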
The through-line you might not have expected: selecting the optimal question is really the same skill as deciding *whether to ask, retrieve, think harder, or abstain*. All of them are routing decisions driven by calibrated uncertainty. The corpus shows models can be trained to route between extended thinking and quick answers without difficulty labels (see "Can models learn when to think versus respond quickly?"), and even to hold several candidate solutions open at once by making their internal reasoning stochastic rather than committing early (see "Can stochastic latent reasoning help models explore multiple solutions?"). Asking the best question, in other words, is one face of a more general competence: knowing precisely what you don't know, and acting on it.
Sources (12 notes)
UoT combines uncertainty-aware scenario simulation with information-gain scoring and reward propagation to identify questions whose possible answers maximally reduce diagnostic uncertainty—providing a principled mechanism for specific, high-value clarification rather than generic prompts.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce uncertainty about a user's reward coefficients. Responses can then be personalized to each user via inference-time reward alignment, without modifying model weights.
The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
Learned predictors using 27 lightweight external question features match complex uncertainty-based methods on overall performance while costing far less, and outperform them on complex questions across 6 QA datasets.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.