Can question-only features replace model uncertainty checks at scale?

This explores whether you can decide when an AI needs help, such as when to look something up, by reading features of the *question itself* instead of asking the model how confident it is. The corpus has a genuine disagreement sitting right at the center of this question, which is the interesting part. One line of work shows that cheap, external question features (27 lightweight signals computed from the query alone, no model introspection required) can match complex uncertainty-based methods for deciding when to retrieve, and actually *beat* them on hard, multi-part questions, all at a fraction of the cost (see: Can question features alone predict when to retrieve?). That's the case for 'yes, at scale, question-only features are enough.'
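
To make the question-only idea concrete, here is a minimal sketch of a retrieval gate that looks only at the query text. The source does not list the 27 signals, so the five features below, the weights, the bias, and the function names are all hypothetical stand-ins; a real predictor would be fit on labeled data.

```python
import math
import re

# Hypothetical features: illustrative stand-ins, not the 27 signals from the source.
def question_features(question: str) -> dict[str, float]:
    tokens = re.findall(r"[A-Za-z0-9']+", question)
    lowered = [t.lower() for t in tokens]
    return {
        "n_tokens": float(len(tokens)),
        # Capitalized tokens after the first word: a crude proxy for named entities.
        "n_capitalized": float(sum(t[0].isupper() for t in tokens[1:])),
        "n_digits": float(sum(t.isdigit() for t in tokens)),
        # Conjunctions hint at multi-part questions.
        "n_conjunctions": float(sum(t in {"and", "or", "then", "before", "after"} for t in lowered)),
        "is_comparison": float(any(t in {"older", "larger", "more", "than", "both"} for t in lowered)),
    }

# Made-up weights for illustration only; in practice these would be learned.
WEIGHTS = {
    "n_tokens": 0.05,
    "n_capitalized": 0.6,
    "n_digits": 0.3,
    "n_conjunctions": 0.8,
    "is_comparison": 1.0,
}
BIAS = -2.0

def retrieval_score(question: str) -> float:
    """Logistic score in (0, 1); higher means retrieval looks more useful."""
    feats = question_features(question)
    z = BIAS + sum(WEIGHTS[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def should_retrieve(question: str, threshold: float = 0.5) -> bool:
    return retrieval_score(question) >= threshold
```

Nothing here calls the model, which is the whole cost argument: the gate runs before any generation happens.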

But the opposing result is just as strong. When you measure uncertainty *well*, using calibrated token probabilities rather than expensive multi-call heuristics, the model's own sense of what it doesn't know turns out to be more reliable than external heuristics: it beats multi-call adaptive retrieval on single-hop tasks and matches it on multi-hop, using a fraction of the LM and retriever calls (see: Can simple uncertainty estimates beat complex adaptive retrieval?). So the honest answer isn't 'question features win'. The real contest is between *cheap, well-calibrated* signals on both sides. A bad uncertainty check (slow, miscalibrated, many model calls) loses to question features; a good one wins. The deciding variable is calibration quality, not which signal you consult.
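
As a rough illustration of what 'calibration quality' means in practice, the sketch below turns token log-probabilities into a single confidence score, gates retrieval on it, and measures expected calibration error (ECE) on held-out examples. The function names, the geometric-mean aggregation, and the 0.7 threshold are assumptions chosen for illustration, not the method used in the cited work.

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability of a generated answer (one model call)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def should_retrieve(token_logprobs: list[float], threshold: float = 0.7) -> bool:
    # Retrieve only when the model's own confidence falls below the threshold.
    return sequence_confidence(token_logprobs) < threshold

def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A low ECE means a confidence of 0.8 really does correspond to being right about 80% of the time, which is what makes a fixed threshold trustworthy.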

What makes the model-confidence side compelling is how far that internal signal reaches beyond retrieval. The same token-probability confidence can serve as a *reward* that improves reasoning while repairing the calibration that RLHF tends to degrade (see: Can model confidence work as a reward signal for reasoning?), and it can replace external verifiers entirely when training reasoning models in domains where you have no answer key (see: Can model confidence alone replace external answer verification?). Confidence even predicts something a question feature can't see: how robust the model will be to having the prompt reworded (see: Does model confidence predict robustness to prompt changes?). Question-only features are blind to all of that, because they describe the input, not the model's grip on it.
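
The confidence-as-reward idea can be sketched as ranking candidate reasoning traces by how confident the model was in the final answer span, then emitting (chosen, rejected) pairs for preference training. This is a simplified illustration only: the `Trace` type, the `min_gap` filter, and the pairing rule are hypothetical and should not be read as the actual RLSF procedure.

```python
import math
from dataclasses import dataclass

@dataclass
class Trace:
    text: str                     # full reasoning trace
    answer_logprobs: list[float]  # log-probs of the final-answer tokens only

def answer_confidence(trace: Trace) -> float:
    """Geometric-mean probability over the answer span."""
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))

def preference_pairs(traces: list[Trace], min_gap: float = 0.1) -> list[tuple[str, str]]:
    """Build (chosen, rejected) pairs from traces whose confidence differs clearly."""
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    pairs = []
    for i, chosen in enumerate(ranked):
        for rejected in ranked[i + 1:]:
            # Skip near-ties (assumed filter), since they carry little preference signal.
            if answer_confidence(chosen) - answer_confidence(rejected) >= min_gap:
                pairs.append((chosen.text, rejected.text))
    return pairs
```

No human label or external verifier appears anywhere in the loop; the only supervision is the model's own probability on its answer.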

The catch is that this internal signal is real but *undertrained*. Small models given uncertainty-aware objectives and an explicit 'I don't know' option can match models ten times larger on conversation forecasting by abstaining when they should (see: Can models learn to abstain when uncertain about predictions?), which suggests the capability exists in standard LLMs but is left dormant. So whether you can 'replace uncertainty checks at scale' partly depends on whether you've trained the model to know what it doesn't know. If you haven't, question features are the safer cheap bet; if you have, the internal signal carries more.
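
At inference time, an explicit 'I don't know' option amounts to a thresholded decision plus a report of how often the model answered and how accurate it was when it did. The sketch below assumes the model exposes a probability per label; the names, the 0.75 threshold, and the labels in the usage example are hypothetical, and the training objective itself is not shown.

```python
ABSTAIN = "I don't know"

def predict_or_abstain(probs: dict[str, float], threshold: float = 0.75) -> str:
    """Return the top label only if its probability clears the threshold."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= threshold else ABSTAIN

def selective_report(predictions: list[str], gold: list[str]) -> tuple[float, float]:
    """Coverage (share of cases answered) and accuracy on the answered cases."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p != ABSTAIN]
    coverage = len(answered) / len(gold) if gold else 0.0
    accuracy = sum(p == g for p, g in answered) / len(answered) if answered else 0.0
    return coverage, accuracy
```

For example, `predict_or_abstain({"derail": 0.55, "stay": 0.45})` abstains, because neither label clears the threshold. Raising the threshold trades coverage for accuracy, which is the lever a well-calibrated small model can use against a larger one.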

A quietly useful twist: sometimes the question itself is the actual problem, not the model's confidence about it. When users give too little context, models don't get *uncertain*; they confidently fall back on blended training-data priors and produce generic answers (see: Why do large language models produce generic responses to vague queries?). The fix there isn't a better confidence check at all. It's getting the model to *ask a good clarifying question*, which is its own trainable skill (see: Can models learn to ask genuinely useful clarifying questions?). That reframes the whole question: the choice isn't only 'question features vs. uncertainty'. There's a third move, where the system notices the question is the weak link and pushes back on it.

Sources (8 notes)

Can question features alone predict when to retrieve?

Learned predictors using 27 lightweight external question features match complex uncertainty-based methods on overall performance while costing far less, and outperform them on complex questions across 6 QA datasets.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can model confidence alone replace external answer verification?

RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Why do large language models produce generic responses to vague queries?

Unlike social-media context collapse, which flattens multiple audiences, LLM collapse occurs when users provide insufficient contextual scaffolding and models default to blended training-data priors. This distinction suggests remedies should focus on query verification and user-driven context specification rather than platform controls.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.
