Why do models fail at asking good questions during interaction?
When models must actively seek information through questions rather than receive it passively, they struggle dramatically. This note explores why GPT-4o plateaus at 35% accuracy and whether training or prompting can fix the underlying deficit.
AR-Bench introduces a critical distinction: passive reasoning (all information given, solve the problem) versus active reasoning (information must be sought through interaction). This distinction exposes a capability gap that standard benchmarks completely miss.
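To make the distinction concrete, here is a minimal sketch of the two settings. The `ask_model` and `answer_question` callables are hypothetical stand-ins for a model API and the benchmark's judge; this illustrates the interaction pattern, not AR-Bench's actual harness.

```python
# Minimal sketch of passive vs. active reasoning episodes.
# `ask_model` and `answer_question` are hypothetical stand-ins
# (illustrative only, not AR-Bench's actual interface).

def passive_episode(ask_model, problem: str) -> str:
    # Passive reasoning: every fact is already in the prompt;
    # the model solves in one shot.
    return ask_model(f"Solve: {problem}")

def active_episode(ask_model, answer_question, task: str,
                   max_rounds: int = 25) -> str:
    # Active reasoning: key facts are withheld. The model must acquire
    # them by asking one question per round before answering.
    transcript = []
    for i in range(max_rounds):
        question = ask_model(
            f"Task: {task}\nTranscript so far:\n" + "\n".join(transcript)
            + "\nAsk the single question that best narrows the answer."
        )
        reply = answer_question(question)  # judge/oracle response
        transcript.append(f"Q{i + 1}: {question}\nA{i + 1}: {reply}")
    return ask_model(
        f"Task: {task}\nTranscript:\n" + "\n".join(transcript)
        + "\nState your final answer."
    )
```

The passive episode is what standard benchmarks test; the active episode adds a question-selection problem on top of the underlying task, and it is that added problem where models collapse.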
The results are stark. On number guessing, a task with well-defined information-theoretic structure, GPT-4o achieves only 35% accuracy. The information-gain curve reveals why: models extract 7.7% information gain in rounds 5-10, but this drops to just 2.5% in rounds 20-25. More interaction does not proportionally reduce uncertainty. The models plateau because they cannot formulate increasingly precise questions; they ask vague, repetitive queries that fail to partition the remaining hypothesis space efficiently, as the sketch below makes concrete.
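The plateau can be stated in information-theoretic terms. The sketch below, assuming a uniform prior over a 1..100 number-guessing space, computes the expected information gain of a yes/no question: for a question answered deterministically by the true hypothesis, this equals the entropy of the answer. The candidate questions are invented for illustration, not taken from the benchmark.

```python
import math
from typing import Callable

def expected_information_gain(hypotheses: list[int],
                              predicate: Callable[[int], bool]) -> float:
    """Bits of uncertainty removed, on average, by a yes/no question whose
    answer is predicate(h) for the true hypothesis h (uniform prior)."""
    n = len(hypotheses)
    yes = sum(predicate(h) for h in hypotheses)
    gain = 0.0
    for count in (yes, n - yes):
        if count:
            p = count / n
            gain -= p * math.log2(p)  # entropy of the answer = expected gain
    return gain

space = list(range(1, 101))  # number guessing over 1..100

# A precise question halves the space: ~1 bit, the best a yes/no can do.
print(expected_information_gain(space, lambda h: h <= 50))   # ~1.00

# A vague question splits the space 1/99: almost no information.
print(expected_information_gain(space, lambda h: h == 7))    # ~0.08

# Repeating an already-answered question partitions nothing: exactly 0 bits.
print(expected_information_gain([h for h in space if h <= 50],
                                lambda h: h <= 50))          # 0.0
```

A well-posed yes/no question yields close to 1 bit per round, while vague or repeated questions yield almost nothing; a questioner stuck in the second regime produces exactly the flattening gain curve AR-Bench measures.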
What makes this finding particularly damaging is the intervention analysis. SFT, DPO, Tree-of-Thought, human-written instructions, Proactive CoT, and Uncertainty-of-Thought (UoT) all provide minimal benefit. The active reasoning deficit is not a prompting problem or a fine-tuning problem — it appears to be a structural limitation in how current models represent and reduce uncertainty through sequential interaction.
This connects directly to Can models identify what information they actually need?, which showed models cannot identify what information is missing even when they can solve the fully-specified version. AR-Bench extends this from identification to acquisition: even when the model has the opportunity to ask questions, it cannot formulate effective ones. The deficit spans the full pipeline — detection, formulation, and iterative refinement of information needs.
The connection to Why do RL agents stop asking informative questions? is structural: both describe systems that fail to escape low-information states. Self-locking describes the mechanism (weak belief tracking creates a trap); AR-Bench measures the behavioral consequence (plateau in information gain despite continued interaction).
The early plateau pattern also resonates with Does more thinking time always improve reasoning accuracy? — both reveal non-monotonic returns to continued processing, whether through more thinking tokens or more interaction rounds. The mechanism differs (overthinking vs. question quality degradation) but the failure mode is analogous: more compute/interaction without better strategy yields diminishing or negative returns.
Building on Can models learn to ask clarifying questions instead of guessing?, the AR-Bench results suggest that even proactive critical thinking may be insufficient: the bottleneck is not willingness to ask but the ability to ask well.
Source: Reasoning Methods · CoT · ToT
Related concepts in this collection
- Can models identify what information they actually need? When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly. Relation: AR-Bench extends from identifying missing information to acquiring it through interaction; both capabilities are deficient.
- Why do RL agents stop asking informative questions? RL-trained agents often fail to seek information effectively, despite being trained to do so. Understanding whether this reflects a capability gap or a training-dynamics problem could reveal how to unlock better information-seeking behavior. Relation: structural parallel; both describe a failure to escape low-information states.
- Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or whether there's a point beyond which additional reasoning becomes counterproductive. Relation: analogous plateau; more interaction rounds, like more thinking tokens, yield diminishing returns without better strategy.
- Can models learn to ask clarifying questions instead of guessing? Explores whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted. Relation: AR-Bench challenges whether proactive asking is sufficient; question quality, not willingness, is the bottleneck.
- Why do language models fail in gradually revealed conversations? Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue. Relation: related multi-turn failure; premature assumptions prevent effective information gathering.
Original note title: active reasoning through interaction is dramatically harder than passive reasoning; models plateau early and ask vague, repetitive questions.