LLM Reasoning and Architecture · Reinforcement Learning for LLMs · Design & LLM Interaction

Why do models fail at asking good questions during interaction?

When models must actively seek information through questions rather than receive it passively, they struggle dramatically. This note explores why GPT-4o plateaus at 35% accuracy on such tasks and whether training or prompting can repair the underlying deficit.

Note · 2026-04-18 · sourced from Reasoning Methods CoT ToT

AR-Bench introduces a critical distinction: passive reasoning (all information given, solve the problem) versus active reasoning (information must be sought through interaction). This distinction exposes a capability gap that standard benchmarks completely miss.

The results are stark. On number guessing — a task with well-defined information-theoretic structure — GPT-4o achieves only 35%. The information gain curve reveals why: models extract 7.7% information gain in rounds 5-10, but this drops to just 2.5% in rounds 20-25. More interaction does not proportionally reduce uncertainty. The models plateau because they cannot formulate increasingly precise questions — they ask vague, repetitive queries that fail to efficiently partition the remaining hypothesis space.
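The information-theoretic framing can be made concrete with a small sketch (my own illustration, not from AR-Bench): for a uniform hypothesis space, a yes/no question's expected information gain is the prior entropy minus the expected posterior entropy of the partition it induces. A precise question that halves the space gains a full bit per round; a vague question that rules out only a sliver gains almost nothing, which is exactly the plateau pattern.

```python
import math

def entropy(n):
    """Entropy in bits of a uniform distribution over n hypotheses."""
    return math.log2(n) if n > 0 else 0.0

def expected_info_gain(space_size, yes_count):
    """Expected information gain (bits) of a yes/no question that splits a
    uniform hypothesis space of `space_size` into `yes_count` vs. the rest."""
    no_count = space_size - yes_count
    p_yes = yes_count / space_size
    expected_posterior = p_yes * entropy(yes_count) + (1 - p_yes) * entropy(no_count)
    return entropy(space_size) - expected_posterior

# A precise question halves the space: 1 bit per round.
print(expected_info_gain(128, 64))        # → 1.0
# A vague question that isolates only 4 of 128 candidates gains ~0.2 bits.
print(round(expected_info_gain(128, 4), 3))
```

Under this toy model, a model that keeps asking sliver-isolating (or repeated, zero-gain) questions will show exactly the measured decay in per-round information gain, even as interaction rounds accumulate.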

What makes this finding particularly damaging is the intervention analysis. SFT, DPO, Tree-of-Thought, human-written instructions, Proactive CoT, and Uncertainty-of-Thought (UoT) all provide minimal benefit. The active reasoning deficit is not a prompting problem or a fine-tuning problem — it appears to be a structural limitation in how current models represent and reduce uncertainty through sequential interaction.

This connects directly to Can models identify what information they actually need?, which showed models cannot identify what information is missing even when they can solve the fully-specified version. AR-Bench extends this from identification to acquisition: even when the model has the opportunity to ask questions, it cannot formulate effective ones. The deficit spans the full pipeline — detection, formulation, and iterative refinement of information needs.

The connection to Why do RL agents stop asking informative questions? is structural: both describe systems that fail to escape low-information states. Self-locking describes the mechanism (weak belief tracking creates a trap); AR-Bench measures the behavioral consequence (plateau in information gain despite continued interaction).

The early plateau pattern also resonates with Does more thinking time always improve reasoning accuracy? — both reveal non-monotonic returns to continued processing, whether through more thinking tokens or more interaction rounds. The mechanism differs (overthinking vs. question quality degradation) but the failure mode is analogous: more compute/interaction without better strategy yields diminishing or negative returns.

Building on Can models learn to ask clarifying questions instead of guessing?, the AR-Bench results suggest that even proactive critical thinking may be insufficient — the bottleneck is not willingness to ask but ability to ask well.



Related concepts in this collection

Concept map
14 direct connections · 130 in 2-hop network · dense cluster



Active reasoning through interaction is dramatically harder than passive reasoning — models plateau early and ask vague, repetitive questions.