How does proactive critical thinking enable models to identify missing information?
This explores how training models to think critically *before* answering — rather than just thinking harder — lets them notice what a problem is missing instead of plowing ahead and guessing.
The headline result is striking: reinforcement learning pushed proactive critical-thinking accuracy on deliberately flawed math problems from a near-zero 0.15% to nearly 74% (Can models learn to ask clarifying questions instead of guessing?). But the more interesting story is *why* models are so bad at this to begin with. Solving a problem and noticing what is missing from it turn out to be two different skills. Models that ace fully specified reasoning tasks drop to 40–50% accuracy once one variable is quietly withheld and they have to work out which clarifying question to ask (Can models identify what information they actually need?). Being a strong solver doesn't make a model a good detector of gaps.
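The split between solving and gap detection can be illustrated with a toy example. This is only a sketch under stated assumptions: the task (distance = speed × time), the variable names, and the functions `solve`, `find_gap`, and `answer_or_ask` are all hypothetical and are not taken from any of the cited work.

```python
from typing import Optional

# Hypothetical toy task: distance = speed * time.
REQUIRED = ("speed_kmh", "time_h")


def solve(problem: dict) -> float:
    """Naive solver: silently guesses 1.0 for any value that is missing."""
    speed = problem.get("speed_kmh", 1.0)  # silent guess if withheld
    time = problem.get("time_h", 1.0)      # silent guess if withheld
    return speed * time


def find_gap(problem: dict) -> Optional[str]:
    """Gap detector: name of the first missing variable, or None if complete."""
    for name in REQUIRED:
        if name not in problem:
            return name
    return None


def answer_or_ask(problem: dict) -> str:
    """Check for gaps *before* solving; ask instead of guessing."""
    missing = find_gap(problem)
    if missing is not None:
        return f"ASK: what is {missing}?"
    return f"ANSWER: {solve(problem)} km"


print(answer_or_ask({"speed_kmh": 60.0, "time_h": 2.0}))  # ANSWER: 120.0 km
print(answer_or_ask({"speed_kmh": 60.0}))                 # ASK: what is time_h?
```

Note that `solve` on its own happily returns a number for the incomplete problem. The solver is not wrong about arithmetic; it simply never checks whether the question can be answered, which is the separate skill the paragraph above describes.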
What changes with training isn't the amount of thinking; it's its *character*. In untrained models, extended 'thinking mode' actually backfires, spiraling into self-doubt that degrades performance; RL redirects that same machinery toward productive gap analysis (Does extended thinking help or hurt model reasoning?). That's why the capability is described as learnable but fragile: simply giving a model without RL training more inference-time compute made gap detection *worse*, and extra compute helped only after RL had reshaped how the model spends those tokens (Can models learn to ask clarifying questions instead of guessing?). More thinking is not free either: accuracy peaks and then declines past a token threshold. In one benchmark result, raising thinking tokens from roughly 1,100 to roughly 16K lowered accuracy from 87.3% to 70.3%, with models overthinking easy problems and underthinking hard ones (Does more thinking time always improve reasoning accuracy?).
The corpus also shows there's more than one route to the same behavior. You don't necessarily need to train explicitly on flawed problems: social meta-learning (SML) instills the meta-strategy of treating conversation as an information source, so models trained only on *complete* problems still generalize to underspecified ones by asking for what they need and delaying their answer (Can models learn to ask clarifying questions without explicit training?). A different angle skips asking entirely and lets generation surface the gap: in ITER-RETGEN, a model's own partial answer reveals information needs the original query couldn't express, and that answer is fed back as a fresh retrieval query (Can a model's partial response guide what to retrieve next?). Identifying missing information, it turns out, can happen by asking, by retrieving, or by noticing the holes in your own draft.
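The retrieve-from-your-own-draft idea can be sketched as a small loop. This is a toy illustration, not the ITER-RETGEN implementation: the corpus, the word-overlap retriever, the stopword list, and the stub `generate` function (which just concatenates retrieved text in place of a real model) are all assumptions made for the example.

```python
import re
from typing import List, Optional

# Made-up corpus for illustration only.
CORPUS = [
    "The Eiffel Tower is in Paris.",
    "Paris is the capital of France.",
    "France uses the euro as its currency.",
]
STOPWORDS = {"the", "is", "in", "of", "as", "its", "which", "where"}


def tokens(text: str) -> set:
    """Lowercased content words of a text."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def overlap(query: str, doc: str) -> int:
    """Number of content words shared by query and document."""
    return len(tokens(query) & tokens(doc))


def retrieve(query: str, seen: List[str]) -> Optional[str]:
    """Best-overlapping unseen document, or None if nothing overlaps."""
    candidates = [d for d in CORPUS if d not in seen and overlap(query, d) > 0]
    return max(candidates, key=lambda d: overlap(query, d)) if candidates else None


def generate(question: str, docs: List[str]) -> str:
    """Stub for a model call: the 'draft answer' is just the evidence so far."""
    return " ".join(docs)


def iterative_retrieve(question: str, rounds: int = 3) -> List[str]:
    """Alternate retrieval and generation; each draft extends the next query."""
    docs: List[str] = []
    query = question
    for _ in range(rounds):
        doc = retrieve(query, docs)
        if doc is None:
            break
        docs.append(doc)
        # The draft surfaces terms the original question never mentioned.
        query = question + " " + generate(question, docs)
    return docs
```

For the question "Which currency is used where the Eiffel Tower is?", the sentence about Paris being the capital of France shares no content words with the question, so a single retrieval pass cannot rank it. Once the first draft mentions Paris, the enlarged query reaches it, and the next draft in turn connects to France and the euro.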
There's a structural failure lurking underneath all of this. Reasoning models tend to 'wander' and abandon promising paths prematurely, exploring like tourists rather than scientists, which means the building blocks of good gap detection (committing to a line of inquiry, recognizing when it's incomplete) are exactly what untrained reasoning lacks (Why do reasoning models abandon promising solution paths?). Training on messy search processes that include mistakes and backtracking yields 25% higher accuracy than training on optimal trajectories alone, suggesting that exposure to the *experience* of incomplete information teaches models to handle it (Does training on messy search processes improve reasoning?).
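The source note on wandering mentions decoding-level interventions such as thought-switching penalties. A minimal sketch of what such a penalty might look like follows; it is not the cited method. The function name, the dictionary-of-scores representation, the choice of "Alternatively" as a switch token, and the values of `penalty` and `min_thought_len` are all hypothetical.

```python
from typing import Dict, Set


def penalize_switch_tokens(
    logits: Dict[str, float],
    switch_tokens: Set[str],
    tokens_in_thought: int,
    penalty: float = 3.0,       # hypothetical strength
    min_thought_len: int = 20,  # hypothetical window, in tokens
) -> Dict[str, float]:
    """Lower the scores of thought-switching tokens while the current
    line of reasoning is still shorter than min_thought_len tokens."""
    if tokens_in_thought >= min_thought_len:
        return dict(logits)  # thought is long enough; leave scores alone
    return {
        tok: (score - penalty if tok in switch_tokens else score)
        for tok, score in logits.items()
    }


scores = {"Alternatively": 2.0, "so": 1.5}
early = penalize_switch_tokens(scores, {"Alternatively"}, tokens_in_thought=5)
print(max(early, key=early.get))  # so
```

Early in a thought, the penalty makes continuing ("so") outrank switching ("Alternatively"); once the thought has run past the window, the scores are returned unchanged and switching is allowed again.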
Worth knowing for anyone trying to engineer this cheaply: the easy levers mostly don't work. Telling a model it's being watched doesn't make its reasoning more faithful (Does telling models they are watched improve reasoning faithfulness?). Structured prompting can sharpen a related skill: staged prompting lifts cognitive-distortion detection by more than 10% over zero-shot prompting by separating assessment from analysis (Can structured prompting improve cognitive distortion detection?). But the deeper result stands: proactively spotting what's missing is a trained disposition, not a prompt you can bolt on.
Sources (10 notes)
Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.
Models achieving high accuracy on complete reasoning tasks drop to 40-50% accuracy identifying what clarifying question to ask when one variable is withheld. Information gathering and problem execution are separable cognitive operations.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Models trained via SML on complete problems generalize to underspecified tasks by asking for needed information and delaying answers. The training paradigm instills a meta-strategy of using conversation as an information source, addressing the premature-answering failure mode.
ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.
Telling models that their reasoning is monitored has no effect on hint omission rates. This suggests CoT generation is not modulated by perceived social context, ruling out prompt-engineering fixes and certain safety monitoring assumptions.
DoT prompting separates subjectivity assessment, contrastive reasoning, and schema analysis to achieve 10%+ improvement over zero-shot ChatGPT. Expert evaluators rated the resulting explanations as clinically useful for case formulation.