Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

Paper · arXiv 2507.23407 · Published July 31, 2025
Conversational Agents · Linguistics, NLP, NLU

Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning. However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests. In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to better resolve their queries. To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information. GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions. Experiments on Qwen3 and Llama series models show that, while these models excel in traditional reasoning tasks thanks to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, especially the smaller ones. However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting Qwen3-1.7B's accuracy from 0.15% to 73.98% on GSM-MC. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.

For instance, a patient lacking medical knowledge might omit critical symptoms, preventing an AI doctor from making a precise diagnosis (Alkaabi and Elsori, 2025). Some prior work (Rahman et al., 2024; Kirichenko et al., 2025) has acknowledged this issue, advocating for critical thinking in LLMs, which refers to the ability to reject unanswerable or flawed requests instead of attempting to process biased or incomplete inputs. Yet, we argue that this form of critical thinking remains passive, as it still relies on users to independently identify and rectify gaps in their queries, rather than actively facilitating problem-solving.

To address this limitation, we propose proactive critical thinking: a paradigm where the model not only detects unanswerable queries but also provides constructive feedback to guide users in supplying necessary information. As shown in Figure 1, this approach fosters more effective human-AI collaboration, enabling iterative conversations that progressively refine the problem and lead to a solution.

To instill proactive critical thinking capabilities into models, we investigate both supervised fine-tuning (SFT) and reinforcement learning (RL). Using the data preparation pipeline described above, we combine revised unanswerable questions with original questions to construct the training set. Furthermore, we enhance both training paradigms by incorporating a heuristic signal indicating question answerability. This approach effectively increases the diversity of the SFT data and accelerates RL convergence through denser reward signals.
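As a rough illustration of this construction (the function names and the removal step below are our own sketch, not the paper's released code), each original GSM8K question is paired with a revised variant whose key variable has been removed, and both carry an answerability label that later serves as the heuristic signal:

# Minimal sketch of the training-set construction described above (Python).
# `remove_key_variable` stands in for the paper's (unspecified) revision step;
# the `answerable` field is the heuristic answerability label used to
# diversify the SFT data and densify the RL reward.
def build_training_set(gsm8k_examples, remove_key_variable):
    training_set = []
    for ex in gsm8k_examples:
        # Original, fully specified question: answerable.
        training_set.append({"question": ex["question"],
                             "answer": ex["answer"],
                             "answerable": True})
        # Revised question with a key variable deliberately removed;
        # the target behavior is to ask for the missing information.
        revised, missing_info = remove_key_variable(ex["question"])
        training_set.append({"question": revised,
                             "missing_info": missing_info,
                             "answerable": False})
    return training_set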

We evaluate popular Qwen3 (Yang et al., 2025) and Llama (Meta, 2024) series models on our GSM-MC and GSM-MCE benchmarks. The results reveal that despite extensive post-training, these models still struggle with proactive critical thinking, particularly the smaller ones. Notably, while recent inference-time scaling approaches have significantly advanced performance on complex reasoning tasks, we find they can hinder proactive critical thinking capability. However, our training approach successfully enhances proactive critical thinking performance while maintaining accuracy on standard questions across model sizes.

However, Song et al. (2025) demonstrate that critical thinking can be effectively improved through targeted training. Nevertheless, we argue that this approach remains passive and may be of limited use in addressing user requests, as it still requires users to identify and correct errors themselves.

In this work, we introduce proactive critical thinking, enabling models to move beyond mere flaw detection and to actively guide users with clear and targeted feedback. Existing research (Kuhn et al., 2022; Wang et al., 2024; Andukuri et al., 2024; Zhang et al., 2025; Li et al., 2025) has primarily focused on asking clarifying questions in response to ambiguous user requests. However, these approaches often excel only at detecting obvious flaws, such as missing variables in tool usage (Wang et al., 2024), or operate in general conversational settings with limited ambiguity (Andukuri et al., 2024; Zhang et al., 2025). More complex cases requiring deeper reasoning remain under-explored. The most closely related work to ours is the recent COLLABLLM (Wu et al., 2025), which shares the same goal of enhancing human-AI collaboration through multi-turn conversations. In contrast, our work focuses on critical thinking, where an LLM should not only learn to collaborate with humans but also identify flaws and provide feedback to refine the context. To this end, we construct new datasets tailored to this objective and emphasize the role of reasoning in this setting, aspects that are orthogonal to COLLABLLM.

3 Preliminary

We define proactive critical thinking as the ability of a model to actively collaborate with humans rather than passively refusing to respond when receiving flawed inputs.

Proactive Questioning: A Preliminary Exploration of Proactive Critical Thinking In this work, we begin by formalizing the simplest scenario: given a question x that may lack key information, the LLM π first attempts to generate its response y = π(x) through proactive critical thinking. To enable this capability, we augment the input with the following instruction:

Instruction for Activating Proactive Questioning

Question:

[QUESTION]

If the question is answerable, provide the final answer. Otherwise, ask the user for the necessary information by phrasing the request as a question.

If the question x is answerable, the LLM directly provides the solution y = π(x). Otherwise, the model identifies the missing information and proactively generates a follow-up query q = π(x) to request clarification. Upon receiving the user's response a to the query q, the LLM then synthesizes the final solution y = π(x, q, a) using all available information.
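The resulting interaction protocol can be summarized in a short loop. The sketch below is our own illustration, assuming a generate method wrapping the policy π and a crude check for whether the first turn ends in a clarifying question:

# Sketch of the proactive-questioning protocol (Python): the policy either
# answers directly (y = π(x)) or asks a clarifying question (q = π(x)),
# receives the user's reply a, and then answers with y = π(x, q, a).
def proactive_answer(policy, user, question, instruction):
    prompt = f"{instruction}\n\nQuestion:\n{question}"
    first_turn = policy.generate(prompt)       # either y or q
    if not first_turn.strip().endswith("?"):   # answerable: direct solution y
        return first_turn
    reply = user.reply(question, first_turn)   # user's response a to query q
    followup = f"{prompt}\n\nAssistant: {first_turn}\nUser: {reply}"
    return policy.generate(followup)           # final solution y = π(x, q, a)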

Simulating a User with a User Agent In the above setting, a user is required to respond to the LLM’s request. Since it is impractical to involve human participants, we use a strong LLM to simulate the user.
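A minimal way to realize such a user agent (our sketch; the paper does not specify its exact prompt) is to condition a strong LLM on the complete original problem so it can supply any removed variable without solving the problem itself:

# Hypothetical user-agent wrapper (Python). The simulator privately sees the
# complete problem, including the removed variable, and answers the
# assistant's clarifying question as a cooperative user would.
USER_SIM_PROMPT = (
    "You are a user who asked the following (possibly incomplete) question:\n"
    "{shown_question}\n\n"
    "You privately know the full problem:\n{full_question}\n\n"
    "The assistant asked: {assistant_question}\n"
    "Reply with only the requested information; do not solve the problem."
)

class SimulatedUser:
    def __init__(self, llm, full_question):
        self.llm = llm
        self.full_question = full_question  # complete problem, incl. removed variable

    def reply(self, shown_question, assistant_question):
        return self.llm.generate(USER_SIM_PROMPT.format(
            shown_question=shown_question,
            full_question=self.full_question,
            assistant_question=assistant_question))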

Supervised Fine-Tuning The most straightforward approach is to fine-tune the LLM directly on prepared human-AI interaction trajectories.
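Concretely, an SFT trajectory for an unanswerable question might be laid out as a chat-format example like the following (our illustration of the data layout, not the paper's released format):

# Illustrative SFT example (Python): the supervision targets cover both the
# clarifying question and the final answer given after the user's reply.
sft_example = [
    {"role": "user", "content": "Tom buys some apples at $2 each. "
                                "How much does he spend in total?"},
    {"role": "assistant", "content": "How many apples does Tom buy?"},
    {"role": "user", "content": "He buys 5 apples."},
    {"role": "assistant", "content": "Tom spends 5 * $2 = $10 in total. "
                                     "The answer is 10."},
]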

Reinforcement Learning On-policy RL has proven effective in enabling LLMs to independently explore strategies to achieve target objectives. In this work, we adopt the popular GRPO algorithm (Shao et al., 2024), training the model on the same question set used for SFT.
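For reference, the standard GRPO objective of Shao et al. (2024) normalizes rewards within a group of G sampled responses per question x (our enhanced variant additionally folds the answerability heuristic into the reward, which we omit here):

Â_i = (r_i − mean(r_1, …, r_G)) / std(r_1, …, r_G),

J(θ) = E[(1/G) Σ_{i=1}^{G} min(ρ_i Â_i, clip(ρ_i, 1 − ε, 1 + ε) Â_i)] − β D_KL(π_θ ‖ π_ref), where ρ_i = π_θ(y_i | x) / π_θ_old(y_i | x).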

6.2 Main Results

Vanilla models fail to provide effective feedback to flawed prompts. As presented in Table 2, off-the-shelf models struggle with proactive critical thinking when confronted with flawed or ambiguous prompts.

Interestingly, the results for the Qwen3-8B model present an unexpected phenomenon: employing RL alone surpasses the performance of two-stage training. This may arise from the nature of the SFT data, which is self-generated by the Qwen3-8B model and thus does not inherently enhance its capabilities. Moreover, by further reinforcing its original high-probability tokens during SFT, the entropy of the model's outputs may be inadvertently reduced. This could constrain the exploratory nature of the subsequent RL phase, thereby hindering its overall effectiveness.

Training activates a beneficial “thinking mode”. A notable observation from our experiments is that RL fundamentally changes how models use their internal “thinking mode.” For vanilla models, activating the “thinking mode” often degrades performance: the extended thinking appears to induce counterproductive self-doubt rather than useful analysis, leading to a clear drop in accuracy.