Does extended thinking help or hurt model reasoning?
Explores whether activating thinking mode improves reasoning performance, and what role training plays in determining whether extended internal reasoning chains are productive or counterproductive.
The proactive critical thinking experiments reveal a striking interaction between training and inference-time reasoning. For vanilla (off-the-shelf) models, activating "thinking mode" — the extended internal reasoning chains used by models like Qwen3 — actually degrades performance on proactive critical thinking tasks. The extended thinking "appears to induce counterproductive self-doubt rather than useful analysis, leading to a clear drop in performance."
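To make the vanilla-model comparison concrete, here is a minimal sketch of how one might probe the effect with an off-the-shelf Qwen3 model. `enable_thinking` is Qwen3's documented chat-template flag for toggling the extended reasoning phase; the model size, prompt, and generation settings here are illustrative assumptions, not the source's setup.

```python
# Minimal sketch: compare a Qwen3 model's behavior with thinking mode on vs. off.
# The underspecified prompt and scoring-by-inspection are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # assumption: any Qwen3 chat checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def generate(prompt: str, thinking: bool) -> str:
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,  # toggles the <think>...</think> phase
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1024)
    return tokenizer.decode(
        out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True
    )

# Run the same deliberately underspecified request both ways and inspect
# whether the model surfaces the missing information (proactive critical
# thinking) or second-guesses itself in the extended reasoning chain.
prompt = "Book me a flight to the conference next month."
for thinking in (False, True):
    print(f"--- thinking={thinking} ---")
    print(generate(prompt, thinking))
```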
But after RL training on proactive critical thinking tasks, the same thinking mode becomes beneficial. Training fundamentally changes how models use their internal reasoning. The issue is not merely more or less thinking; it is the quality and direction of the thinking.
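As an illustration of what such training could look like, here is a heavily simplified sketch of a reward signal for proactive critical thinking. The task structure, gap labels, and keyword-matching heuristic are all hypothetical stand-ins rather than the source paper's method; a real setup would score gap identification with a judge model or structured annotations.

```python
# Illustrative only: a toy reward for "proactive critical thinking" RL,
# assuming each training prompt is labeled with the information it
# deliberately omits. The reward pays for gap analysis, not self-doubt.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str                # deliberately underspecified user request
    required_gaps: list[str]   # facts the agent should ask about before acting

def proactive_reward(task: Task, response: str) -> float:
    """Fraction of the labeled gaps the response surfaces (1.0 = all)."""
    text = response.lower()
    found = sum(1 for gap in task.required_gaps if gap.lower() in text)
    return found / len(task.required_gaps)

task = Task(
    prompt="Book me a flight to the conference next month.",
    required_gaps=["departure city", "destination", "exact dates"],
)
print(proactive_reward(
    task, "Happy to help. What are your departure city, destination, and exact dates?"
))  # -> 1.0
```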
The finding connects to several established insights but adds a distinct mechanism:
As Does RL teach reasoning or just when to use it? argues, RL manages the timing of reasoning. The proactive thinking result extends this: RL also manages the mode of reasoning, redirecting extended thinking away from unproductive self-doubt and toward productive gap analysis.
A related SFT finding adds nuance: when the SFT data is self-generated by the model, it "does not inherently enhance its capabilities" and may reduce output entropy, constraining the subsequent RL phase. This echoes Does policy entropy collapse limit reasoning performance in RL? — SFT-then-RL may face the same entropy collapse that pure RL faces, but through a different mechanism: entropy reduction from imitating self-generated outputs rather than from RL convergence.
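One way to see the mechanism is to measure the policy's average per-token entropy before and after the SFT phase. Below is a minimal sketch, assuming a Hugging Face causal LM; the helper name and evaluation prompts are assumptions for illustration.

```python
import torch.nn.functional as F

def mean_token_entropy(model, tokenizer, prompts, max_new_tokens=256):
    """Average per-token entropy (nats) of the model's sampling distribution."""
    entropies = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            output_scores=True,           # keep logits for every generated step
            return_dict_in_generate=True,
        )
        for step_logits in out.scores:    # one (batch, vocab) tensor per step
            probs = F.softmax(step_logits, dim=-1)
            h = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
            entropies.append(h.item())
    return sum(entropies) / len(entropies)

# Compare the statistic across training stages; the note's claim predicts
# the SFT'd model scores lower, leaving the RL phase less room to explore:
#   mean_token_entropy(sft_model, tok, eval_prompts)
#     < mean_token_entropy(base_model, tok, eval_prompts)
```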
The practical implication: extended thinking is not a universal good. It is a resource that can be directed productively or destructively, and the direction depends on training. "More thinking" applied to a model without the right training signal may systematically make things worse.
Source: Conversation Agents
Related concepts in this collection
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  Relation: RL manages timing; this paper shows RL also manages the quality and direction of reasoning.
- Can models learn when to think versus respond quickly?
  Can a single LLM learn to adaptively choose between extended reasoning and concise responses based on task complexity? This matters because it could optimize compute efficiency without sacrificing accuracy on hard problems.
  Relation: DeGRPO mode selection; proactive thinking adds a training-mediated quality dimension.
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  Relation: SFT-then-RL may face entropy collapse through imitation of self-generated data.
- What critical thinking skills do reasoning models actually lose?
  Step-by-step reasoning training optimizes narrow deductive thinking while degrading meta-cognitive abilities like recognizing futile thinking and maintaining tentative reasoning. Understanding this tradeoff matters for deploying reasoning models reliably.
  Relation: the thinking-mode reversal is a specific instance of this broader problem. Reasoning training optimizes one narrow type of thinking while degrading others; the proactive thinking result shows RL can selectively repair one form of degradation (self-doubt → gap analysis), while the critical thinking post documents the broader pattern.
Original note title: "rl training transforms thinking mode from counterproductive self-doubt into beneficial proactive analysis — the same mechanism helps or hurts depending on training"