Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does extended thinking help or hurt model reasoning?

Explores whether activating thinking mode improves reasoning performance, and what role training plays in determining whether extended internal reasoning chains are productive or counterproductive.

Note · 2026-02-22 · sourced from Conversation Agents
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

The proactive critical thinking experiments reveal a striking interaction between training and inference-time reasoning. For vanilla (off-the-shelf) models, activating "thinking mode" — the extended internal reasoning chains used by models like Qwen3 — actually degrades performance on proactive critical thinking tasks. The extended thinking "appears to induce counterproductive self-doubt rather than useful analysis, leading to a clear drop in performance."
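
To make "activating thinking mode" concrete at inference time, here is a minimal sketch of running the same prompt with extended thinking switched off and on, assuming a Qwen3 checkpoint and its documented `enable_thinking` chat-template flag via the Hugging Face transformers API; the model name, prompt, and generation settings are illustrative, not taken from the source.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # illustrative Qwen3 checkpoint with a thinking mode
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

def answer(prompt: str, thinking: bool) -> str:
    """Generate one response with extended thinking switched on or off."""
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,  # Qwen3's switch for extended internal reasoning
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1024)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)

# Compare the same proactive-critical-thinking-style prompt under both settings.
prompt = "The report claims revenue doubled, but the table shows a 30% rise. What should I check?"
print(answer(prompt, thinking=False))
print(answer(prompt, thinking=True))
```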

But after RL training on proactive critical thinking tasks, the same thinking mode becomes beneficial. Training fundamentally changes how models use their internal reasoning. The effect is not merely about more or less thinking; it is about the quality and direction of that thinking.
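
The note does not say which RL algorithm was used, so the following is only an illustration of the kind of training signal involved: a group-relative (GRPO-style) policy-gradient sketch, assuming a scalar task reward such as 1 when the model correctly flags the gap and 0 otherwise. All function names, shapes, and the choice of algorithm are assumptions.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each group's rewards by its own
    mean and std. One group = several sampled responses to the same prompt.
    rewards shape: (num_prompts, group_size)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def policy_gradient_loss(logprobs: torch.Tensor,
                         advantages: torch.Tensor,
                         response_mask: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style loss: each sampled response's summed token log-probs,
    weighted by its group-relative advantage.
    logprobs, response_mask shape: (num_prompts, group_size, seq_len)."""
    seq_logprob = (logprobs * response_mask).sum(dim=-1)
    return -(advantages.detach() * seq_logprob).mean()
```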

The finding connects to several established insights but adds a distinct mechanism:

As argued in "Does RL teach reasoning or just when to use it?", RL manages the timing of reasoning. The proactive thinking result extends this: RL also manages the mode of reasoning, redirecting extended thinking from unproductive self-doubt toward productive gap analysis.

A related SFT finding adds nuance: when SFT data is self-generated by the model, it "does not inherently enhance its capabilities" and may reduce output entropy, constraining the subsequent RL phase. This echoes "Does policy entropy collapse limit reasoning performance in RL?": SFT-then-RL may face the same entropy collapse that pure RL faces, but through a different mechanism (entropy reduction from imitating self-generated data rather than from RL convergence).
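
One way to make the entropy argument concrete is to monitor the policy's mean token-level entropy over the course of SFT and RL; a falling curve is the collapse described above. A minimal sketch, assuming access to the model's logits and a mask over the generated tokens (both names are illustrative):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy of the policy's next-token distribution.
    logits: (batch, seq_len, vocab); mask: (batch, seq_len), 1 on generated tokens.
    Track this across training steps: a steadily shrinking value signals
    the output-entropy reduction discussed above."""
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)  # (batch, seq_len)
    return (entropy * mask).sum() / mask.sum()
```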

The practical implication: extended thinking is not a universal good. It is a resource that can be directed productively or destructively, and the direction depends on training. "More thinking" applied to a model without the right training signal may systematically make things worse.


Source: Conversation Agents

Original note title: RL training transforms thinking mode from counterproductive self-doubt into beneficial proactive analysis — the same mechanism helps or hurts depending on training.