Why does asking models to think first hurt performance?
Prompting a model to write internal thoughts before responding initially degrades instruction-following performance. What reverses this harm, and can thinking become useful beyond math and logic?
Thought Preference Optimization (TPO) reveals a counterintuitive dynamic in three stages:
Stage 1: Thinking hurts. An instruction-tuned model prompted to write internal thoughts before responding performs worse than the same model responding directly. This aligns with meta-analysis findings that CoT prompting helps only on math and logic tasks. For general instruction following (creative writing, planning, understanding complex instructions), initial thoughts are not just unhelpful; they actively degrade performance. The instruction-tuned model has been heavily optimized for direct responses, and inserting unoptimized thoughts disrupts that optimization.
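To make the Stage 1 comparison concrete, here is a minimal sketch of the two prompting conditions: the same instruction sent directly versus wrapped in a generic think-first template. The template wording and the `<response>` delimiter are illustrative assumptions, not TPO's actual prompt.

```python
# Sketch of the Stage 1 comparison: direct prompting vs. prompting for
# internal thoughts first. Template wording and the "<response>" delimiter
# are illustrative assumptions, not the exact TPO prompt.

DIRECT_TEMPLATE = "{instruction}"

THOUGHT_TEMPLATE = (
    "Respond to the instruction below. First write out your internal thoughts "
    "(planning, drafts, self-evaluation), then give your final answer after a "
    "line containing only '<response>'. Only the final answer is shown to the "
    "user.\n\nInstruction: {instruction}"
)

def build_prompt(instruction: str, think_first: bool) -> str:
    """Return the prompt for one of the two Stage 1 conditions."""
    template = THOUGHT_TEMPLATE if think_first else DIRECT_TEMPLATE
    return template.format(instruction=instruction)
```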
Stage 2: RL teaches useful thinking. Through iterative RLAIF training, the model learns to generate thoughts that actually improve responses. The key design: a standard judge model evaluates only the response, never seeing the hidden thoughts. This forces the model to develop thoughts that produce better responses rather than thoughts that look good to an evaluator. No human-curated thoughts or specialized thought-judge required.
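A minimal sketch of how one such training iteration could be wired up under these constraints, assuming a sample_fn that draws thought-plus-response completions from the current policy and a judge_fn that scores an (instruction, response) pair; the function names, delimiter, and pair construction are illustrative, not the paper's code.

```python
from typing import Callable

DELIMITER = "<response>"  # assumed marker separating hidden thought from visible answer

def visible_response(completion: str) -> str:
    """Return only the part of a completion that the judge (and user) sees."""
    thought, sep, response = completion.partition(DELIMITER)
    # If the model failed to emit the delimiter, treat the whole text as the response.
    return response.strip() if sep else completion.strip()

def build_preference_pairs(
    instructions: list[str],
    sample_fn: Callable[[str], str],        # draws one thought+response completion
    judge_fn: Callable[[str, str], float],  # scores (instruction, response) only
    k: int = 4,
) -> list[dict]:
    """One TPO-style iteration (sketch): sample k completions per instruction,
    score only the visible responses, and keep the highest- and lowest-scoring
    full completions (thoughts included) as a chosen/rejected pair for
    preference optimization."""
    pairs = []
    for instruction in instructions:
        candidates = [sample_fn(instruction) for _ in range(k)]
        # The judge never sees the thought; only the response part is scored.
        scored = sorted(
            candidates, key=lambda c: judge_fn(instruction, visible_response(c))
        )
        # Preference training is applied to the full thought+response text, so
        # thoughts are optimized only through their effect on the response score.
        pairs.append({"prompt": instruction, "chosen": scored[-1], "rejected": scored[0]})
    return pairs
```

Because the reward signal reaches the thought only through the visible response, the thought text is free to take whatever form best improves the answer, which is the point developed in the design-principles paragraph below.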
Stage 3: Broad utility emerges. After training, thinking improves performance across general instruction-following tasks — not just math and logic. Internal thoughts serve planning (overall structure and characters for creative writing), instruction comprehension (parsing complex user requests), and strategy selection.
Two design principles matter. First, hiding the thoughts from the judge avoids the need for a thought-evaluation model — which would be inherently challenging since human thoughts are poorly documented and may not transfer to LLM thinking. Second, allowing thoughts to take "uninteresting" forms (making mistakes, drafting and evaluating responses, trying to understand the question) is essential — these forms would typically be pruned by a thought-evaluating judge but are precisely what makes thoughts useful.
This connects directly to Does RL teach reasoning or just when to use it?: TPO provides concrete evidence that RL teaches when and how to deploy internal reasoning for different task types, rather than creating the reasoning capability itself. The capability was already present (the model could generate thoughts from the start); what was missing was the optimization signal for making thoughts serve responses.
The overthinking connection is also important. Where Does more thinking time actually improve LLM reasoning? questions whether extra thinking time pays off, TPO demonstrates the mechanism behind the failures: unoptimized thinking actively hurts, and only RL-trained thinking helps. The quality of thinking matters more than its quantity.
Source: Cognitive Models Latent
Related concepts in this collection
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  TPO provides direct evidence: RL teaches deployment of thinking, not thinking itself.
- Does more thinking time actually improve LLM reasoning?
  The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
  TPO shows unoptimized thinking hurts; this explains why more thinking can degrade performance.
- When does explicit reasoning actually help model performance?
  Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
  TPO overcomes this split: RL-trained thinking helps across task types.
- How does thinking emerge from policy selection in RL?
  Explores whether thinking is fundamentally about selecting between existing sub-policies rather than building new reasoning from scratch. This matters for understanding how RL training unlocks latent capabilities in language models.
  Related mechanism: thinking emerges when RL provides selection pressure.
- Can models learn when to think versus respond quickly?
  Can a single LLM learn to adaptively choose between extended reasoning and concise responses based on task complexity? This matters because it could optimize compute efficiency without sacrificing accuracy on hard problems.
  TPO is a concrete implementation of this principle.
Original note title: internal thought generation initially degrades performance until rl training adapts thoughts to serve responses — extending thinking beyond math and logic