Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Why does asking models to think first hurt performance?

Prompting a model to generate internal thoughts before responding initially degrades instruction-following performance. What reverses this harm, and can thinking become useful beyond math and logic?

Note · 2026-02-23 · sourced from Cognitive Models Latent

Thought Preference Optimization (TPO) reveals a counterintuitive dynamic in three stages:

Stage 1: Thinking hurts. An instruction-tuned model prompted to write internal thoughts before responding performs worse than the same model responding directly. This aligns with meta-analysis findings that CoT prompting only helps math and logic tasks. For general instruction following (creative writing, planning, understanding complex instructions), initial thoughts are not just unhelpful — they actively degrade performance. The instruction-tuned model has been heavily optimized for direct responses, and inserting unoptimized thoughts disrupts that optimization.
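
As a concrete picture of the Stage 1 comparison, a minimal sketch of the two prompting modes follows. The thought-prompt wording here is an illustrative assumption, not the paper's actual template; the point is only the contrast between answering directly and thinking first.

```python
# Sketch of the Stage 1 setup: the same instruction is answered either
# directly or after a hidden "internal thoughts" section.
# The prompt wording is hypothetical, for illustration only.

DIRECT_PROMPT = "{instruction}"

THOUGHT_PROMPT = (
    "Respond to the instruction below. First write your internal thoughts "
    "(plans, drafts, analysis of the request), then write the final answer "
    "after a line that says 'Response:'.\n\n"
    "Instruction: {instruction}"
)

def build_prompt(instruction: str, think_first: bool) -> str:
    """Return the direct prompt or the thought-eliciting prompt for one instruction."""
    template = THOUGHT_PROMPT if think_first else DIRECT_PROMPT
    return template.format(instruction=instruction)
```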

Stage 2: RL teaches useful thinking. Through iterative RLAIF training, the model learns to generate thoughts that actually improve responses. The key design: a standard judge model evaluates only the response, never seeing the hidden thoughts. This forces the model to develop thoughts that produce better responses rather than thoughts that look good to an evaluator. No human-curated thoughts or specialized thought-judge required.
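
A hedged sketch of what one training round might look like under these constraints. Here `model.generate` and `judge.score` are hypothetical stand-ins for the policy and the response-only judge, and `build_prompt` is the helper from the sketch above; what the sketch preserves is the design point from the note, that the judge scores only the response while the preference pair keeps the full thought-plus-response output.

```python
# One TPO-style round, sketched under stated assumptions (not the paper's code).

def split_thought_and_response(output: str) -> tuple[str, str]:
    """Separate the hidden thought from the user-visible response,
    assuming the 'Response:' convention from the prompt sketch above."""
    thought, _, response = output.partition("Response:")
    return thought.strip(), response.strip()

def tpo_round(model, judge, instructions, num_samples: int = 4):
    """Collect (prompt, chosen, rejected) triples for one preference-optimization step."""
    preference_pairs = []
    for instruction in instructions:
        prompt = build_prompt(instruction, think_first=True)
        outputs = [model.generate(prompt) for _ in range(num_samples)]

        # The judge never sees the thoughts -- only the extracted responses.
        scored = []
        for full_output in outputs:
            _thought, response = split_thought_and_response(full_output)
            scored.append((judge.score(instruction, response), full_output))

        scored.sort(key=lambda item: item[0])
        rejected, chosen = scored[0][1], scored[-1][1]
        preference_pairs.append((prompt, chosen, rejected))

    # These triples would then feed a standard DPO-style update on the policy.
    return preference_pairs
```

Because the chosen and rejected outputs retain the hidden thoughts while only the responses are scored, the sole way for thoughts to earn preference is by producing better responses.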

Stage 3: Broad utility emerges. After training, thinking improves performance across general instruction-following tasks — not just math and logic. Internal thoughts serve planning (overall structure and characters for creative writing), instruction comprehension (parsing complex user requests), and strategy selection.

Two design principles matter. First, hiding the thoughts from the judge avoids the need for a thought-evaluation model — which would be inherently challenging since human thoughts are poorly documented and may not transfer to LLM thinking. Second, allowing thoughts to take "uninteresting" forms (making mistakes, drafting and evaluating responses, trying to understand the question) is essential — these forms would typically be pruned by a thought-evaluating judge but are precisely what makes thoughts useful.

This connects directly to Does RL teach reasoning or just when to use it?: TPO provides concrete evidence that RL teaches when and how to deploy internal reasoning for different task types, not the reasoning capability itself. The capability was already present (the model could generate thoughts from the start) — what was missing was the optimization signal for making thoughts serve responses.

The overthinking connection is also important. As with Does more thinking time actually improve LLM reasoning?, TPO demonstrates the mechanism: unoptimized thinking actively hurts, and only RL-trained thinking helps. The quality of thinking matters more than its quantity.


Source: Cognitive Models Latent

