Do iterative refinement methods suffer from overthinking?
Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?
The overthinking failure documented in o1-like models (sequential token extension degrades accuracy beyond a critical threshold) has a structural analog in iterative refinement methods like Self-Refine, Reflexion, and other critique-and-revise loops.
Both approaches share the same architecture:
- Generate an initial response
- Produce a critique or evaluation
- Generate a revision based on the critique
- Repeat
In o1-like models, this happens within a single inference call, at the token level. In iterative refinement methods, this happens across multiple inference calls, at the response level. The timescale differs; the structure is identical.
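To make the shared structure concrete, here is a minimal sketch of the critique-revise loop, assuming a hypothetical `llm(prompt) -> str` completion function (not any specific API); Self-Refine runs this loop explicitly across calls, while o1-like models run its analog implicitly within one trace.

```python
# A minimal sketch of the shared critique-revise loop. `llm` is a
# hypothetical completion function, prompt wording is illustrative.

def refine(task: str, llm, max_rounds: int = 4) -> str:
    response = llm(f"Solve the task:\n{task}")
    for _ in range(max_rounds):
        critique = llm(
            f"Task:\n{task}\n\nDraft answer:\n{response}\n\n"
            "Point out any errors. Reply exactly 'OK' if there are none."
        )
        if critique.strip() == "OK":
            break  # critic is satisfied; stop revising
        response = llm(
            f"Task:\n{task}\n\nDraft answer:\n{response}\n\n"
            f"Critique:\n{critique}\n\nRewrite the answer, fixing these issues."
        )
    return response
```

Each pass through the loop adds one more sequential dependency on the model's own previous output; that chain of self-conditioning is where the overthinking risk lives, whether the loop runs over tokens or over calls.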
The empirical evidence predicts the same failure mode: "Does self-revision actually improve reasoning in language models?" shows that within-inference revision tends to hurt. The Self-Refine paper itself reports mixed results: self-reflection improves TruthfulQA performance but decreases performance on HotpotQA. This is exactly what the overthinking literature predicts: revision is helpful when the initial response is factually uncertain and harmful when the task requires multi-step reasoning, where revision introduces noise.
PDR provides a counterexample: iterative refinement CAN avoid overthinking when memory is compressed between iterations. Parallel-Distill-Refine (PDR, from "Reasoning Beyond the Rug") introduces short iterations that read a bounded summary, write a refinement, and re-synthesize a fresh summary. Unlike long CoT or standard iterative refinement, PDR compresses evidence between rounds: the model doesn't carry forward its full reasoning history, only a compact distillation. This breaks the overthinking dynamic: each iteration starts from a compressed state rather than accumulating noise from all previous iterations. PDR outperforms long-trace baselines at matched compute (+11% on AIME 2024, +9% on AIME 2025), showing "evidence accumulation via bounded summaries can substitute for long reasoning traces while holding latency fixed." The insight: the overthinking failure is not inherent to iteration; it is inherent to unbounded accumulation. Compress between rounds and the failure mode disappears. Since ReBalance uses confidence as a continuous indicator to dynamically steer between overthinking and underthinking, PDR's bounded memory and ReBalance's confidence-based steering are complementary solutions to the same underlying problem: preventing reasoning from crossing the quality threshold.
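A sketch of the bounded-memory variant, under the same hypothetical `llm` function; the word budget and prompt wording are illustrative assumptions, not taken from the paper. The key difference from the loop above is that the full trace is never carried forward, only a re-synthesized summary.

```python
# Compression between rounds: each round reads only a bounded summary,
# never the accumulated trace. `llm` and all prompts are assumptions.

def refine_with_bounded_memory(task: str, llm, rounds: int = 4,
                               summary_budget_words: int = 200) -> str:
    summary = ""  # compact distillation of all prior rounds
    answer = ""
    for _ in range(rounds):
        answer = llm(
            f"Task:\n{task}\n\nNotes from earlier attempts (may be empty):\n"
            f"{summary}\n\nProduce your best answer."
        )
        # Re-synthesize a fresh summary instead of appending history.
        summary = llm(
            f"Task:\n{task}\n\nLatest attempt:\n{answer}\n\nOld notes:\n{summary}\n\n"
            f"Write at most {summary_budget_words} words of notes: agreements, "
            "contradictions, verified intermediate results, open subgoals."
        )
    return answer
```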
The parallel alternative applies at both timescales. Instead of sequential revision (iterate until convergence), generate multiple independent candidates in parallel and aggregate by majority vote. The argument in "Why does majority voting outperform more complex inference methods?" applies equally to iterative refinement: diverse independent reasoning beats iterated single-path refinement.
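The parallel alternative is even simpler to sketch. Here `extract_answer` is an assumed normalizer (e.g. pulling the final number out of a trace), and `llm` must sample with nonzero temperature so the candidates actually differ.

```python
# Sample N independent candidates, then majority-vote over the
# normalized answers. No candidate conditions on any other.

from collections import Counter

def majority_vote(task: str, llm, extract_answer, n: int = 8) -> str:
    candidates = [llm(f"Solve the task:\n{task}") for _ in range(n)]
    answers = [extract_answer(c) for c in candidates]
    return Counter(answers).most_common(1)[0][0]
```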
This connection bridges the test-time scaling batch to the Self Refinement literature. The overthinking insight isn't just about thinking tokens — it's about sequential-over-parallel as a general failure mode that appears at any timescale.
PDR as a fix: The Parallel-Distill-Refine framework addresses iterative refinement's core failure by introducing a bounded distillation workspace between iterations. Instead of appending all prior attempts to context (recreating long-context failures) or forgetting them (losing progress), PDR generates a compact summary listing agreements, contradictions, intermediate results, and open subgoals. Each new iteration starts fresh but with accumulated wisdom. The four meta-skills required — verification, refinement, compression, and diversification — map directly to the failure modes: anchoring bias (addressed by diversification), forgetfulness (addressed by compression), and noise injection (addressed by verification). RL training to make the model consistent with PDR as inference method further narrows the train-test gap.
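One way to picture the distillation workspace is as a capped, structured record rather than free text. The field names below follow the four kinds of content the summary is described as keeping; the class itself is an illustrative assumption, not the paper's implementation.

```python
# An assumed, illustrative data structure for PDR's bounded workspace:
# structured fields are easier to cap and verify than a raw summary string.

from dataclasses import dataclass, field

@dataclass
class DistilledWorkspace:
    agreements: list[str] = field(default_factory=list)       # claims most candidates share
    contradictions: list[str] = field(default_factory=list)   # disagreements to resolve
    partial_results: list[str] = field(default_factory=list)  # verified intermediate values
    open_subgoals: list[str] = field(default_factory=list)    # what the next round should attack

    def render(self, max_items: int = 5) -> str:
        """Serialize into a bounded prompt block, capping each list."""
        sections = [
            ("Agreements", self.agreements),
            ("Contradictions", self.contradictions),
            ("Partial results", self.partial_results),
            ("Open subgoals", self.open_subgoals),
        ]
        return "\n".join(
            f"{name}:\n" + "\n".join(f"- {item}" for item in items[:max_items])
            for name, items in sections
        )
```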
Progressive-Hint Prompting (PHP) demonstrates the iterative refinement pattern at the prompting level. Previous answers are fed back as "hints" to guide subsequent reasoning — the question and prior answer are combined to re-prompt the LLM, repeating until the answer stabilizes across two consecutive iterations. PHP is orthogonal to CoT and self-consistency, allowing combination. However, the hint-based anchoring mechanism potentially compounds errors: if an early answer is confidently wrong, subsequent iterations may anchor to it rather than escape. This is iterative refinement at the prompt level reproducing the same slow-timescale overthinking that training-level methods exhibit. Source: Arxiv/Prompts Prompting.
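A sketch of PHP's loop and stopping rule, reusing the assumed `llm` and `extract_answer` helpers from above; the hint phrasing approximates the paper's prompts rather than quoting them.

```python
# Re-prompt with all prior answers as hints until the answer is stable
# across two consecutive iterations. Note the anchoring risk: a wrong
# early answer stays in the hint list for every later round.

def progressive_hint(question: str, llm, extract_answer,
                     max_rounds: int = 8) -> str:
    hints: list[str] = []
    prev = None
    for _ in range(max_rounds):
        suffix = f" (Hint: the answer is near {', '.join(hints)}.)" if hints else ""
        answer = extract_answer(llm(question + suffix))
        if answer == prev:
            return answer  # stable across two consecutive rounds: stop
        hints.append(answer)
        prev = answer
    return prev  # no convergence within budget; return the last answer
```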
Source: Test Time Compute, Self Refinement Self Consistency Feedback
Related concepts in this collection
- Does self-revision actually improve reasoning in language models? When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability. Relation: the within-inference version of this same failure.
- Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth. Relation: the alternative; applies at both timescales.
- Does extended thinking actually improve reasoning or just increase variance? When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability. Relation: the mechanism; variance inflation from sequential extension.
- Why does majority voting outperform more complex inference methods? Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models? Relation: the empirical resolution; relevant to iterative refinement too.
- Does limiting reasoning per turn improve multi-turn search quality? When language models engage in iterative search cycles, does capping reasoning at each turn, rather than just total compute, help preserve context for subsequent retrievals and improve overall search effectiveness? Relation: extends this note; ASearcher documents the same failure mode in multi-turn retrieval search; the timescale is the retrieval cycle, the mechanism is identical.
- Does supervised fine-tuning actually improve reasoning quality? While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making. Relation: extends the generalization; the quality-accuracy trade-off appears at training time (SFT), at test-time inference (overthinking), and at iterative revision (refinement): three timescales, same structural failure.
- When should retrieval happen during model generation? Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals. Relation: active retrieval offers an escape from the iterative refinement spiral: instead of revising from the same information, retrieve new evidence when uncertainty is detected; the failure of sequential revision is partly an information-poverty problem.
- Can models reason without generating visible thinking steps? Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning. Relation: architectural escape from the sequential-extension failure; latent recurrent models iterate in compressed hidden space rather than generating revision tokens, bypassing the variance inflation and anchoring bias that drive overthinking at every timescale.
- Do large language models use one reasoning style or many? Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios. Relation: cross-domain confirmation; DeepSeek-R1 exhibits "repeated self-doubt" loops in competitive games, reproducing the overthinking pattern in strategic reasoning, where longer chains signal hesitation, not depth.
- Why do reasoning models struggle with theory of mind tasks? Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning. Relation: ToM is the clearest case of unproductive reasoning effort; models produce significantly longer traces for social reasoning than for factual questions, yet effort is uncorrelated with accuracy, the overthinking failure applied to an entirely different cognitive domain.
- Do personality types shape how AI agents make strategic choices? This research explores whether priming LLM agents with MBTI personality profiles causes them to adopt different strategic behaviors in games. Understanding this matters for designing AI systems optimized for specific tasks. Relation: introversion priming produces longer rationales and deeper deliberation, which may interact with the overthinking failure; personality conditioning could modulate the threshold at which sequential refinement degrades rather than improves.
- When should an agent actually stop and deliberate? How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors. Relation: SAND provides a principled solution to the overthinking failure at the action level; self-consistency-based gating ensures deliberation occurs only at uncertain steps, preventing the universal-deliberation trap that reproduces overthinking at the agentic timescale.
- Do reasoning models switch between ideas too frequently? Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy. Relation: same structural failure at a faster timescale; underthinking switches between reasoning threads within a single call, iterative refinement switches between solution attempts across calls; TIP's transition penalties suggest an analogous fix for refinement loops.
Original note title: iterative refinement methods reproduce the overthinking failure mode at slower timescales