Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Do iterative refinement methods suffer from overthinking?

Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?

Note · 2026-02-20 · sourced from Test Time Compute: How should we allocate compute budget at inference time?

The overthinking failure documented in o1-like models — sequential token extension degrades accuracy beyond a critical threshold — has a structural analog in iterative refinement methods like Self-Refine, Reflexion, and self-consistency loops.

Both approaches share the same architecture:

  1. Generate an initial response
  2. Produce a critique or evaluation
  3. Generate a revision based on the critique
  4. Repeat

In o1-like models, this happens within a single inference call, at the token level. In iterative refinement methods, this happens across multiple inference calls, at the response level. The timescale differs; the structure is identical.
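To make the structural identity concrete, here is a minimal sketch of that loop across inference calls, in the style of Self-Refine. The `llm()` helper and the prompt wording are illustrative assumptions rather than any paper's actual implementation; the point is only that steps 1-4 above appear verbatim.

```python
# Minimal sketch of the shared generate-critique-revise loop, assuming a
# generic llm(prompt) -> str completion call (illustrative, not a specific
# library's API). Self-Refine-style methods run this across inference calls;
# o1-like models run the same loop implicitly at the token level.

def llm(prompt: str) -> str:
    """Placeholder for a single model call."""
    raise NotImplementedError

def iterative_refine(task: str, max_rounds: int = 4) -> str:
    response = llm(f"Solve the task:\n{task}")            # 1. initial response
    for _ in range(max_rounds):
        critique = llm(                                   # 2. critique / evaluation
            f"Task:\n{task}\n\nCandidate answer:\n{response}\n\n"
            "List concrete problems with this answer, or say 'LGTM'."
        )
        if "LGTM" in critique:
            break
        response = llm(                                   # 3. revision from critique
            f"Task:\n{task}\n\nPrevious answer:\n{response}\n\n"
            f"Critique:\n{critique}\n\nWrite an improved answer."
        )                                                 # 4. repeat
    return response
```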

The empirical evidence predicts the same failure mode. Does self-revision actually improve reasoning in language models? shows that within-inference revision tends to hurt, and the Self-Refine paper itself reports mixed results: self-reflection improves TruthfulQA performance but decreases performance on HotpotQA. This is exactly what the overthinking literature predicts: revision helps when the initial response is factually uncertain and hurts when the task requires multi-step reasoning, where revision introduces noise.

PDR provides a counterexample: iterative refinement can avoid overthinking when memory is compressed between iterations. Parallel-Distill-Refine (PDR, discussed in Reasoning Beyond the Rug) runs short iterations that read a bounded summary, write a refinement, and re-synthesize a fresh summary. Unlike long CoT or standard iterative refinement, PDR compresses evidence between rounds: the model does not carry forward its full reasoning history, only a compact distillation. This breaks the overthinking dynamic, because each iteration starts from a compressed state rather than accumulating noise from all previous iterations. PDR outperforms long-trace baselines at matched compute (+11% on AIME 2024, +9% on AIME 2025), showing that "evidence accumulation via bounded summaries can substitute for long reasoning traces while holding latency fixed." The insight: the overthinking failure is not inherent to iteration; it is inherent to unbounded accumulation. Compress between rounds and the failure mode disappears. Since ReBalance uses confidence as a continuous indicator to dynamically steer between overthinking and underthinking, PDR's bounded memory and ReBalance's confidence-based steering are complementary solutions to the same underlying problem: preventing reasoning from crossing the quality threshold.
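A rough sketch of what "compress between rounds" means in code, reusing the illustrative `llm()` helper from the earlier sketch. The character cap and prompt wording are assumptions; what matters is that the carried-over state is re-synthesized and bounded each round instead of growing.

```python
# Sketch of bounded-memory refinement: each round reads only a compact
# summary, never the full reasoning history, and the summary is rebuilt
# from scratch (not appended to) after every round. Reuses the illustrative
# llm() placeholder defined above; the bound and prompts are assumptions.

MAX_SUMMARY_CHARS = 2000  # illustrative cap on carried-over state

def refine_with_bounded_memory(task: str, rounds: int = 4) -> str:
    summary = ""                                   # compressed state, starts empty
    answer = ""
    for _ in range(rounds):
        answer = llm(                              # short iteration reads the bounded summary
            f"Task:\n{task}\n\nWorking summary (may be empty):\n{summary}\n\n"
            "Produce your best current answer."
        )
        summary = llm(                             # re-synthesize a fresh summary
            f"Task:\n{task}\n\nLatest answer:\n{answer}\n\n"
            f"Write a summary of at most {MAX_SUMMARY_CHARS} characters covering "
            "confirmed facts, contradictions, and open subgoals. Do not include "
            "the full reasoning."
        )[:MAX_SUMMARY_CHARS]                      # hard cap keeps accumulation bounded
    return answer
```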

The parallel alternative applies at both timescales. Instead of sequential revision (iterating until convergence), generate multiple independent candidates in parallel and aggregate by majority vote. The argument in Why does majority voting outperform more complex inference methods? applies equally to iterative refinement: diverse independent reasoning beats iterated single-path refinement.
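For contrast, a sketch of that parallel alternative: independent samples aggregated by majority vote over the extracted final answer. `sample()` and `extract_answer()` are illustrative helpers, not a specific library's API, and the answer-extraction convention is an assumption.

```python
# Sketch of self-consistency-style aggregation: sample independent reasoning
# paths and return the most frequent final answer. Assumes tasks with a
# short, canonicalizable answer (e.g. a number) and an "Answer: ..." format.

from collections import Counter

def sample(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder for one stochastic model call."""
    raise NotImplementedError

def extract_answer(response: str) -> str:
    """Pull out the final answer, here the text after the last 'Answer:'."""
    return response.rsplit("Answer:", 1)[-1].strip()

def majority_vote(task: str, n: int = 16) -> str:
    candidates = [
        sample(f"Solve step by step, end with 'Answer: ...'\n{task}")
        for _ in range(n)                          # independent reasoning paths
    ]
    votes = Counter(extract_answer(c) for c in candidates)
    return votes.most_common(1)[0][0]              # most frequent answer wins
```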

This connection bridges the test-time scaling batch to the Self Refinement literature. The overthinking insight isn't just about thinking tokens — it's about sequential-over-parallel as a general failure mode that appears at any timescale.

PDR as a fix: The Parallel-Distill-Refine framework addresses iterative refinement's core failure by introducing a bounded distillation workspace between iterations. Instead of appending all prior attempts to context (recreating long-context failures) or forgetting them (losing progress), PDR generates a compact summary listing agreements, contradictions, intermediate results, and open subgoals. Each new iteration starts fresh but with accumulated wisdom. The four meta-skills required (verification, refinement, compression, and diversification) map directly to the failure modes: anchoring bias is addressed by diversification, forgetfulness by compression, and noise injection by verification. RL training that makes the model consistent with PDR as the inference method further narrows the train-test gap.
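A hypothetical sketch of one PDR round under this reading: parallel drafts that read only the compact workspace, followed by a distill step that replaces the workspace rather than appending to it. The prompt wording, round counts, and the use of the four headings as literal prompt text are assumptions layered on the description above, not the paper's exact procedure.

```python
# Sketch of a Parallel-Distill-Refine round, reusing the illustrative llm()
# placeholder. The workspace is a single bounded text block whose headings
# mirror the note's list: agreements, contradictions, intermediate results,
# open subgoals. Everything here is a sketch, not the paper's implementation.

def pdr_round(task: str, workspace: str, width: int = 4) -> str:
    # Parallel: each draft starts fresh, reading only the compact workspace.
    drafts = [
        llm(f"Task:\n{task}\n\nShared notes so far:\n{workspace or '(none)'}\n\n"
            "Draft a complete solution.")
        for _ in range(width)
    ]
    # Distill: replace (not append to) the workspace with a compact summary.
    return llm(
        "Compare the drafts below and write compact notes under four headings: "
        "Agreements, Contradictions, Intermediate results, Open subgoals.\n\n"
        + "\n\n--- DRAFT ---\n".join(drafts)
    )

def pdr(task: str, rounds: int = 3, width: int = 4) -> str:
    workspace = ""
    for _ in range(rounds):
        workspace = pdr_round(task, workspace, width)
    # Final refine: one answer conditioned only on the last bounded workspace.
    return llm(f"Task:\n{task}\n\nNotes:\n{workspace}\n\nWrite the final answer.")
```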

Progressive-Hint Prompting (PHP) demonstrates the iterative refinement pattern at the prompting level. Previous answers are fed back as "hints" to guide subsequent reasoning: the question and prior answer are combined to re-prompt the LLM, repeating until the answer stabilizes across two consecutive iterations. PHP is orthogonal to CoT and self-consistency, so the methods can be combined. However, the hint-based anchoring mechanism can compound errors: if an early answer is confidently wrong, subsequent iterations may anchor to it rather than escape it. This is iterative refinement at the prompt level, reproducing the same slow-timescale overthinking documented for the other iterative refinement methods above. Source: Arxiv/Prompts, Prompting.
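A small sketch of the PHP loop as described, again reusing the illustrative `llm()` and `extract_answer()` helpers from the earlier sketches. The hint phrasing and the two-consecutive-answers stopping rule follow the description above; everything else (round cap, prompt format) is assumed.

```python
# Sketch of Progressive-Hint Prompting: prior answers are fed back as hints
# and the loop stops once two consecutive answers agree. Note how the hint
# list grows monotonically, which is exactly where anchoring can set in.

def progressive_hint(task: str, max_rounds: int = 8) -> str:
    hints: list[str] = []
    prev_answer = ""
    for _ in range(max_rounds):
        hint_text = f"(Hint: the answer is near {', '.join(hints)})" if hints else ""
        response = llm(
            f"{task} {hint_text}\nThink step by step, end with 'Answer: ...'"
        )
        answer = extract_answer(response)
        if answer == prev_answer:        # stabilized across two consecutive iterations
            return answer
        hints.append(answer)             # feed the latest answer back as a hint
        prev_answer = answer
    return prev_answer
```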


Source: Test Time Compute, Self Refinement Self Consistency Feedback

Original note title: iterative refinement methods reproduce the overthinking failure mode at slower timescales