What makes multi-turn critique trajectories more effective than single-turn reasoning chains?


This explores why a back-and-forth process of critique and revision tends to beat one long uninterrupted reasoning chain. The corpus suggests the answer isn't that multi-turn critique adds more thinking — it's that long single-turn chains have characteristic ways of going wrong that critique structurally counteracts.

Start with how single chains fail. Reasoning models don't usually fail for lack of compute; they fail through disorganization: wandering down invalid paths and, paradoxically, abandoning promising ones too early (Why do reasoning models abandon promising solution paths?, Do reasoning models switch between ideas too frequently?). Longer isn't automatically better either: accuracy against chain length follows an inverted-U, peaking at an intermediate length and declining as chains sprawl (Why does chain of thought accuracy eventually decline with length?). And fluent-looking reflection doesn't equal competence: DeepSeek-R1 and o1-preview sustain long reflective chains yet reach only 20-23.6% exact match on constraint satisfaction problems that demand genuine backtracking (Can reasoning models actually sustain long-chain reflection?). A single chain, in other words, has no built-in mechanism to notice it's wandering or to keep its options open.

Critique trajectories supply that mechanism. The most striking finding is that step-level critique woven into training preserves *exploration diversity* — it counteracts "tail narrowing," the tendency of self-training to prematurely collapse onto one family of solutions (Do critique models improve diversity during training itself?). That maps directly onto the single-chain failure mode of premature path-switching and early convergence: critique forces breadth where a lone chain rushes to depth. The same logic appears in work showing that allocating compute to diverse abstractions enforces breadth-first search and prevents underthinking (Can abstractions guide exploration better than depth alone?).

There's a subtler structural reason too. Reasoning is steered by a few high-leverage "thought anchors" — planning and backtracking sentences that pivot everything after them (Which sentences actually steer a reasoning trace?). A multi-turn critique loop is essentially a way to manufacture more, better backtracking pivots from the outside, rather than hoping the model generates them on its own mid-stream. Turn boundaries also protect context: research on long-horizon search shows that capping reasoning *per turn* prevents a single bloated turn from eating the context window future steps need (Does limiting reasoning per turn improve multi-turn search quality?). Multiple turns keep each unit of reasoning short enough to stay coherent.
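The per-turn cap can be illustrated with a toy accounting sketch. This is not the method from the cited note; the function names, the whitespace "tokenizer", and the budget numbers are all assumptions made for illustration. It only shows how one oversized turn can crowd out later turns, and how a cap leaves room for them.

```python
from typing import List


def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens whitespace-separated tokens.

    Whitespace splitting is a crude stand-in for a real tokenizer (assumption).
    """
    return " ".join(text.split()[:max_tokens])


def run_capped_turns(turn_outputs: List[str], context_limit: int, per_turn_cap: int) -> List[str]:
    """Add each turn's reasoning to the context, capped per turn.

    Stops as soon as the next turn would overflow the (hypothetical) context window.
    """
    context: List[str] = []
    used = 0
    for raw in turn_outputs:
        capped = truncate_to_budget(raw, per_turn_cap)
        cost = len(capped.split())
        if used + cost > context_limit:
            break  # no room left for this turn or any later one
        context.append(capped)
        used += cost
    return context


# One bloated turn followed by two short ones (made-up data).
turns = ["a " * 50, "b " * 5, "c " * 5]
with_cap = run_capped_turns(turns, context_limit=20, per_turn_cap=8)
without_cap = run_capped_turns(turns, context_limit=20, per_turn_cap=50)
```

With the cap, all three turns fit in the window; without it, the first turn alone exceeds the limit and nothing after it can be incorporated.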

The quietly destabilizing note is that chain-of-thought may be closer to pattern-matched imitation than genuine inference: format and structure drive it far more than logical content, and invalid CoT prompts often work as well as valid ones (What makes chain-of-thought reasoning actually work?). If a single chain is reproducing the *form* of reasoning rather than verifying it, then an external critique turn, a second pass whose whole job is to check rather than continue, is doing work the chain itself never actually does. That reframes multi-turn critique not as "more reasoning" but as the thing that adds verification a single chain only pretends to have.
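The split between a turn that continues and a turn that checks can be sketched as a small loop. This is a minimal illustration under stated assumptions, not an implementation from any of the cited notes: `critique_loop`, `toy_propose`, and `toy_check` are hypothetical names, and the toy task (finding an integer whose square is 49) stands in for a real model and a real critic.

```python
from typing import Callable, Optional, Tuple


def critique_loop(
    propose: Callable[[Optional[int], Optional[str]], int],
    check: Callable[[int], Optional[str]],
    max_turns: int,
) -> Tuple[Optional[int], int]:
    """Alternate a proposing turn with a separate checking turn.

    `check` returns None when the candidate passes, otherwise a critique string
    that is fed into the next proposal.
    """
    candidate: Optional[int] = None
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        candidate = propose(candidate, feedback)
        feedback = check(candidate)  # verification happens outside the proposer
        if feedback is None:
            return candidate, turn
    return None, max_turns  # budget exhausted without a verified answer


def toy_propose(prev: Optional[int], feedback: Optional[str]) -> int:
    """Hypothetical proposer: starts at 5, then moves in the direction the critique suggests."""
    if prev is None:
        return 5
    return prev + 1 if feedback == "too low" else prev - 1


def toy_check(candidate: int) -> Optional[str]:
    """Hypothetical checker: verifies candidate * candidate == 49."""
    if candidate * candidate == 49:
        return None
    return "too low" if candidate * candidate < 49 else "too high"


answer, turns_used = critique_loop(toy_propose, toy_check, max_turns=5)
```

The design point is that `toy_check` never extends the proposal; it only accepts or objects, which is the role the paragraph above assigns to the critique turn.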

Sources 10 notes

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Which sentences actually steer a reasoning trace?

Counterfactual resampling, attention analysis, and causal suppression all identify planning and backtracking sentences as thought anchors—sparse critical points that guide subsequent reasoning. These are functional pivots, not noise.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
