Why does revision often make reasoning accuracy worse in frontier models?

This explores why a model rewriting its own reasoning tends to lock in errors rather than fix them — and what separates the revisions that help from the ones that hurt.

This explores why revision so often backfires in frontier models, and the corpus points to a sharp answer: it's not the act of revising that hurts, it's the *source* of the critique. When a model revises its own uncertain output, it typically amplifies confidence in the wrong answer instead of correcting it — but revision guided by an external critic improves accuracy. The lever is whose judgment drives the rewrite, not whether a rewrite happens at all Does revising your own reasoning actually help or hurt?. A model grading its own work has no independent vantage point, so revision becomes a confidence-laundering step rather than an error-finding one.

There's a mechanical reason self-revision drags accuracy down: failed reasoning doesn't get cleanly discarded. The single strongest predictor of whether a model gets the right answer is the *fraction of steps sitting in abandoned branches* — not how long the trace is, and not how much the model reviews. Dead-end reasoning lingers in context and actively biases what comes next, an effect confirmed by directly editing those branches out Does failed-step fraction predict reasoning quality better?. Revision compounds this: each rewrite piles more abandoned scaffolding into the context the model is reasoning over, so the model is increasingly reading its own wrong turns back to itself.

This connects to a broader pattern where more reasoning effort stops helping past a point. Accuracy follows an inverted-U against chain-of-thought length — it climbs, peaks, then declines — and pushing thinking tokens from ~1K to ~16K dropped one benchmark from 87% to 70% Why does chain of thought accuracy eventually decline with length? Does more thinking time always improve reasoning accuracy?. Revision is one of the easiest ways to blow past that peak. Relatedly, models that 'wander' and switch paths prematurely reason like tourists, not scientists — and simply penalizing thought-switching during decoding raises accuracy with no retraining, because the better solution was already there and got abandoned Why do reasoning models abandon promising solution paths? Do reasoning models switch between ideas too frequently?. Self-revision is institutionalized path-switching.

The surprising twist is that the corpus also shows the model often had the right answer earlier and revised away from it. Sampling completions from *intermediate* points in a reasoning trace and taking the most common answer beats the final conclusion by up to 13%, because early commitment narrows the solution space Can intermediate reasoning points yield better answers than final ones?. So revision isn't just failing to add value — it can walk the model off a correct intermediate answer toward a worse final one.

What actually rescues revision is an outside signal. The same self-doubt that degrades a vanilla model's extended thinking gets converted into productive gap-analysis by RL training — the mechanism is identical, only the training changes its sign Does extended thinking help or hurt model reasoning?. If you want a takeaway you didn't come looking for: the cheapest reliable fix for bad revision isn't a smarter reviser — it's a *different* one. Routing a query to a second model, or grounding revision in external critique, consistently beats a model relitigating its own answer in its own head Can routing beat building one better model?.

Sources 9 notes

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Does failed-step fraction predict reasoning quality better?

Across 10 reasoning models, the fraction of steps in abandoned branches consistently predicts correctness better than CoT length or review ratio. Failed branches persist in context and bias subsequent reasoning, a phenomenon confirmed through correlation, reranking, and direct causal editing.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Can intermediate reasoning points yield better answers than final ones?

Segmenting reasoning traces into subthoughts and prompting completions from each intermediate point yields mode answers up to 13% more accurate than final answers. This works because it mines alternative paths before early commitment narrows the solution space.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Can routing beat building one better model?

Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.

Why does revision often make reasoning accuracy worse in frontier models?

Sources 9 notes

Next inquiring lines