Does a model improve by arguing with itself?
When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
ReConcile (multi-LLM round-table with confidence-weighted voting) isolates a failure mode that earlier work had observed but not mechanistically explained: Degeneration-of-Thought.
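To make the voting mechanism concrete, a minimal Python sketch of confidence-weighted aggregation in the ReConcile style. The `AgentVote` structure, the `confidence_weighted_vote` function, and the raw-confidence weighting are illustrative assumptions, not ReConcile's exact implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class AgentVote:
    answer: str        # the agent's proposed final answer
    confidence: float  # self-reported confidence in [0, 1]

def confidence_weighted_vote(votes: list[AgentVote]) -> str:
    """Pick the answer with the largest total confidence mass,
    rather than simply counting how many agents proposed it."""
    scores: defaultdict[str, float] = defaultdict(float)
    for vote in votes:
        scores[vote.answer] += vote.confidence
    return max(scores, key=scores.get)

# One confident dissenter can outweigh two hesitant agreers.
votes = [AgentVote("A", 0.3), AgentVote("A", 0.3), AgentVote("B", 0.9)]
assert confidence_weighted_vote(votes) == "B"
```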
The pattern: when a model is asked to reconsider its answer in response to a challenge from itself — its own previous reasoning reframed as external criticism — it doesn't maintain its position or improve it. It capitulates. And crucially, it does so with increasing confidence. The model ends more certain of the wrong answer than it was before self-revision began.
This is worse than no revision at all. Single-model self-reflection degrades not just accuracy but calibration. The model convinces itself.
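A sketch of how this failure mode can be probed, under stated assumptions: `llm` is a hypothetical callable returning an answer, its reasoning, and a self-reported confidence, and the prompt templates are illustrative rather than drawn from any of the cited papers.

```python
def degeneration_probe(llm, question: str, rounds: int = 3) -> list[tuple[str, float]]:
    """Reframe the model's own reasoning as an external challenge and
    track how its answer and self-reported confidence drift."""
    answer, reasoning, confidence = llm(
        f"Question: {question}\n"
        "Give your answer, your reasoning, and a confidence in [0, 1]."
    )
    history = [(answer, confidence)]
    for _ in range(rounds):
        # The "critic" is the model itself: its previous reasoning,
        # re-presented as if someone else had written it.
        answer, reasoning, confidence = llm(
            f"Question: {question}\n"
            f"A critic responds to your answer '{answer}' with this argument:\n"
            f"{reasoning}\n"
            "Reconsider. Give your answer, reasoning, and confidence in [0, 1]."
        )
        history.append((answer, confidence))
    # Degeneration-of-Thought shows up as the answer drifting away from a
    # correct initial position while confidence rises instead of falling.
    return history
```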
The contrast with multi-agent debate is sharp. When diverse models challenge each other's reasoning, accuracy improves. The same model that capitulates to its own previous reasoning holds up better when genuinely different reasoning challenges it. The diversity of the external challenge is load-bearing — homogeneous multi-agent systems (same model, multiple instances) degrade similarly to self-revision.
The mechanism: self-revision exposes the model to its own rhetorical patterns. It finds its own argument familiar and fluently framed, and reads that familiarity as the same confidence signal it would read in a strong external argument. Diverse multi-agent debate introduces framing and vocabulary the model did not generate, which it must evaluate on logical rather than stylistic grounds.
This sits alongside "Does self-revision actually improve reasoning in language models?" but adds the contrastive finding: self-revision degrades; diverse debate improves. The key variable is not the number of revision steps but the source of the challenge. "Why does parallel reasoning outperform single chain thinking?" maps the same pattern at the token level; here it appears at the agent level, where parallel diversity likewise beats sequential revision.
The implication: "self-reflection" as a prompting technique is not a universal improvement. It is specifically harmful when the model is the only source of disagreement. Genuine improvement requires external diversity — either multiple distinct models or structured dissent mechanisms.
Three root causes of DoT (from the MAD framework): The Multi-Agent Debate paper identifies three specific causes of Degeneration-of-Thought:
- Bias and distorted perception: self-perception is shaped by biases and preconceived notions learned from pretraining data, leading to instinctively inaccurate conclusions.
- Rigidity and resistance to change: the model holds rigid beliefs and struggles to engage in self-reflection that challenges its own assumptions.
- Limited external feedback: self-reflection is purely internal, missing the alternative viewpoints and blind spots that external feedback would surface.
Multi-agent debate is explicitly framed as an "encouragement of divergent thinking": it creates the external pressure that breaks rigidity and supplies the feedback loop that self-reflection lacks. The three causes map to three failure dimensions: epistemic (biased priors), motivational (change resistance), and architectural (no external signal).
Society of Minds foundation (Du et al.): The Du et al. "Improving Factuality and Reasoning through Multiagent Debate" paper provides the foundational empirical grounding and the "Society of Mind" framing (after Minsky). In their setup, multiple model instances individually propose responses, then each reads and critiques all others' responses and updates its own answer over multiple rounds. The key structural element: each agent must construct an answer consistent with both its internal critic AND sensible peer assessments — dual coherence requirements that single-model self-revision lacks. This paper documents significant gains in mathematical and strategic reasoning across multiple tasks, and was an early demonstration that diverse external challenge is load-bearing for reasoning improvement.
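A minimal sketch of the debate loop as described above: propose, read all peers, revise over rounds. The `agents` callables and the prompt wording are placeholder assumptions, not Du et al.'s exact templates.

```python
def society_of_minds_debate(agents, question: str, rounds: int = 2) -> list[str]:
    """Du et al.-style debate: each agent proposes an answer, then over
    several rounds reads every peer's answer and updates its own."""
    answers = [
        agent(f"Question: {question}\nGive your answer with reasoning.")
        for agent in agents
    ]
    for _ in range(rounds):
        updated = []
        for i, agent in enumerate(agents):
            peers = "\n".join(
                f"Agent {j}: {a}" for j, a in enumerate(answers) if j != i
            )
            # Dual coherence requirement: the update must square the
            # agent's own reasoning with sensible peer assessments.
            updated.append(agent(
                f"Question: {question}\n"
                f"Your previous answer:\n{answers[i]}\n"
                f"Other agents answered:\n{peers}\n"
                "Assess the peer answers and give your updated answer."
            ))
        answers = updated
    return answers  # typically reduced to one answer by majority vote

```

Diversity enters through the agents list: per the finding above, distinct models rather than multiple instances of one model is what keeps this loop from degrading into collective self-revision.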
Source: Argumentation
Related concepts in this collection
- Does self-revision actually improve reasoning in language models? When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability. Relation: base finding; this note adds the mechanism and the contrastive multi-agent finding.
- Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth. Relation: same pattern at the token level; parallel diversity beats sequential self-revision.
- Why do multi-agent LLM systems converge without real debate? When multiple AI agents reason together, do they genuinely deliberate or just accommodate each other's views? Research into clinical reasoning systems reveals how often agents reach agreement without substantive disagreement. Relation: the multi-agent version of the same convergence problem.
- Why does majority voting outperform more complex inference methods? Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models? Relation: converging evidence.
- Can agents learn from failure without updating their weights? Explores whether language models can improve through trial-and-error by storing reflections in memory rather than through gradient-based parameter updates. Tests if environmental feedback alone can drive learning. Relation: architectural solution; Reflexion avoids degeneration-of-thought by grounding reflection in binary environmental outcomes, not self-assessment.
- Can storing evolved thoughts prevent inconsistent reasoning in conversations? When LLMs repeatedly reason over the same conversation history for different questions, they produce inconsistent results. Can storing pre-reasoned thoughts instead of raw history solve this problem? Relation: TiM's post-thinking operates on the same terrain; repeated reasoning over the same material risks degeneration, so TiM reasons once during a consolidation phase and stores the result.
- Can AI systems detect when they've genuinely reached agreement? When multiple AI agents debate, they often converge without actually deliberating. Can a dedicated agent reliably identify true agreement versus false consensus, and would that improve debate outcomes? Relation: agreement detection is the architectural safeguard against multi-agent degeneration; explicit verification that convergence is evidence-based prevents the premature accommodation that produces the same confidence-amplification failure at the group level.
- Do prior errors in context history amplify future errors? When a language model makes mistakes early in a task, do those errors contaminate subsequent predictions? We explore whether error accumulation degrades long-horizon performance through passive context pollution rather than capability limits. Relation: self-conditioning is the passive version of degeneration-of-thought; DoT actively amplifies confidence in wrong answers through deliberate re-examination, while self-conditioning passively degrades accuracy through context contamination. Both are single-source error amplification.
- Can multiple LLMs coordinate without explicit collaboration rules? When multiple language models share a concurrent key-value cache, do they spontaneously develop coordination strategies? This matters because it could reveal how reasoning models naturally collaborate and inform more efficient parallel inference. Relation: alternative to turn-based debate; Hogwild! enables real-time multi-instance interaction through shared memory rather than discrete message-passing, providing the external diversity that prevents degeneration-of-thought while avoiding the latency of sequential debate rounds.
- Why does self-correction training on offline data fail? Can language models learn to correct their own mistakes through supervised training on correction examples? This explores whether distribution mismatch and behavior collapse prevent self-correction from emerging. Relation: SCoRe offers a training-time solution to degeneration-of-thought; by training self-correction under the model's own error distribution with RL, the model learns to genuinely correct rather than capitulate, addressing the root cause (untrained self-revision) rather than the symptom (multi-agent workaround).
- How quickly do errors compound during model self-training? When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible. Relation: the training-time version; DoT amplifies confidence in wrong answers within a single inference through self-revision, while error avalanching amplifies errors across self-training iterations through learning from mistakes. Both are single-source error loops where the model's own outputs serve as an unreliable correction signal.
- Can generative and discriminative models reach agreement? Generative and discriminative decoding often produce conflicting answers. Can a game-theoretic framework force these two complementary procedures to reconcile their predictions into a single, more reliable output? Relation: the Consensus Game provides within-model diversity that prevents DoT; instead of self-revision (where the model capitulates to its own framing), Equilibrium-Ranking forces generative and discriminative procedures to reach genuine agreement, achieving multi-agent benefits without the single-source collapse.
Original note title: degeneration of thought is a distinct failure mode where single-model self-revision amplifies confidence in wrong answers while multi-agent debate prevents it