Does training on self-play disagreement data improve multi-agent reasoning outcomes?

This explores whether letting agents argue with each other — and training on the friction those disagreements produce — actually makes multi-agent reasoning better, or whether disagreement is just noise.

Put differently, the question is whether disagreement between agents is a *training signal* worth harvesting rather than a coordination problem to suppress. The corpus contains no single paper that combines self-play with disagreement data under that name, but it does contain the component parts. Read side by side, they point toward a qualified yes, with sharp caveats.

The strongest evidence that disagreement helps reasoning is structural. "Can dialogue format help models reason more diversely?" shows that forcing a *single* model to reason as a back-and-forth between distinct agents beats monologue reasoning on diversity and coherence: the disagreement format itself breaks the fixed-strategy rut that solo reasoning falls into. "Can branching prompts replicate what multi-agent systems do?" pushes this further: structured multi-persona prompting within one model maps directly onto multi-agent debate architectures and reaches equivalent outcomes, suggesting the *gains come from the adversarial structure*, not from spinning up separate models. And "Can formal argumentation make AI decisions truly contestable?" gives the cleanest version of why disagreement is informative: attack/defense graphs make explicit which premises are actually contested, which is exactly the data a training signal would want to capture.
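To make the attack/defense idea concrete, here is a minimal sketch of a Dung-style abstract argumentation framework and its grounded extension (the set of arguments that survive every attack once their defenders are taken into account). The function name `grounded_extension` and the toy arguments are illustrative assumptions, not taken from any of the cited notes.

```python
from typing import Dict, Iterable, Set, Tuple


def grounded_extension(
    arguments: Iterable[str], attacks: Set[Tuple[str, str]]
) -> Set[str]:
    """Return the grounded extension of an abstract argumentation framework.

    `attacks` holds (attacker, target) pairs. The grounded extension is the
    least fixed point of the "defended by" function, reached by iterating
    from the empty set.
    """
    args = set(arguments)
    attackers: Dict[str, Set[str]] = {a: set() for a in args}
    for src, dst in attacks:
        attackers[dst].add(src)

    accepted: Set[str] = set()
    while True:
        # An argument is defended when every one of its attackers is itself
        # attacked by some already-accepted argument.
        defended = {
            a
            for a in args
            if all(
                any((d, b) in attacks for d in accepted)
                for b in attackers[a]
            )
        }
        if defended == accepted:
            return accepted
        accepted = defended


# Toy debate (hypothetical): b attacks a, c attacks b, so c reinstates a.
toy_attacks = {("b", "a"), ("c", "b")}
print(sorted(grounded_extension(["a", "b", "c"], toy_attacks)))  # ['a', 'c']
```

In this toy case `b` is the contested premise: it is attacked and has no defender, so it drops out. A training pipeline could, under these assumptions, log which arguments fall outside the grounded extension as the "contested" part of a debate.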

The self-play side is where it gets interesting. "Can agents learn beyond what their training data shows?" supplies the motive: agents trained only on static expert data can't learn from their own failures and are capped by what curators imagined. Self-play is the escape hatch. "Can adversarial critics replace task-specific verifiers for reasoning?" is the closest thing in the corpus to your literal question. RARO runs an adversarial game in which a critic learns to discriminate expert answers from the policy's own, and that disagreement signal *replaces* hand-built, domain-specific verifiers while matching the scaling properties of verifier-based RL. That's training-on-disagreement in everything but name. Relatedly, "Can model confidence work as a reward signal for reasoning?" shows that internally generated preference signals (in RLSF, confidence gaps between reasoning traces) can strengthen reasoning without human labels or external verifiers: the same "mine your own disagreement" logic.
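The "mine your own disagreement" logic can be sketched as a ranking step that turns confidence gaps between traces into synthetic preference pairs. This is an illustrative sketch only, not RLSF's actual procedure: the `Trace` type, the `confidence` field, and the `min_gap` threshold are all assumptions introduced here.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Trace:
    text: str
    # Hypothetical score: the model's confidence in its final answer span.
    confidence: float


def preference_pairs(
    traces: List[Trace], min_gap: float = 0.1
) -> List[Tuple[Trace, Trace]]:
    """Build (chosen, rejected) pairs from traces for the same prompt.

    A pair is kept only when the confidence gap is at least `min_gap`,
    so near-ties do not become training signal.
    """
    ranked = sorted(traces, key=lambda t: t.confidence, reverse=True)
    # combinations() preserves order, so the first element of each pair
    # always has confidence >= the second.
    return [
        (chosen, rejected)
        for chosen, rejected in combinations(ranked, 2)
        if chosen.confidence - rejected.confidence >= min_gap
    ]


traces = [
    Trace("trace A", 0.90),
    Trace("trace B", 0.60),
    Trace("trace C", 0.55),
]
for chosen, rejected in preference_pairs(traces):
    print(chosen.text, ">", rejected.text)
```

The resulting pairs could feed any preference-based objective. The threshold is a design choice: a larger `min_gap` yields fewer but cleaner pairs.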

But here's the part you didn't know you wanted to know: adding more disagreeing agents does *not* reliably mean better outcomes, and the failure isn't subtle. "Why do multi-agent systems fail to coordinate at scale?" shows agents failing to coordinate because they agree too late or adopt strategies without informing their neighbors. "Can LLM agent groups reliably reach consensus together?" shows agreement degrading as groups grow, not because agents get corrupted, but because they *time out and stall before converging*. Disagreement that never resolves is dead weight, not training signal. So the answer hinges on whether the disagreement is *resolved into a learnable preference* (RARO's critic, confidence ranking, argumentation graphs) or just left as unresolved conflict (raw consensus failure). "Can RL agents learn to reason better, not just succeed?" hints at the bridge: rewarding the *process* of reflection and monitoring, not just outcomes, is how you'd turn disagreement into a signal the model can actually train on.
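The distinction between resolved and unresolved disagreement can be sketched as a filter over debate logs. Everything here is a hypothetical illustration: the `Debate` record, its fields, and the rule that a unanimous final answer counts as "resolved" are assumptions, and consensus is only a proxy for correctness.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Debate:
    question: str
    initial_answers: List[str]  # one per agent, before discussion
    final_answers: List[str]    # one per agent, when the debate ended
    timed_out: bool             # True if the round budget ran out


def resolved_answer(debate: Debate) -> Optional[str]:
    """Return the consensus answer, or None if the debate never resolved."""
    unanimous = len(set(debate.final_answers)) == 1
    if debate.timed_out or not unanimous:
        return None
    return debate.final_answers[0]


def to_preference_examples(
    debates: List[Debate],
) -> List[Tuple[str, str, str]]:
    """Emit (question, chosen, rejected) triples from resolved debates only."""
    examples = []
    for debate in debates:
        winner = resolved_answer(debate)
        if winner is None:
            continue  # unresolved conflict: no learnable preference
        # Answers that agents started with but abandoned become "rejected".
        abandoned = sorted(set(debate.initial_answers) - {winner})
        examples.extend((debate.question, winner, loser) for loser in abandoned)
    return examples


logs = [
    Debate("q1", ["12", "15", "12"], ["12", "12", "12"], timed_out=False),
    Debate("q2", ["a", "b", "c"], ["a", "b", "b"], timed_out=True),
    Debate("q3", ["x", "x", "x"], ["x", "x", "x"], timed_out=False),
]
print(to_preference_examples(logs))  # [('q1', '12', '15')]
```

Only the first debate yields data: the second stalled, and the third never contained a disagreement to learn from.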

One grounding caution worth carrying: "Does chain-of-thought reasoning actually generalize beyond training data?" shows that reasoning which *looks* valid can be logically hollow outside the training distribution. Self-play disagreement that rewards persuasive-sounding traces rather than correct ones risks amplifying exactly that failure: fluent reasoning that doesn't generalize.

Sources (10 notes)

Can dialogue format help models reason more diversely?

DialogueReason, which structures a single model's internal reasoning as dialogue between distinct agents in separate scenes, overcomes monologue reasoning's fixed-strategy and fragmented-attention weaknesses, especially on tasks requiring multiple problem-solving approaches.

Can branching prompts replicate what multi-agent systems do?

Research shows single LLMs using dynamic persona simulation achieve multi-agent cognitive synergy without multiple model instances. Solo Performance Prompting validates that structured prompting techniques map directly to multi-agent debate architectures, enabling equivalent outcomes through structural equivalence.

Can formal argumentation make AI decisions truly contestable?

Dung-style argumentation structures AI outputs as traversable attack/defense graphs, allowing users to identify and contest specific premises. Standard LLM outputs lack this structure, making it impossible to pinpoint which claims users actually reject.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can adversarial critics replace task-specific verifiers for reasoning?

RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

Can LLM agent groups reliably reach consensus together?

Across hundreds of simulations, LLM-agent groups frequently fail to reach valid agreement due to timeouts and stalled convergence rather than subtle value corruption. Agreement degrades with group size even without Byzantine agents present.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
