Agentic and Multi-Agent Systems · Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Can multiple agents stay diverse when trained together?

Does training separate specialist agents on different data maintain the reasoning diversity that single-agent finetuning destroys? This matters because diversity correlates with accuracy and prevents models from becoming trapped in narrow response patterns.

Note · 2026-02-23 · sourced from Agents Multi

Single-agent self-improvement through iterative finetuning hits a wall fast. After one round of finetuning on its own generated outputs, performance saturates and begins to drop — the model becomes fixated on a narrow range of responses, limiting diversity and degrading accuracy. This is the training-time analog of Does a model improve by arguing with itself? at inference time: a single model trapped in its own distribution.
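
A minimal sketch of that loop makes the failure mode concrete. The helpers here (`generate`, `is_correct`, `finetune`, `evaluate`) are hypothetical placeholders for whatever training stack you use, not a specific library API:

```python
def self_improve(model, prompts, rounds=5):
    scores = []
    for _ in range(rounds):
        # Train on the model's own outputs, filtered for correctness.
        outputs = [(p, model.generate(p)) for p in prompts]
        train_set = [(p, o) for p, o in outputs if is_correct(p, o)]
        model = finetune(model, train_set)
        scores.append(evaluate(model))
    # Empirically, `scores` peaks after the first round and then declines:
    # each round narrows the distribution the next round trains on.
    return model, scores
```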

The multiagent finetuning framework (Du et al., 2025) proposes a structural fix: instead of training one model iteratively, train a society of models, each starting from the same base but independently specialized through distinct training data generated via multi-agent interactions. Generation agents produce initial responses; critic agents evaluate and refine them through debate. Each model sees different data because the interactions are role-dependent.
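
The shape of the setup, as a hedged sketch: agent counts, the consensus rule, and helpers like `debate_prompt`, `majority_vote`, `agrees`, and `finetune` are placeholders, not the paper's exact implementation.

```python
import copy

def multiagent_finetune(base_model, prompts, n_agents=3, rounds=5):
    # Every agent starts from the same base weights.
    generators = [copy.deepcopy(base_model) for _ in range(n_agents)]
    critics = [copy.deepcopy(base_model) for _ in range(n_agents)]
    for _ in range(rounds):
        gen_data = [[] for _ in generators]
        critic_data = [[] for _ in critics]
        for p in prompts:
            answers = [g.generate(p) for g in generators]
            # Each critic sees all generations and produces a refined response.
            critiques = [c.generate(debate_prompt(p, answers)) for c in critics]
            final = majority_vote(critiques)
            # Role-dependent data: generators keep their own answers that match
            # the consensus; critics keep the critiques that led to it.
            for i, a in enumerate(answers):
                if agrees(a, final):
                    gen_data[i].append((p, a))
            for i, cr in enumerate(critiques):
                if agrees(cr, final):
                    critic_data[i].append((debate_prompt(p, answers), cr))
        generators = [finetune(g, d) for g, d in zip(generators, gen_data)]
        critics = [finetune(c, d) for c, d in zip(critics, critic_data)]
    return generators, critics
```

The key property is that `gen_data[i]` and `critic_data[i]` come from different roles and different agents, so no two finetuning sets coincide.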

The mechanism works because role specialization prevents convergence to a single mode. When one model is trained to generate and another to critique, their training distributions diverge, maintaining the diversity that single-agent training destroys. The summarization step between debate rounds further helps by eliminating redundant information and retaining critical points — removing summarization hurts performance. Removing critics also degrades output quality, confirming that the evaluative role is load-bearing, not decorative.
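
The summarization step between rounds might look like the following; `summarizer` and the prompt wording are my assumptions, not the paper's prompts:

```python
def debate_round(question, agents, prev_answers, summarizer):
    # Between debate rounds, peers' answers are condensed into a summary
    # rather than concatenated verbatim, dropping redundant information.
    summary = summarizer.generate(
        "Summarize the key points of these answers:\n" + "\n".join(prev_answers)
    )
    prompt = (question
              + "\nSummary of other agents' answers:\n" + summary
              + "\nUsing this summary, revise your answer.")
    return [agent.generate(prompt) for agent in agents]
```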

This connects directly to Does policy entropy collapse limit reasoning performance in RL?: the entropy collapse that limits RL training is mitigated when multiple agents maintain distinct policy distributions. And given that narrow output ranges are themselves a diversity problem (Why do LLMs generate novel ideas from narrow ranges?), preserving diversity at training time through multi-agent specialization could address that output-time problem upstream.
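
A cheap way to see the effect is to measure entropy over sampled final answers, pooled across agents. This stdlib-only proxy is my own illustration, not a metric from the paper:

```python
import math
from collections import Counter

def answer_entropy(samples):
    # Shannon entropy over distinct final answers sampled from one model,
    # or pooled across the specialized agents.
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Single-agent iterative finetuning drives this toward 0 (all samples
# identical); multi-agent specialization keeps the pooled value higher.
```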

The cost is real: multiple model copies for both training and inference. But the finding that single-agent finetuning collapses after one iteration means the choice is not "cheap single-agent" vs "expensive multi-agent" but "one iteration of productive training" vs "sustained improvement across many rounds."


Source: Agents Multi


multi-agent finetuning preserves reasoning diversity by training agents on distinct data and roles — single-agent self-improvement saturates after one iteration