Agentic and Multi-Agent Systems · Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Can multiple agents stay diverse when trained together?

Does training separate specialist agents on different data maintain the reasoning diversity that single-agent finetuning destroys? This matters because diversity correlates with accuracy and prevents models from becoming trapped in narrow response patterns.

Note · 2026-02-23 · sourced from Agents Multi

Single-agent self-improvement through iterative finetuning hits a wall fast. After one round of finetuning on its own generated outputs, performance saturates and begins to drop — the model becomes fixated on a narrow range of responses, limiting diversity and degrading accuracy. This is the training-time analog of Does a model improve by arguing with itself? at inference time: a single model trapped in its own distribution.
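
A minimal sketch of that loop makes the failure mode concrete. The helpers here (`generate`, `is_correct`, `finetune`, `evaluate`) are hypothetical placeholders for whatever training stack you use, not a specific library API:

```python
def self_improve(model, prompts, rounds=5):
    scores = []
    for _ in range(rounds):
        # Train on the model's own outputs, filtered for correctness.
        outputs = [(p, model.generate(p)) for p in prompts]
        train_set = [(p, o) for p, o in outputs if is_correct(p, o)]
        model = finetune(model, train_set)
        scores.append(evaluate(model))
    # Empirically, `scores` peaks after the first round and then declines:
    # each round narrows the distribution the next round trains on.
    return model, scores
```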

The multiagent finetuning framework (Du et al., 2025) proposes a structural fix: instead of training one model iteratively, train a society of models, each starting from the same base but independently specialized through distinct training data generated via multi-agent interactions. Generation agents produce initial responses; critic agents evaluate and refine them through debate. Each model sees different data because the interactions are role-dependent.
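
The shape of the setup, as a hedged sketch: agent counts, the consensus rule, and helpers like `debate_prompt`, `majority_vote`, `agrees`, and `finetune` are placeholders, not the paper's exact implementation.

```python
import copy

def multiagent_finetune(base_model, prompts, n_agents=3, rounds=5):
    # Every agent starts from the same base weights.
    generators = [copy.deepcopy(base_model) for _ in range(n_agents)]
    critics = [copy.deepcopy(base_model) for _ in range(n_agents)]
    for _ in range(rounds):
        gen_data = [[] for _ in generators]
        critic_data = [[] for _ in critics]
        for p in prompts:
            answers = [g.generate(p) for g in generators]
            # Each critic sees all generations and produces a refined response.
            critiques = [c.generate(debate_prompt(p, answers)) for c in critics]
            final = majority_vote(critiques)
            # Role-dependent data: generators keep their own answers that match
            # the consensus; critics keep the critiques that led to it.
            for i, a in enumerate(answers):
                if agrees(a, final):
                    gen_data[i].append((p, a))
            for i, cr in enumerate(critiques):
                if agrees(cr, final):
                    critic_data[i].append((debate_prompt(p, answers), cr))
        generators = [finetune(g, d) for g, d in zip(generators, gen_data)]
        critics = [finetune(c, d) for c, d in zip(critics, critic_data)]
    return generators, critics
```

The key property is that `gen_data[i]` and `critic_data[i]` come from different roles and different agents, so no two finetuning sets coincide.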

The mechanism works because role specialization prevents convergence to a single mode. When one model is trained to generate and another to critique, their training distributions diverge, maintaining the diversity that single-agent training destroys. The summarization step between debate rounds further helps by eliminating redundant information and retaining critical points — removing summarization hurts performance. Removing critics also degrades output quality, confirming that the evaluative role is load-bearing, not decorative.
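
The summarization step between rounds might look like the following; `summarizer` and the prompt wording are my assumptions, not the paper's prompts:

```python
def debate_round(question, agents, prev_answers, summarizer):
    # Between debate rounds, peers' answers are condensed into a summary
    # rather than concatenated verbatim, dropping redundant information.
    summary = summarizer.generate(
        "Summarize the key points of these answers:\n" + "\n".join(prev_answers)
    )
    prompt = (question
              + "\nSummary of other agents' answers:\n" + summary
              + "\nUsing this summary, revise your answer.")
    return [agent.generate(prompt) for agent in agents]
```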

This connects directly to Does policy entropy collapse limit reasoning performance in RL?: the entropy collapse that limits RL training is mitigated when multiple agents maintain distinct policy distributions. And given that narrow output ranges are themselves a diversity problem (Why do LLMs generate novel ideas from narrow ranges?), preserving diversity at training time through multi-agent specialization could address that output-time problem upstream.
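
A cheap way to see the effect is to measure entropy over sampled final answers, pooled across agents. This stdlib-only proxy is my own illustration, not a metric from the paper:

```python
import math
from collections import Counter

def answer_entropy(samples):
    # Shannon entropy over distinct final answers sampled from one model,
    # or pooled across the specialized agents.
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Single-agent iterative finetuning drives this toward 0 (all samples
# identical); multi-agent specialization keeps the pooled value higher.
```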

The cost is real: multiple model copies for both training and inference. But the finding that single-agent finetuning collapses after one iteration means the choice is not "cheap single-agent" vs "expensive multi-agent" but "one iteration of productive training" vs "sustained improvement across many rounds."


Source: Agents Multi


multi-agent finetuning preserves reasoning diversity by training agents on distinct data and roles — single-agent self-improvement saturates after one iteration