Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

Paper · arXiv 2501.05707 · Published January 10, 2025

Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data. To improve models beyond the training data, recent works have explored how LLMs can be used to generate synthetic data for autonomous self-improvement. However, successive steps of self-improvement can reach a point of diminishing returns. In this work, we propose a complementary approach towards self-improvement where finetuning is applied to a multiagent society of language models. A group of language models, all starting from the same base model, are independently specialized by updating each one using data generated through multiagent interactions among the models. By training each model on independent sets of data, we illustrate how this approach enables specialization across models and diversification over the set of models. As a result, our overall system is able to preserve diverse reasoning chains and autonomously improve over many more rounds of finetuning than single-agent self-improvement methods. We quantitatively illustrate the efficacy of the approach across a wide suite of reasoning tasks.

Approaches that rely on data from frontier models are limited by the inherent quality of those models, preventing the finetuned model from becoming better than the frontier of what the best existing models can accomplish. In addition, such an approach incurs high financial costs.

Within our multiagent set of models, we propose to specialize models into distinct functionalities within the output generation procedure. First, we specialize a set of models as generation agents that produce initial responses to queries. Since initial responses can often be suboptimal, especially for challenging reasoning tasks, we further specialize a set of models as critic agents that evaluate and refine the generations of other models. By combining these distinct models through multiagent debate (Du et al., 2023), we construct a robust feedback loop for generating final responses; experiments with other multiagent methods appear in Appendix D.
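A minimal sketch of one such generation/critic debate round is shown below. The helper `llm_generate(model, prompt)` is hypothetical, standing in for whatever chat-completion call the underlying models expose, and the majority vote at the end is one common way to aggregate debate outputs (it assumes answers have been parsed or normalized), not necessarily the paper's exact aggregation rule.

```python
# Sketch of one round of generation/critic multiagent debate.
# `llm_generate` is a hypothetical helper standing in for any LLM API call;
# `gen_models` and `critic_models` are the specialized model instances.

def llm_generate(model, prompt):
    """Placeholder: call the underlying LLM and return its text response."""
    raise NotImplementedError

def debate_round(query, gen_models, critic_models):
    # 1. Each generation agent independently proposes an initial answer.
    drafts = [llm_generate(m, f"Answer the question: {query}") for m in gen_models]

    # 2. Each critic agent sees the generators' drafts and produces a
    #    refined answer that critiques, corrects, or synthesizes them.
    refined = []
    for critic in critic_models:
        others = "\n\n".join(f"Agent {i}: {d}" for i, d in enumerate(drafts))
        prompt = (
            f"Question: {query}\n"
            f"Here are candidate answers from other agents:\n{others}\n"
            "Critique these answers and give your own improved final answer."
        )
        refined.append(llm_generate(critic, prompt))

    # 3. Aggregate by majority vote over the refined answers (one common
    #    choice; assumes answers are normalized so exact matches are meaningful).
    return max(set(refined), key=refined.count)
```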

By training each model on distinct sets of data and roles, our approach fosters specialization across models and promotes diversification within the society of models. Consequently, our system can autonomously improve over many more rounds of finetuning compared to single-agent self-improvement methods (Figure 1).

Instead of building a single dataset and using it to finetune every model, we propose creating a different dataset for each model.
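A hedged sketch of this dataset construction follows. The filtering rule here, keeping only an agent's own responses that agree with the final consensus answer, is one plausible instantiation of per-agent data selection; the field names and the `extract_answer` parser are hypothetical.

```python
# Sketch of building per-agent finetuning sets from debate transcripts.
# Each agent keeps only the responses *it* produced that agree with the
# final consensus answer, so N agents end up with N distinct datasets.
# `extract_answer` is a hypothetical task-specific answer parser.

def build_agent_datasets(debate_logs, num_agents, extract_answer):
    datasets = [[] for _ in range(num_agents)]
    for log in debate_logs:  # one log per query
        consensus = log["final_answer"]
        for agent_id in range(num_agents):
            response = log["responses"][agent_id]
            # Keep only responses consistent with the consensus, so each
            # agent is reinforced on its own successful outputs.
            if extract_answer(response) == consensus:
                datasets[agent_id].append(
                    {"prompt": log["query"], "completion": response}
                )
    return datasets
```

Because the filtering is done per agent, the N datasets drift apart over successive rounds, which is what drives specialization and preserves diversity across the society of models.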

In contrast, finetuning a single agent ("Single-agent FT"), as described in Section 2.2, shows that performance saturates after one iteration of finetuning and starts dropping afterward, indicating potential overfitting to generated responses. This issue occurs when the single model, after several finetuning cycles, becomes fixated on a small range of responses, which limits its diversity.
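One simple way to observe this collapse (an illustration, not necessarily the paper's exact diversity metric) is to track the average pairwise embedding dissimilarity of the agents' responses to the same query; a collapsing single-agent system drives this toward zero.

```python
# Diversity probe: mean pairwise embedding dissimilarity of responses.
# Higher values indicate more diverse reasoning chains.
import itertools
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def response_diversity(responses):
    embs = _encoder.encode(responses)
    sims = [
        np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        for a, b in itertools.combinations(embs, 2)
    ]
    return 1.0 - float(np.mean(sims))  # 0 = identical responses
```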

Multiagent FT w/o summary removes the summarization step from the multiagent debate. Instead of summarizing, the responses from other agents are directly concatenated and presented to each agent. Summarization helps by eliminating redundant information and retaining the most critical points; therefore, omitting the summarization step can negatively impact performance.
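The sketch below contrasts the two ways of exposing other agents' responses during debate: raw concatenation (the "w/o summary" ablation) versus an LLM-produced summary. It reuses the hypothetical `llm_generate` helper from the earlier sketch.

```python
# Two ways of building the debate context from other agents' responses:
# direct concatenation (the "w/o summary" ablation) vs. a summary that
# strips redundant information and keeps only the critical points.

def debate_context(other_responses, summarizer=None):
    joined = "\n\n".join(
        f"Agent {i}: {r}" for i, r in enumerate(other_responses)
    )
    if summarizer is None:
        # Ablation: hand each agent the raw concatenation of all responses.
        return joined
    # Full method: summarize first, retaining only the key points.
    return llm_generate(
        summarizer,
        f"Summarize the key points of these answers:\n{joined}",
    )
```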

Multiagent FT w/o critic: The critic agents evaluate the outputs from all generation agents and select or synthesize the best responses. Removing the critic agents and finetuning only the N generation agents could hurt performance, as the critic agents play a crucial role in refining the final output.

Single-agent FT involves finetuning only a single LLM as covered in Section 2.2 and using it as an agent in multiagent debate. This approach can easily lead to model collapse, where the agent generates similar responses after finetuning, thereby reducing diversity and hurting performance. Therefore, multiagent finetuning is necessary to maintain high performance in reasoning tasks.

Single-agent FT w/o Debate further eliminates the debate procedure, with the finetuned LLM generating responses directly. As shown in Du et al. (2023), multiagent debate can significantly boost performance, so removing it could lead to a performance drop.

Limitations. In comparison to existing works in single-model finetuning, multiagent finetuning is substantially more expensive at both training and inference time, as multiple copies of a model need to be trained and run. To run multiagent finetuning experiments on open-source models, we used either four H100 GPUs or four A100 GPUs. Models took between 120 and 240 GB of GPU memory, and inference took between 12 and 24 hours across multiple GPUs. To improve the training time of multiagent models, it may be interesting to instead share weights across different instances of models. To improve inference time in multiagent models, we can directly distill the debate procedure into a single model or use quantization as part of finetuning.
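A hedged sketch of what that distillation could look like: collect (query, final debate answer) pairs from the multiagent pipeline and finetune a single student model on them, so one model approximates the debate at inference time. This reuses `debate_round` from the earlier sketch; the paper does not prescribe this exact recipe.

```python
# Sketch: build a distillation set that compresses the debate pipeline
# into supervised (prompt, completion) pairs for a single student model.

def build_distillation_set(queries, gen_models, critic_models):
    data = []
    for q in queries:
        final = debate_round(q, gen_models, critic_models)  # from earlier sketch
        data.append({"prompt": q, "completion": final})
    return data
```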

Conclusion. In this paper, we have introduced a novel multiagent finetuning framework that significantly enhances the performance and diversity of language models. By employing a society of agents with distinct roles, our method effectively improves the feedback mechanism and overall output quality, mitigating the limitations inherent in single-agent self-improvement methods. This system allows for autonomous self-improvement through iterative finetuning, leading to substantial performance gains across a comprehensive suite of reasoning tasks. Importantly, our approach is versatile and can be applied to both open-source and proprietary LLMs, ensuring broad utility and impact.