ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Paper · arXiv 2308.07201 · Published August 14, 2023

Inspired by the practice of multiple human annotators collaborating on an evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies. The multi-agent-based approach enables a group of LLMs to synergize with an array of intelligent counterparts, harnessing their distinct capabilities and expertise to enhance efficiency and effectiveness in handling intricate tasks. In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation (NLG) tasks. We derive insights and lessons from practical scenarios where humans initiate group discussions for brainstorming, and we propose different communication strategies within ChatEval.

Furthermore, we find that diverse role prompts (different personas) are essential in the multi-agent debate process; that is, utilizing the same role description in the prompt can lead to a degradation in performance.

In view of the impressive text understanding and instruction-following capabilities of recent LLMs, a body of literature (Liu et al., 2023b; Chiang & Lee, 2023; Gao et al., 2023; Shen et al., 2023) has adopted LLMs as evaluators to assess the quality of responses to open-ended questions or traditional NLG tasks, including dialogue response generation and summarization. This methodology is dubbed LLM-as-a-judge (Zheng et al., 2023). Findings from these studies indicate that LLMs can mimic human behavior and provide evaluations that correspond with human judgments, revealing a potentially scalable and transparent alternative to costly and laborious human evaluations.

While a single powerful LLM can already tackle a variety of tasks, emerging studies suggest that multiple LLMs can further improve one another through debate and cooperation (Li et al., 2023a; Liang et al., 2023). By incorporating multiple LLMs into an integrated group and designing specific interaction mechanisms, different LLMs can propose and deliberate on unique responses and thought processes across several rounds. This approach leads to enhanced factuality of generated responses (Du et al., 2023) and improvement in the completion of arduous tasks (Li et al., 2023a; Qian et al., 2023). Furthermore, the multi-agent group also mitigates the Degeneration-of-Thought (DOT) problem (Liang et al., 2023).

Debater Agents. Debater agents are one of the most significant components of our framework. We treat each individual LLM as an agent and ask it to generate a response from the given prompt. Responses from the other agents serve as the chat history, which is filled into the corresponding slot of the prompt template. After configuring the agents, we start the group debate, in which each agent autonomously receives responses from the others and, in turn, delivers its own response to them. The whole process requires no human intervention.

Diverse Role Specification. As presented in Section 1, diverse role specification is necessary for the framework as well. Although all the agents share a common prompt template, we substitute the role description slot with diverse role prompts, specifying distinct personalities for different agents. We take inspiration from Wu et al. (2023) and formulate an analogous role description.
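
The shared template with a role description slot and a chat history slot can be illustrated with a minimal sketch. The template wording, the slot names, the persona texts, and the helper `build_prompt` are all hypothetical and chosen only for illustration; they are not the prompt used in this work.

```python
from string import Template

# Hypothetical shared template: wording and slot names are illustrative only.
PROMPT_TEMPLATE = Template(
    "[Question]\n$question\n\n"
    "[Response]\n$response\n\n"
    "[Role]\n$role_description\n\n"
    "[Chat history]\n$chat_history\n\n"
    "Now give your evaluation."
)

# Hypothetical personas: every agent shares the template but gets its own role text.
ROLES = {
    "Critic": "You are a critic who checks the response for flaws.",
    "General Public": "You are a member of the general public interested in the topic.",
}


def build_prompt(role_name, question, response, chat_history):
    """Fill the shared template for one agent.

    chat_history is a list of messages produced by the other agents.
    """
    history_text = "\n".join(chat_history) if chat_history else "(no messages yet)"
    return PROMPT_TEMPLATE.substitute(
        question=question,
        response=response,
        role_description=ROLES[role_name],
        chat_history=history_text,
    )


if __name__ == "__main__":
    print(build_prompt("Critic", "What is 2 + 2?", "It is 4.", []))
```

Only the role description and chat history differ between agents in this sketch; the question and the response under evaluation are the same for every agent.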

Communication Strategy. How to maintain the chat history is another significant issue in ChatEval. We use a more intuitive term, communication strategy, for the way the chat history is maintained. In a nutshell, different communication strategies are different approaches to maintaining and manipulating the agents' chat histories. As shown in Figure 2, we design three communication strategies, described as follows:

One-By-One. During each round of the debate, the debater agents take turns in a set order to generate their responses based on the current observation. When it is a debater agent's turn to respond, we directly concatenate what the other agents have said before it into its chat history slot.
Simultaneous-Talk. As an alternative to the one-by-one strategy, simultaneous-talk prompts the debater agents to asynchronously generate responses in each iteration of the discussion, which nullifies the impact of the speaking order.
Simultaneous-Talk-with-Summarizer. The main difference between this strategy and simultaneous-talk is that we additionally employ another LLM as a summarizer. At the end of each iteration of the debate, we prompt this extra LLM to summarize the messages conveyed so far and concatenate this summary into all debater agents' chat history slots.
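
The three strategies above can be sketched as different ways of updating per-agent chat histories. This is a minimal illustration under stated assumptions, not the implementation used in this work: agents are modeled as plain callables from a chat history to a message, `make_stub_agent` stands in for an LLM call, and the summarizer is assumed to summarize only the messages of the current round. All function names are hypothetical.

```python
from typing import Callable, List, Optional

# Assumption: an agent maps its chat history (list of messages) to a new message.
Agent = Callable[[List[str]], str]


def make_stub_agent(name: str) -> Agent:
    """Stand-in for an LLM call; reports how many messages the agent has seen."""
    return lambda history: f"{name} saw {len(history)}"


def one_by_one(agents: List[Agent], rounds: int) -> List[List[str]]:
    """Agents speak in a fixed order; each message reaches the others immediately."""
    histories: List[List[str]] = [[] for _ in agents]
    for _ in range(rounds):
        for i, agent in enumerate(agents):
            message = agent(histories[i])
            # Later speakers in this round already see this message.
            for j in range(len(agents)):
                if j != i:
                    histories[j].append(message)
    return histories


def simultaneous_talk(
    agents: List[Agent],
    rounds: int,
    summarizer: Optional[Callable[[List[str]], str]] = None,
) -> List[List[str]]:
    """All agents answer from the same snapshot, so speaking order has no effect.

    If a summarizer is given, only its summary of the round is shared
    (the simultaneous-talk-with-summarizer variant, under the assumption above).
    """
    histories: List[List[str]] = [[] for _ in agents]
    for _ in range(rounds):
        # Collect every response before updating any history.
        buffer = [agent(histories[i]) for i, agent in enumerate(agents)]
        shared = [summarizer(buffer)] if summarizer is not None else buffer
        for history in histories:
            history.extend(shared)
    return histories


if __name__ == "__main__":
    team = [make_stub_agent("A"), make_stub_agent("B")]
    print(one_by_one(team, rounds=1))
    print(simultaneous_talk(team, rounds=1))
    print(simultaneous_talk(team, rounds=1, summarizer=lambda m: f"summary of {len(m)}"))
```

In the one-by-one run, the second agent already sees the first agent's message within the same round, whereas in simultaneous-talk both agents respond to an identical history, which is what removes the dependence on speaking order.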