Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation
The emerging “LLM-as-a-judge” paradigm sheds light on a promising approach: leveraging LLM agents to believably simulate human evaluators. Yet, to date, existing LLM-as-a-judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage the agents in in-group debates to generate multi-dimensional feedback.
We evaluate MAJ-EVAL on two challenging domain-specific real-world tasks: (1) question-answer generation (QAG) for children’s storybook reading (Xu et al., 2022) and (2) multi-document summarization of medical literature (DeYoung et al., 2021).
Most notably, they reflect single-model bias, where judgments are constrained by the model’s own training data and reasoning style, and thus may fail to simulate multi-stakeholder perspectives in real-world evaluations (Yao et al., 2024).
To mitigate the limitations of single-LLM evaluation, recent work has extended the paradigm to multi-agent setups, where multiple LLM agents, each adopting a distinct persona or evaluative role, collaborate or debate to arrive at a final assessment (Chen et al., 2023; Zhu et al., 2023). Examples include ChatEval (Chan et al., 2023), which assigns agents to pre-defined roles such as “general public” or “critic,” and MADISSE (Koupaee et al., 2025), which frames evaluation as a debate between agents with opposing initial stances. These systems improve diversity in judgment and better mirror real-world evaluative complexity. However, most of these approaches still rely on manually crafted personas and predefined evaluation dimensions, limiting reproducibility and cross-task generalization (Szymanski et al., 2025; Gebreegziabher et al., 2025). For example, an agent labeled as a “critic” in one task may not exhibit the same evaluative priorities in another, and a dimension like “factual consistency” may not translate well from summarization to dialogue generation.
As shown in Figure 1, MAJ-EVAL enables researchers to evaluate model-generated content by (1) automatically extracting stakeholder perspectives from domain-specific documents and constructing diverse agent personas grounded in those perspectives, and (2) orchestrating in-group debates among these agents to produce final, multidimensional evaluation scores.
3.1 Stakeholder Persona Creation
The first stage of MAJ-EVAL focuses on creating personas that faithfully represent the diverse evaluative dimensions found in real-world stakeholder groups. To ensure both coverage and credibility, persona creation follows a two-step process: (1) extracting evaluative dimensions from research publications, and (2) constructing personas based on those extracted perspectives.
Step 1: Evaluative Dimension Extraction.
Given a list of domain-specific task documents (e.g., research papers) $L = \{l_1, \dots, l_n\}$, MAJ-EVAL uses an LLM $M_\theta$ to identify relevant stakeholders and extract their associated perspectives (i.e., evaluative dimensions). Each document is parsed to locate stakeholders (e.g., “parents,” “clinicians”) and their descriptive attributes (e.g., priorities, values), along with evidence-based evaluation dimensions (e.g., “focus on grammar correctness”). The output for each document $l_i$ is a structured list of stakeholder tuples $s_{ij} = (n_{ij}, c_{ij}, V_{ij})$, where $n_{ij}$ denotes the stakeholder’s name, $c_{ij}$ is their description, and $V_{ij}$ is a set of (dimension, evidence) pairs. For instance, in the task of QAG for children’s story reading, one extracted evaluative dimension of the parents is “Parents expect questions to stimulate creativity, critical thinking, and curiosity rather than factual recall...”, with the evidence of “The majority of participants felt that current AI tools were ‘silly’...” from a paper that explores parents’ expectations and perceptions of AI-assisted reading tools for children (Sun et al., 2024).
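For concreteness, the sketch below shows one way this extraction step could be implemented against an OpenAI-compatible chat API; the prompt wording, JSON schema, and the `Stakeholder`/`extract_stakeholders` names are illustrative assumptions, not MAJ-EVAL’s released prompts.

```python
# Sketch of Step 1 (evaluative-dimension extraction). The prompt text, JSON schema,
# and helper names are illustrative assumptions, not the paper's actual prompts.
import json
from dataclasses import dataclass, field

from openai import OpenAI  # any OpenAI-compatible chat client works here

client = OpenAI()

@dataclass
class Stakeholder:
    name: str                                        # n_ij, e.g., "parents"
    description: str                                 # c_ij, priorities and values
    dimensions: list[tuple[str, str]] = field(default_factory=list)  # V_ij: (dimension, evidence) pairs

EXTRACTION_PROMPT = """You are reading a research paper about {task}.
List every stakeholder the paper discusses, a one-sentence description of their
priorities, and the evaluative dimensions they care about, each paired with a
verbatim evidence quote. Respond with JSON of the form
{{"stakeholders": [{{"name": ..., "description": ..., "dimensions": [[dim, evidence], ...]}}]}}

Paper text:
{document}"""

def extract_stakeholders(document: str, task: str, model: str = "gpt-4o") -> list[Stakeholder]:
    """Parse one document l_i into stakeholder tuples s_ij = (n_ij, c_ij, V_ij)."""
    reply = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(task=task, document=document)}],
    )
    parsed = json.loads(reply.choices[0].message.content)
    return [Stakeholder(s["name"], s["description"],
                        [tuple(d) for d in s.get("dimensions", [])])
            for s in parsed["stakeholders"]]
```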
To unify overlapping roles and ensure coherent persona design, MAJ-EVAL aggregates similar stakeholders into groups using semantic clustering via the LLM $M_\theta$. Within each group, redundant or semantically close dimensions are automatically merged, resulting in a consolidated view of each stakeholder group. For example, education technology developers who emphasize “system usability” and AI developers who promote “system robustness” are grouped under a “system developer” stakeholder group with multiple evaluative dimensions. Following prior work showing that diverse perspectives can enhance the debate process (Liang et al., 2024), MAJ-EVAL retains distinct evaluative dimensions within each group to preserve diversity.
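A minimal sketch of this LLM-driven grouping step follows; the prompt and the `group_stakeholders` helper are again assumptions for illustration, operating on the flat list of stakeholder tuples from Step 1 serialized as JSON.

```python
# Sketch of the LLM-driven grouping/merging step. Prompt wording and the helper
# name are illustrative assumptions, not the paper's released prompts.
import json
from openai import OpenAI

client = OpenAI()

GROUPING_PROMPT = """Below is a JSON list of stakeholders extracted from several papers,
each with a description and (dimension, evidence) pairs. Merge stakeholders that play
the same real-world role into a single group (for example, education technology
developers and AI developers become one "system developer" group), combine
near-duplicate dimensions, but keep genuinely distinct dimensions separate.
Respond with JSON:
{{"groups": [{{"group_name": ..., "description": ..., "dimensions": [[dim, evidence], ...]}}]}}

Stakeholders:
{stakeholders}"""

def group_stakeholders(stakeholders: list[dict], model: str = "gpt-4o") -> list[dict]:
    """Consolidate overlapping stakeholders into groups with merged dimensions."""
    reply = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": GROUPING_PROMPT.format(
                       stakeholders=json.dumps(stakeholders, indent=2))}],
    )
    return json.loads(reply.choices[0].message.content)["groups"]
```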
Step 2: Dimension-Based Persona Construction. For each consolidated dimension within a stakeholder group, MAJ-EVAL constructs a detailed persona $p_{ij} = M_\theta(c_i, v_{ij}, e_{ij})$, where $c_i$ is the group description, $v_{ij}$ the evaluative dimension, and $e_{ij}$ its supporting evidence. Inspired by prior work on LLM-based role-play agents (Chen et al., 2025a), each persona includes five key attributes: (1) demographic information (e.g., name, age, profession), (2) evaluative dimension (from the earlier perspective extraction), (3) domain specialty, (4) psychological traits, and (5) social relationships. These personas serve as the basis for instantiating stakeholder-aligned agents during evaluation. We include examples of constructed personas in Table 10 and the corresponding prompt in Table 13. In addition, Appendix A.7 presents an example of MAJ-EVAL’s complete persona creation workflow.
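The following sketch mirrors this construction, building one persona per (dimension, evidence) pair in a group; the prompt text and field names are assumptions for illustration and do not reproduce the actual prompt in Table 13.

```python
# Sketch of Step 2 (dimension-based persona construction), p_ij = M_theta(c_i, v_ij, e_ij).
# The five persona attributes follow the paper; the prompt text and field names
# are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

PERSONA_PROMPT = """Create a realistic evaluator persona for the stakeholder group below,
centred on one evaluative dimension. Return JSON with exactly these keys:
"demographics" (name, age, profession), "evaluative_dimension", "domain_specialty",
"psychological_traits", "social_relationships".

Stakeholder group description: {description}
Evaluative dimension: {dimension}
Supporting evidence: {evidence}"""

def build_persona(description: str, dimension: str, evidence: str,
                  model: str = "gpt-4o") -> dict:
    """Construct one persona p_ij from the group description c_i and a (v_ij, e_ij) pair."""
    reply = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": PERSONA_PROMPT.format(
            description=description, dimension=dimension, evidence=evidence)}],
    )
    return json.loads(reply.choices[0].message.content)

# One persona per consolidated dimension in each stakeholder group:
# personas = [build_persona(g["description"], dim, ev)
#             for g in groups for dim, ev in g["dimensions"]]
```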
3.2 Multi-Agent-as-Judge Debate Evaluation
In the second stage of MAJ-EVAL, the constructed personas are instantiated as LLM-based agents that engage in a multi-agent-as-judge debate evaluation (Table 14 presents the instantiation prompt). Each stakeholder group (e.g., teachers, clinicians) evaluates model-generated outputs through in-group deliberation (in-group multi-agent free debate), simulating how real-world stakeholders might discuss, disagree, and eventually converge on evaluation judgments. The debate process is divided into three phases: (1) individual agent-as-a-judge evaluation, (2) multi-agent in-group free debate, and (3) aggregation of scores into a final group judgment (see Figure 2).
Phase 1: Individual Agent-as-a-Judge. Each stakeholder agent begins by independently assessing the generated output according to their unique perspective and expertise. This phase aims to capture a diversity of opinions, reflecting how different stakeholders may initially interpret the same content in task-specific ways. The prompt for this phase is presented in Table 15.
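A minimal sketch of this phase, assuming a 1–5 rating scale and a JSON-formatted judgment, is shown below; the prompt text is illustrative and differs from the actual Phase 1 prompt in Table 15.

```python
# Sketch of Phase 1: each persona agent scores the output independently.
# The prompt, rating scale, and field names are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are the evaluator described below. Judge the candidate output
strictly from your own perspective and expertise.

Persona: {persona}
Task input: {task_input}
Candidate output: {candidate}

Return JSON: {{"score": <1-5 integer>, "rationale": "<2-3 sentences>"}}"""

def individual_judgement(persona: dict, task_input: str, candidate: str,
                         model: str = "gpt-4o") -> dict:
    """One agent's independent, pre-debate evaluation."""
    reply = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            persona=json.dumps(persona), task_input=task_input, candidate=candidate)}],
    )
    return json.loads(reply.choices[0].message.content)
```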
Phase 2: Multi-Agent In-Group Free Debate. Next, the agents engage in an open-ended multi-turn debate within each group. Moderated by a coordinating agent, the debate unfolds dynamically, prioritizing agents with unresolved disagreements or unaddressed perspectives. Agents challenge, reflect on, or reinforce each other’s views and revise their evaluations as needed. This phase encourages surfacing blind spots, resolving conflicts, and generating more refined judgments. We include the prompt for phase 2 in Table 16.
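The sketch below illustrates one possible implementation of this phase. For brevity, the coordinating agent is reduced to a heuristic that gives the floor to the judge whose score deviates most from the current group mean, whereas MAJ-EVAL uses an LLM moderator; the prompt and the fixed round budget are likewise assumptions, not the Table 16 prompt.

```python
# Sketch of Phase 2: multi-turn in-group free debate. The LLM moderator is
# simplified here to a "most divergent judge speaks next" heuristic; prompt
# wording and the round budget are illustrative assumptions.
import json
from statistics import mean
from openai import OpenAI

client = OpenAI()

DEBATE_PROMPT = """You are the evaluator described below, debating with the other judges
in your stakeholder group.

Persona: {persona}
Candidate output: {candidate}
Your current judgement: {own}
Debate transcript so far:
{transcript}

Challenge, reflect on, or reinforce the other views, then restate your (possibly revised)
judgement. Return JSON: {{"score": <1-5 integer>, "message": "<your contribution>"}}"""

def debate(personas: list[dict], judgements: list[dict], candidate: str,
           rounds: int = 3, model: str = "gpt-4o") -> list[dict]:
    """Run a free debate within one group and return post-debate judgements."""
    transcript: list[str] = []
    for _ in range(rounds):
        scores = [j["score"] for j in judgements]
        # Heuristic moderator: prioritise the most divergent (unresolved) judge.
        speaker = max(range(len(scores)), key=lambda i: abs(scores[i] - mean(scores)))
        reply = client.chat.completions.create(
            model=model,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": DEBATE_PROMPT.format(
                persona=json.dumps(personas[speaker]), candidate=candidate,
                own=json.dumps(judgements[speaker]),
                transcript="\n".join(transcript) or "(no turns yet)")}],
        )
        turn = json.loads(reply.choices[0].message.content)
        judgements[speaker]["score"] = turn["score"]
        transcript.append(f"Judge {speaker}: {turn['message']}")
    return judgements
```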
Phase 3: Aggregation. Finally, an aggregator agent consolidates the updated evaluations across all agent groups in two ways: (1) synthesizing the qualitative feedback from all stakeholder agents’ final evaluations and (2) computing an average score of each group’s post-debate quantitative ratings. Table 17 shows the prompt for this phase.
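A sketch of this final phase, combining per-group mean scores with an LLM-generated synthesis of the qualitative feedback, is given below; the prompt and function names are illustrative, not the Table 17 prompt.

```python
# Sketch of Phase 3: aggregation. Group-level scores are simple means of the
# post-debate ratings; the qualitative synthesis is a final LLM call. Prompt
# wording and function names are illustrative assumptions.
import json
from statistics import mean
from openai import OpenAI

client = OpenAI()

SYNTHESIS_PROMPT = """Summarise the final evaluations from all stakeholder judges into a
single coherent piece of qualitative feedback, preserving points of agreement and any
remaining disagreements.

Final evaluations (JSON): {evaluations}"""

def aggregate(group_judgements: dict[str, list[dict]], model: str = "gpt-4o") -> dict:
    """Return per-group mean scores plus synthesized qualitative feedback."""
    group_scores = {group: mean(j["score"] for j in judgements)
                    for group, judgements in group_judgements.items()}
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SYNTHESIS_PROMPT.format(
            evaluations=json.dumps(group_judgements))}],
    )
    return {"scores": group_scores, "feedback": reply.choices[0].message.content}
```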