Can smaller models in panels outperform a single large judge?
Does replacing one large language model judge with a diverse panel of smaller models improve evaluation quality while reducing cost and bias? This matters because LLM-based evaluation is widespread but expensive and prone to family-specific bias.
LLM-as-judge evaluations usually rely on a single large model such as GPT-4, which is costly and introduces intra-model bias: the judge favors outputs from its own model family. PoLL (Panel of LLm evaluators) instead uses a larger number of smaller models drawn from disjoint model families and aggregates their votes. Across three judge settings and six datasets, PoLL outperforms a single large judge, exhibits less intra-model bias by construction (no single family dominates the panel), and is over seven times cheaper. A key supporting finding: no single judge is "best" across all settings, whereas the panel performs consistently well.
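The aggregation step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the note only says that votes are aggregated, so majority voting (for categorical verdicts) and mean pooling (for numeric ratings) are assumed here as two plausible rules, and the judge names are hypothetical placeholders for one small model per family.

```python
from collections import Counter
from statistics import mean


def majority_vote(verdicts):
    """Return the most common verdict among the panel's judges.

    Ties are broken by first-seen order, so an odd-sized panel is safer.
    """
    if not verdicts:
        raise ValueError("panel returned no verdicts")
    return Counter(verdicts).most_common(1)[0][0]


def mean_pool(scores):
    """Average the numeric scores given by the panel's judges."""
    if not scores:
        raise ValueError("panel returned no scores")
    return mean(scores)


# Hypothetical judges, one small model per family (names are not from the source).
binary_verdicts = {
    "family_a_small": "correct",
    "family_b_small": "incorrect",
    "family_c_small": "correct",
}
panel_verdict = majority_vote(list(binary_verdicts.values()))

# The same hypothetical panel, this time giving numeric ratings.
rating_scores = {"family_a_small": 4, "family_b_small": 5, "family_c_small": 3}
panel_score = mean_pool(list(rating_scores.values()))

print(panel_verdict, panel_score)
```

Because each judge contributes one equal vote, no single family's preference can decide the outcome alone, which is the sense in which the panel limits intra-model bias by construction.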
The keeper is the ensemble logic applied to evaluation: diversity across model families dilutes family-specific bias the way a jury's composition guards against any one juror's prejudice, and smaller-but-many beats larger-but-one on both cost and fairness.
This sits in the vault's evaluation/LLM-judge thread. It is a direct mitigation for "Can LLM judges be fooled by fake credentials and formatting?" and "Do LLM judges systematically favor LLM-generated arguments?" (disjoint-family panels dilute family-specific bias), and it complements the human-preference pole of "Can crowdsourced votes reliably rank language models?" with an automated multi-judge alternative.
Related concepts in this collection 3
- Can LLM judges be fooled by fake credentials and formatting?
  Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.
  PoLL's disjoint-family panel dilutes the single-judge biases this documents
- Do LLM judges systematically favor LLM-generated arguments?
  When LLMs evaluate debates between human and AI-written arguments, do they show a built-in preference for AI writing? This matters because it could corrupt feedback loops used to train models.
  intra-model preference bias PoLL is designed to reduce
- Can crowdsourced votes reliably rank language models?
  Explores whether large-scale human preference voting from casual users produces valid model rankings comparable to expert judgment, and what makes such crowdsourced evaluation trustworthy at scale.
  human-preference evaluation pole; PoLL is the automated multi-judge counterpart
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
- Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
- The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
- Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- Comparing Human and AI Therapists in Behavioral Activation for Depression: Cross-Sectional Questionnaire Study
- Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Original note title
a panel of smaller LLM judges beats a single large judge with less intra-model bias at far lower cost