SYNTHESIS NOTE

Can smaller models in panels outperform a single large judge?

Does replacing one large language model judge with a diverse panel of smaller models improve evaluation quality while reducing cost and bias? This matters because LLM-based evaluation is widespread but suffers from expense and family-specific bias.

Synthesis note · 2026-06-03 · sourced from Evaluations

LLM-as-judge evaluations usually lean on a single large model such as GPT-4, which is costly and introduces intra-model bias (the judge favors outputs from its own family). PoLL (Panel of LLm evaluators) replaces that single judge with a larger number of smaller models drawn from disjoint model families and aggregates their votes. Across three judge settings and six datasets, PoLL outperforms a single large judge, exhibits less intra-model bias by construction (no single family dominates), and is over seven times cheaper. A key supporting finding: no single judge is "best" across all settings, but the panel performs consistently well.
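The aggregation step can be illustrated with a minimal sketch. It assumes simple majority voting over categorical verdicts, which is only one possible aggregation rule. The `Judge` interface, the `panel_vote` helper, and the toy judges are hypothetical stand-ins for real model calls and are not taken from the PoLL work.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical interface: a judge maps (question, answer) to a verdict label.
Judge = Callable[[str, str], str]


def panel_vote(judges: Sequence[Judge], question: str, answer: str) -> str:
    """Return the majority verdict; ties go to the tied verdict cast first."""
    verdicts = [judge(question, answer) for judge in judges]
    counts = Counter(verdicts)
    top = max(counts.values())
    # Walk verdicts in panel order so tie-breaking is deterministic.
    return next(v for v in verdicts if counts[v] == top)


def biased_judge(question: str, answer: str) -> str:
    # Toy stand-in for intra-model bias: approves anything from "family-a".
    return "correct" if "family-a" in answer else "incorrect"


def strict_judge(question: str, answer: str) -> str:
    # Toy stand-in for a judge from another family: checks the answer itself.
    return "correct" if answer.startswith("4") else "incorrect"


# Assumed panel: one self-preferring judge plus two judges from other families.
panel = [biased_judge, strict_judge, strict_judge]

print(panel_vote(panel, "2+2?", "5 (family-a)"))
print(panel_vote(panel, "2+2?", "4 (family-b)"))
```

In this toy setup the self-preferring judge is outvoted in both cases, which is the intuition behind drawing panel members from disjoint families. Real panels would call actual models and might average scores instead of voting.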

The keeper is the ensemble logic applied to evaluation: diversity across model families cancels family-specific bias the way a jury's composition guards against any one juror's prejudice, so smaller-but-many beats larger-but-one on both cost and fairness.

This sits in the vault's evaluation/LLM-judge thread. It is a direct mitigation for "Can LLM judges be fooled by fake credentials and formatting?" and "Do LLM judges systematically favor LLM-generated arguments?" (disjoint-family panels dilute family-specific bias), and it complements the human-preference pole of "Can crowdsourced votes reliably rank language models?" with an automated multi-judge alternative.

Original note title

a panel of smaller LLM judges beats a single large judge with less intra-model bias at far lower cost