Do LLM judges systematically favor LLM-generated arguments?
When LLMs evaluate debates between human-written and AI-written arguments, do they show a built-in preference for AI writing? The question matters because such a preference could corrupt the feedback loops used to train models.
When LLM judges were asked to score the same debates that human annotators scored, they picked the LLM-authored argument as winner 62% of the time on average. Human annotators split 39% human / 37% LLM, with 24% draws. GPT-4o, the most accurate of the LLM judges, still picked the LLM 55% of the time versus humans' 37%, and produced only 2% draws against humans' 24%. This is a same-kind-prefers-same-kind bias of substantial magnitude, layered on top of the four judge biases already catalogued elsewhere.
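For concreteness, a minimal sketch of the preference gap those percentages imply, treating the reported rates as point estimates (sample sizes and draw handling live in the paper, not here):

```python
# Same-author preference gap implied by the reported winner-pick rates.
HUMAN_LLM_WIN = 0.37   # human annotators picked the LLM-authored side
LLM_JUDGE_AVG = 0.62   # LLM judges, averaged, picked the LLM-authored side
GPT4O_LLM_WIN = 0.55   # GPT-4o, the most accurate LLM judge

def preference_gap(judge_rate: float, human_rate: float = HUMAN_LLM_WIN) -> float:
    """How much more often a judge picks the LLM side than humans do."""
    return judge_rate - human_rate

print(f"LLM-judge average gap: {preference_gap(LLM_JUDGE_AVG):+.0%}")  # +25%
print(f"GPT-4o gap:            {preference_gap(GPT4O_LLM_WIN):+.0%}")  # +18%
```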
The tension: this bias bites every pipeline that uses LLMs to evaluate LLM output. Automated debate-quality scoring, reinforcement learning from AI feedback (RLAIF) loops, self-evaluation regimes, and multi-agent debate frameworks that score each other's contributions all inherit it. The result is a calibration ceiling: an evaluation pipeline whose output systematically over-credits LLM-authored arguments produces feedback signals that train models to produce more of what LLM judges over-credit, in a closed loop.
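A toy simulation, under loudly labeled assumptions (the fixed bias term and the hill-climbing update are illustrative, not the paper's setup), shows how a constant judge-side over-credit drags a policy toward the judge-preferred style even when style is independent of true quality:

```python
# Toy illustration of the calibration-ceiling argument, NOT the paper's
# experiment: true quality is identically distributed for both variants,
# but the judge adds a fixed bias to the styled one.
import random

random.seed(0)
BIAS = 0.25    # judge's over-credit for LLM-style output (gap estimated above)
NOISE = 0.10   # per-comparison noise in judged quality

style_fraction = 0.5  # share of outputs in the judge-preferred style
for step in range(8):
    reward_plain = random.gauss(0.0, NOISE)          # true quality: same distribution
    reward_styled = random.gauss(0.0, NOISE) + BIAS  # plus the judge's bias
    # Naive hill-climbing: shift toward whichever variant the judge scored higher.
    style_fraction += 0.1 if reward_styled > reward_plain else -0.1
    style_fraction = min(max(style_fraction, 0.0), 1.0)
    print(f"step {step}: preferred-style fraction = {style_fraction:.1f}")
```

With a bias of 0.25 against noise of 0.10, the styled variant wins almost every comparison, so the policy drifts to producing the judge-preferred style essentially always within a few steps.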
The same-author preference sharpens "Can LLM judges be fooled by fake credentials and formatting?". The four catalogued biases are exploitable by adversaries; same-author preference is a structural bias that needs no adversary. It activates whenever LLM-authored content is in the evaluation pool, which is to say, in every contemporary RLAIF pipeline.
It also bears on "When does debate actually improve reasoning accuracy?". The Thin Line evidence shows the judge-side mechanism for that amplification: when contested-domain arguments are scored by LLM judges, the LLM-authored arguments win disproportionately, regardless of substantive merit. Multi-agent debate frameworks that close the loop with LLM judges are not just amplifying errors; they are amplifying their own preferred argument style.
The internal-consistency finding compounds the problem: humans chose a winner consistent with their own argument-strength scores 73% of the time; the LLM average was 55%. Even when a model assigned high strength scores to the human argument, it would often pick the LLM as winner anyway. The bias operates at the winner-selection step, downstream of component-level scoring.
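A minimal sketch of that consistency metric, under an assumed record schema (per-side strength scores plus a winner label; the paper's exact format may differ):

```python
# Consistency: does the declared winner match the side the same judge
# scored higher on argument strength? Ties in strength are excluded.
def consistency_rate(judgments: list[dict]) -> float:
    scored = [j for j in judgments if j["human_score"] != j["llm_score"]]
    consistent = sum(
        1 for j in scored
        if j["winner"] == ("human" if j["human_score"] > j["llm_score"] else "llm")
    )
    return consistent / len(scored)

# The failure pattern described above: human argument scored stronger,
# LLM declared the winner anyway.
sample = [
    {"human_score": 8, "llm_score": 6, "winner": "llm"},    # inconsistent
    {"human_score": 5, "llm_score": 7, "winner": "llm"},    # consistent
    {"human_score": 9, "llm_score": 4, "winner": "human"},  # consistent
]
print(f"consistency: {consistency_rate(sample):.0%}")  # 67%
```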
The operational implication for evaluation infrastructure: LLM-judge evaluation of LLM output is not a substitute for human evaluation. Where LLM-as-judge pipelines are unavoidable, they need calibration corrections derived from human-labeled validation sets, applied per task.
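One way such a correction could look, as a hedged sketch: assume the judge emits a probability that the LLM-authored side won, estimate each task's over-credit on a human-labeled validation set, and subtract it. The field names and the simple additive offset are illustrative assumptions, not the paper's method.

```python
from collections import defaultdict

def fit_offsets(validation: list[dict]) -> dict[str, float]:
    """Per-task offset: judge's mean P(LLM wins) minus the human-labeled
    LLM-win rate on the same debates."""
    by_task: dict[str, list[dict]] = defaultdict(list)
    for row in validation:
        by_task[row["task"]].append(row)
    return {
        task: sum(r["judge_p_llm"] for r in rows) / len(rows)
              - sum(r["human_llm_won"] for r in rows) / len(rows)
        for task, rows in by_task.items()
    }

def corrected_score(judge_p_llm: float, task: str, offsets: dict[str, float]) -> float:
    """Subtract the task's estimated over-credit, clamped to [0, 1]."""
    return min(max(judge_p_llm - offsets.get(task, 0.0), 0.0), 1.0)

# Hypothetical validation rows: judge says 0.65-0.70, humans split 50/50.
val = [
    {"task": "policy", "judge_p_llm": 0.70, "human_llm_won": 0},
    {"task": "policy", "judge_p_llm": 0.60, "human_llm_won": 1},
]
offsets = fit_offsets(val)
print(corrected_score(0.65, "policy", offsets))  # 0.65 - 0.15 = 0.50
```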
Source: The Thin Line Between Comprehension and Persuasion in LLMs (argumentation paper)
Related concepts in this collection
- Can LLM judges be fooled by fake credentials and formatting? Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge. Relation: a fifth bias to add, structural rather than adversarial.
- When does debate actually improve reasoning accuracy? Multi-agent debate shows promise for reasoning tasks, but under what conditions does it help versus hurt? The research explores whether debate amplifies errors when evidence verification is missing. Relation: the judge-side mechanism for the amplification.
Original note title: "LLMs-as-judges systematically prefer LLM-generated arguments over human ones — biasing any AI-evaluated debate pipeline"