Language Understanding and Pragmatics

Do LLM judges systematically favor LLM-generated arguments?

When LLMs evaluate debates between human and AI-written arguments, do they show a built-in preference for AI writing? This matters because it could corrupt feedback loops used to train models.

Note · 2026-05-02 · sourced from Argumentation

When LLM judges scored the same debates that human annotators scored, they picked the LLM as winner 62% of the time on average. Humans split 39% human / 37% LLM, with 24% draws. GPT-4o, the most accurate of the LLM judges, still picked the LLM 55% of the time versus humans' 37%, and produced only 2% draws to humans' 24%. This is a same-kind-prefers-same-kind bias of substantial magnitude, layered on top of the four judge biases already catalogued elsewhere.
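These verdict distributions come straight from tallying per-debate labels. A minimal sketch, with hypothetical verdict lists standing in for the paper's actual annotations:

```python
from collections import Counter

def verdict_rates(verdicts):
    """Fraction of debates assigned to each outcome by one judge.

    `verdicts` is a list of labels: "llm", "human", or "draw".
    """
    counts = Counter(verdicts)
    n = len(verdicts)
    return {k: counts.get(k, 0) / n for k in ("llm", "human", "draw")}

# Hypothetical verdicts on the same 10 debates from two judges:
llm_judge = ["llm"] * 6 + ["human"] * 4                # no draws at all
human_panel = ["human"] * 4 + ["llm"] * 4 + ["draw"] * 2

print(verdict_rates(llm_judge))    # LLM-skewed, zero draws
print(verdict_rates(human_panel))  # near-even split, with draws
```

Comparing the two dictionaries side by side surfaces both findings at once: the winner-rate gap and the near-absence of draws under the LLM judge.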

This is a tension because it bites every pipeline that uses LLMs to evaluate LLM output. Automated debate-quality scoring, reinforcement-learning-from-AI-feedback (RLAIF) loops, self-evaluation regimes, multi-agent debate frameworks that score each other's contributions: all inherit the bias. The result is a calibration ceiling: an evaluation pipeline whose output systematically over-credits LLM-authored arguments produces feedback signals that train models to produce more of what LLM judges over-credit, in a closed loop.
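The closed-loop dynamic can be sketched with a replicator-style toy model. Everything here is an illustrative assumption, not from the paper: the judge over-credits one argument style by a fixed bias, and each training round shifts the model's output mix toward whatever the judge rewards.

```python
def feedback_loop(style_share, judge_bias, lr=0.5, rounds=8):
    """Toy replicator dynamic: each round moves the model's output
    mix toward the style the judge rewards.

    style_share: fraction of outputs in the judge-favored style.
    judge_bias:  how much the judge over-credits that style beyond
                 its true merit (0.0 = unbiased judge).
    """
    history = [style_share]
    for _ in range(rounds):
        reward = 0.5 + judge_bias       # perceived win rate of the style
        advantage = 2 * reward - 1      # zero when the judge is unbiased
        style_share += lr * style_share * (1 - style_share) * advantage
        history.append(style_share)
    return history

# An unbiased judge leaves the mix alone; a biased one ratchets it up.
print(feedback_loop(0.5, 0.0)[-1])   # stays at 0.5
print(feedback_loop(0.5, 0.12)[-1])  # drifts above 0.5
```

The point of the toy is only directional: any positive judge bias, however small, makes the favored style's share monotonically grow under this kind of update.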

This sharpens "Can LLM judges be fooled by fake credentials and formatting?". The four catalogued biases are exploitable by adversaries; same-author preference is a structural bias that needs no adversary. It activates whenever LLM-authored content is in the evaluation pool, which is to say, in every contemporary RLAIF pipeline.

It also bears on "When does debate actually improve reasoning accuracy?". The Thin Line evidence shows the judge-side mechanism for that amplification: when contested-domain arguments are scored by LLM judges, the LLM-authored arguments win disproportionately, regardless of substantive merit. Multi-agent debate frameworks that close the loop with LLM judges are not just amplifying errors; they are amplifying their own preferred argument style.

The internal-consistency finding compounds the problem: humans' consistency between argument-strength scores and chosen winner was 73%; the LLM average was 55%. Even when the model assigned high strength scores to a human argument, it would often pick the LLM as winner anyway. The bias operates at the winner-selection step, downstream of component-level scoring.
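The consistency metric itself is simple to compute. A sketch with hypothetical records (the scores and winners below are invented to show the failure mode, not taken from the paper):

```python
def consistency(records):
    """Share of verdicts where the named winner matches the side
    the judge itself scored higher on argument strength (ties skipped).

    Each record: (human_score, llm_score, winner), winner in
    {"human", "llm"}.
    """
    matched = total = 0
    for human_score, llm_score, winner in records:
        if human_score == llm_score:
            continue  # scores imply no winner; nothing to check
        total += 1
        implied = "human" if human_score > llm_score else "llm"
        matched += implied == winner
    return matched / total if total else float("nan")

# The failure mode described above: the judge scores the human
# argument 8 vs 6, then names the LLM the winner anyway.
records = [(8, 6, "llm"), (8, 6, "human"), (5, 9, "llm")]
print(consistency(records))  # 2 of 3 verdicts internally consistent
```

Run over a judge's full output, a consistency well below 100% localizes the bias exactly where the note says: at winner selection, after component scoring.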

For writing about evaluation infrastructure, the operational implication: evaluation by LLM judges of LLM output is not a substitute for human evaluation. Where LLM-as-judge pipelines are unavoidable, they need calibration corrections derived from human-labeled validation sets, applied per-task.
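One minimal form such a calibration correction could take — the function names, the margin convention, and all numbers below are illustrative assumptions, not the paper's method — is to estimate the judge's average pro-LLM margin offset on a human-labeled validation set for a task, then subtract it before declaring winners on that task:

```python
def fit_offset(val_margins, val_labels):
    """Estimate a judge's systematic pro-LLM offset on one task.

    val_margins: judge's score margin (llm_score - human_score)
                 per validation debate.
    val_labels:  human-labeled margin for the same debates.
    Returns the mean amount by which the judge's margin runs
    hotter for the LLM side than the human-labeled one.
    """
    diffs = [m - y for m, y in zip(val_margins, val_labels)]
    return sum(diffs) / len(diffs)

def corrected_margin(judge_margin, offset):
    """Apply the per-task correction before picking a winner."""
    return judge_margin - offset

# Hypothetical task: the judge's margins run ~1.7 points hotter
# for the LLM side than the human labels say they should.
offset = fit_offset(val_margins=[2.0, 1.5, 3.0], val_labels=[0.0, 0.5, 1.0])
print(corrected_margin(2.0, offset))  # a pro-LLM verdict shrinks toward neutral
```

The per-task qualifier matters: an offset fitted on one domain's validation set has no reason to transfer to another, so each evaluation task needs its own human-labeled slice.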


Source: The Thin Line Between Comprehension and Persuasion in LLMs (Argumentation)
