Language Understanding and Pragmatics · Psychology and Social Cognition

Can LLM judges be fooled by fake credentials and formatting?

Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.

Note · 2026-02-22 · sourced from Reasoning by Reflection
What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

"Humans or LLMs as the Judge" documents four evaluation biases through a reference-free intervention framework:

  1. Misinformation Oversight Bias — overlooking factual errors in an argument
  2. Gender Bias — ignoring gender-biased content
  3. Authority Bias — attributing greater credibility to statements by perceived authorities
  4. Beauty Bias — preferring visually rich formatting over plain text

All LLM judges show all four biases. Human judges show misinformation oversight and beauty bias but NOT gender bias — a meaningful divergence suggesting LLMs acquire gendered associations from training data that human evaluators have learned to suppress.

Authority and beauty biases are the most dangerous from a systems perspective: they are semantics-agnostic. They respond to presentation properties unrelated to the content's correctness. This makes them trivially exploitable: adding fake academic references (authority bias) or enriching formatting (beauty bias) attacks the judge without requiring any knowledge of the model's training distribution or decision boundaries. These are zero-shot prompt attacks requiring no optimization.
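A minimal sketch of such a probe, assuming a hypothetical `call_judge(question, answer) -> score` wrapper around whatever judge is under test: the same answer is scored plain, with fabricated citations appended, and with rich formatting, so any score gain is presentation bias rather than quality.

```python
# Minimal sketch of a presentation-layer probe. `call_judge` is a hypothetical
# stand-in for the judge endpoint being tested; it returns a quality score.
from typing import Callable, Dict


def authority_variant(answer: str) -> str:
    """Append official-looking but fabricated citations; the claims are unchanged."""
    return answer + "\n\nReferences: [1] Smith et al., Nature, 2023. [2] Lee & Park, Science, 2024."


def beauty_variant(answer: str) -> str:
    """Re-dress the same sentences with a header and bold bullet points."""
    sentences = [s.strip() for s in answer.split(". ") if s.strip()]
    return "## Answer\n\n" + "\n".join(f"- **{s}**" for s in sentences)


def probe_judge(call_judge: Callable[[str, str], float],
                question: str, answer: str) -> Dict[str, float]:
    """Score identical content under three presentations; any gain is presentation bias."""
    base = call_judge(question, answer)
    return {
        "plain": base,
        "authority_gain": call_judge(question, authority_variant(answer)) - base,
        "beauty_gain": call_judge(question, beauty_variant(answer)) - base,
    }
```

No gradient access or model internals are needed; the probe is the zero-shot attack itself, run as a diagnostic.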

The practical consequence for AI benchmarking is serious. Benchmark reliability depends on evaluation systems, and increasingly on LLM judges. If those judges are systematically biased by authority signals and presentation quality, benchmark results do not measure what they claim to measure. Optimizing for benchmark performance may mean optimizing for authority-signaling formatting rather than capability.

The self-referential loop compounds this: LLMs are often graded by other LLMs, creating a closed evaluation circuit where the same biases appear on both sides.

Causal reward modeling identifies four complementary bias types. The Causal Reward Model (CRM) paper taxonomizes the biases that reward hacking exploits: length bias (longer = better), sycophancy bias (agreement = better), concept bias (unintended prediction shortcuts), and discrimination bias (demographic group preferences). All four stem from spurious correlations that standard Bradley-Terry training permits because responses dominate the reward signal, so the model never needs to check prompt relevance. CRM's fix is counterfactual invariance: ensuring reward predictions stay consistent when irrelevant variables are altered, which addresses the causal root rather than individual symptoms. This connects to Do reward models actually consider what the prompt asks? and Can counterfactual invariance eliminate reward hacking biases?.
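A simplified sketch of the counterfactual-invariance idea (an illustration under assumed names, not the CRM paper's exact objective): a standard Bradley-Terry preference loss plus a penalty tying each response's reward to that of a counterfactual variant differing only in an irrelevant attribute. `reward_model`, `chosen_cf`, and `rejected_cf` are hypothetical.

```python
# Illustration of counterfactual invariance for reward modeling (simplified,
# not the CRM paper's exact objective). reward_model maps (prompt, response)
# to a scalar; *_cf are counterfactual responses that differ only in an
# irrelevant attribute (padded length, swapped demographic term, reformatting).
import torch
import torch.nn.functional as F


def invariant_preference_loss(reward_model, prompt,
                              chosen, rejected,
                              chosen_cf, rejected_cf,
                              lam: float = 1.0) -> torch.Tensor:
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)

    # Standard Bradley-Terry preference loss: rank the chosen response higher.
    bt_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Invariance penalty: the reward must not move when only an irrelevant
    # variable changes, cutting off length/sycophancy/concept/discrimination
    # shortcuts at their shared causal root.
    inv_penalty = ((r_chosen - reward_model(prompt, chosen_cf)) ** 2 +
                   (r_rejected - reward_model(prompt, rejected_cf)) ** 2).mean()

    return bt_loss + lam * inv_penalty
```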

Connects to Why do reasoning models fail under manipulative prompts?: both document adversarial attack surfaces on LLMs, and evaluation systems are just as vulnerable to presentation-layer manipulation as reasoning systems are. The four biases also compound with another failure mode when judges attempt personalized evaluation: as Why do LLM judges fail at predicting sparse user preferences? shows, persona sparsity adds insufficient input information as a failure mode beyond adversarial exploitation; judges fail even without an attack when persona data is too sparse to constrain prediction.

The Overconfidence Phenomenon compounds these biases. "Overconfidence in LLM-as-a-Judge" (2025) introduces TH-Score, which measures confidence-accuracy alignment, and finds that state-of-the-art LLMs exhibit pervasive overconfidence: predicted confidence significantly overstates actual correctness. LLM-as-a-Fuser, an ensemble framework, substantially improves calibration. The overconfidence finding means judge biases are not just exploitable but confidently exploitable; the judge is wrong AND certain about it. A simple check of this confidence-accuracy gap is sketched below.

Additionally, adversarial PDF manipulation of LLM reviewers (2025) demonstrates 15 attack strategies across three classes: cognitive obfuscation (base64 encoding, esoteric symbols), teleological deception (scenario nesting, template filling), and epistemic fabrication (fake citations, authority endorsement). These attacks flip reject-to-accept decisions even in GPT-5. The "Maximum Mark Magyk" attack exploits tokenization vulnerabilities through intentional misspellings. Source: Arxiv/Evaluations.
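A rough way to check this property on judge outputs, assuming each verdict comes with a stated confidence in [0, 1]. This is a generic calibration check, not the paper's TH-Score definition:

```python
# Generic confidence-accuracy check for judge verdicts; not the paper's
# TH-Score formula, just the same symptom measured two simple ways.
import numpy as np


def overconfidence_report(confidences, correct, n_bins: int = 10):
    """Return (mean confidence minus accuracy, ECE-style binned calibration error)."""
    conf = np.asarray(confidences, dtype=float)  # judge's stated confidence in [0, 1]
    acc = np.asarray(correct, dtype=float)       # 1 if the verdict was right, else 0

    gap = conf.mean() - acc.mean()               # > 0 means the judge is overconfident overall

    bin_ids = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            ece += in_bin.mean() * abs(conf[in_bin].mean() - acc[in_bin].mean())
    return gap, ece
```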


Source: Reasoning by Reflection, Reward Models

Original note title: LLM judges are susceptible to four exploitable biases that enable zero-shot prompt attacks bypassing semantic content evaluation