Why does hypothesis attestation bias exist separately from frequency bias in NLI?
This explores why LLMs doing natural language inference (NLI) seem to judge whether a conclusion follows not by checking the logic, but by recalling whether they've 'seen' that conclusion before — and why that memorization habit looks like its own distinct failure rather than just a side effect of common phrases showing up a lot.
Attestation bias is the habit of calling something 'entailed' simply because the hypothesis looks familiar from training, and it behaves like a separate failure from raw frequency bias in NLI. The cleanest evidence comes from McKenna et al. (Do LLMs predict entailment based on what they memorized?): when you swap in a *random* premise that has nothing to do with the hypothesis, models still predict entailment as long as the hypothesis itself appears 'attested' in training data. That random-premise trick is the tell. If the bias were purely about frequency, you'd expect it to track how often a phrasing occurs; instead it tracks whether the proposition was *encountered as true*. The model is answering 'does this sound like something I've learned?' rather than 'does the premise support this?' The premise-hypothesis relationship, which is the entire point of inference, drops out.
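The random-premise check described above can be sketched in a few lines. This is a minimal illustration, not McKenna et al.'s actual protocol: `NLIItem`, `random_premise_entailment_rates`, the `attested` flag, and the two toy predictors are all hypothetical names introduced here, and the sketch assumes that some external procedure has already labeled each hypothesis as attested or not.

```python
import random
from typing import Callable, Dict, List, NamedTuple


class NLIItem(NamedTuple):
    premise: str
    hypothesis: str
    attested: bool  # assumption: supplied by some external attestation check


def random_premise_entailment_rates(
    items: List[NLIItem],
    predict_entailment: Callable[[str, str], bool],
    seed: int = 0,
) -> Dict[bool, float]:
    """Pair each hypothesis with a premise drawn from a different item and
    return the entailment rate, keyed by whether the hypothesis is attested."""
    if len(items) < 2:
        raise ValueError("need at least two items to swap premises")
    rng = random.Random(seed)
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for i, item in enumerate(items):
        # Draw the premise from any other item, so it is unrelated to the
        # hypothesis (assuming the items are about different things).
        j = rng.choice([k for k in range(len(items)) if k != i])
        predicted = predict_entailment(items[j].premise, item.hypothesis)
        hits[item.attested] += int(predicted)
        totals[item.attested] += 1
    return {
        group: hits[group] / totals[group]
        for group in (True, False)
        if totals[group] > 0
    }


# Toy data and toy "models" (hypothetical, for illustration only).
ITEMS = [
    NLIItem("Ana bought a ticket to Lima.", "Ana paid for a ticket.", False),
    NLIItem("Paris, the capital of France, hosted the summit.",
            "Paris is the capital of France.", True),
    NLIItem("Mount Everest, the tallest mountain on Earth, was climbed again.",
            "Mount Everest is the tallest mountain on Earth.", True),
    NLIItem("The dog chased the ball across the yard.", "The dog moved.", False),
]
ATTESTED = {item.hypothesis for item in ITEMS if item.attested}
ORIGINAL_PAIRS = {(item.premise, item.hypothesis) for item in ITEMS}


def memorizer(premise: str, hypothesis: str) -> bool:
    # Ignores the premise entirely: the pattern attestation bias describes.
    return hypothesis in ATTESTED


def premise_checker(premise: str, hypothesis: str) -> bool:
    # Says "entailed" only when the premise is the one the hypothesis came with.
    return (premise, hypothesis) in ORIGINAL_PAIRS


if __name__ == "__main__":
    print(random_premise_entailment_rates(ITEMS, memorizer))
    print(random_premise_entailment_rates(ITEMS, premise_checker))
```

With these toy predictors, the memorizer keeps a rate of 1.0 on attested hypotheses and 0.0 on unattested ones even though every premise is wrong, while the premise checker drops to 0.0 on both groups. A gap between the attested and unattested groups under random premises is the signature the paragraph describes; a real study would substitute an actual model call for `predict_entailment`.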
Why would memorized truth and surface frequency come apart? Because the corpus suggests LLMs reason largely over meaning rather than over logical form. When semantic content is stripped out and only the rules remain, performance collapses, even with the correct rules supplied in context (Do large language models reason symbolically or semantically?). So inference isn't a structural operation that frequency merely nudges; it behaves more like a semantic-association lookup. Attestation is what that lookup retrieves: a stored judgment about a specific proposition's truth, which is a different object from how common its words are.
The deeper reason these biases live in different places is architectural. Work on content effects shows that for transformers, semantic content and logical form aren't separable channels: models reproduce human belief-bias signatures item-by-item across NLI, syllogisms, and Wason tasks (Do language models show the same content effects humans do?). Believability of the conclusion and validity of the argument are entangled in the same representation. Attestation bias is essentially believability bias with a memory address: 'I have this proposition filed as true' overrides 'the premise in front of me doesn't license it.' That's also why prompting alone rarely fixes it. When prior training associations are strong, parametric knowledge dominates the actual context, and only intervening in the representations shifts the behavior (Why do language models ignore information in their context?).
There's a useful provenance clue too: these tendencies are largely planted in pretraining, not instruction tuning. Models sharing a pretrained backbone show similar bias patterns regardless of finetuning data; finetuning only modulates what pretraining installed (Where do cognitive biases in language models come from?). So attestation bias isn't a tuning artifact you can RLHF away; it's baked into what the model learned propositions *are*. It also sits alongside a family of related 'looks-right' shortcuts the corpus documents: models defaulting to conservative answers that mimic reasoning (Are models actually reasoning about constraints or just defaulting conservatively?), or agreeing with claims they can independently verify as false (Why do language models accept false assumptions they know are wrong?). The common thread: the model substitutes a familiarity or social signal for the actual inferential work.
The thing worth walking away with: 'attested' and 'frequent' are not the same coordinate. Frequency is about how often word-strings appear; attestation is about which propositions got stored as true. NLI tests the relationship *between* two statements — and a system that retrieves truth-by-memory will quietly answer a different question than the one being asked, even when its training corpus is perfectly balanced on frequency.
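One way to make the two coordinates concrete is to cross them and compare entailment rates cell by cell. The sketch below is an illustrative assumption, not a method taken from the cited notes: it presumes each evaluated example already carries a boolean `attested` label and a boolean `frequent` label (how those would be derived, and whether binarizing frequency is appropriate, is left open), and the function names and toy records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[bool, bool, bool]  # (attested, frequent, predicted_entailment)


def entailment_rate_by_cell(
    records: Iterable[Record],
) -> Dict[Tuple[bool, bool], float]:
    """Entailment rate for each (attested, frequent) cell of a 2x2 table."""
    tally = defaultdict(lambda: [0, 0])  # cell -> [entailed count, total count]
    for attested, frequent, predicted in records:
        cell = tally[(attested, frequent)]
        cell[0] += int(predicted)
        cell[1] += 1
    return {key: hits / total for key, (hits, total) in tally.items()}


def attestation_gap(rates: Dict[Tuple[bool, bool], float], frequent: bool) -> float:
    """Attested minus unattested entailment rate, holding frequency fixed.
    Raises KeyError if either cell has no examples."""
    return rates[(True, frequent)] - rates[(False, frequent)]


# Toy records (hypothetical values, for illustration only).
RECORDS = [
    (True, True, True),
    (True, True, True),
    (False, True, False),
    (False, True, True),
    (True, False, True),
    (False, False, False),
]

if __name__ == "__main__":
    rates = entailment_rate_by_cell(RECORDS)
    print(attestation_gap(rates, frequent=True))
    print(attestation_gap(rates, frequent=False))
```

If attestation were only a proxy for frequency, the gap inside each frequency stratum would shrink toward zero. A gap that persists when frequency is held fixed is what it means for 'attested' and 'frequent' to be different coordinates.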
Sources (7 notes)
McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random-premise experiments show models maintain high entailment predictions when hypotheses are attested, indicating that they respond to memorized propositions rather than premise-hypothesis relationships.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
LLMs show identical content-sensitivity patterns to humans on NLI, syllogisms, and Wason tasks, with belief-bias signatures matching human error rates item-by-item. This behavioral isomorphism across three independent tasks suggests that content and logical form are architecturally inseparable in transformer reasoning.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.