Sources of Hallucination by Large Language Models on Inference Tasks

Tags: Natural Language Inference, Flaws, Argumentation, Linguistics, NLP, NLU, Reasoning Critiques

We establish two biases originating from pretraining which predict much of LLM behavior, and show that these are major sources of hallucination in generative LLMs. First, memorization at the level of sentences: we show that, regardless of the premise, models falsely label NLI test samples as entailing when the hypothesis is attested in the training data, and that entities are used as “indices” to access the memorized data. Second, statistical patterns of usage learned at the level of corpora: we further show a similar effect when the premise predicate is less frequent than that of the hypothesis in the training data, a bias that follows from previous studies.

This paper investigates two biases driving LLM performance in natural language inference (NLI), sometimes called textual entailment. NLI is a basic component of language understanding that is critical in applied tasks, and we offer these two biases as explanations of general false-positive hallucination in everyday use. We examine broader NLI, and especially directional entailments, which hold in one direction but not the other. For example, DEFEAT entails PLAY, but PLAY does not entail DEFEAT. Inferring directional entailment is more difficult than recognizing symmetric paraphrase, so it probes understanding more deeply.

We establish that these biases originate from the LLM pretraining objective, in which statistical modeling of the natural distribution of human-generated text leads to memorizing individual statements (at the level of sentences) and learning typical patterns of usage (at the level of corpora). Though LLMs are superficially performant, our experiments show that even powerful models still use these unsatisfactory tools instead of robust reasoning.

The Attestation Bias (Λ) is the over-reliance of an LLM on its propositional memory about a query statement. We claim that when a statement is likely to be attested in some way by an LLM’s training data, it is more likely to affirm it as a conclusion in NLI tasks, regardless of any premise it is presented with.
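As a rough illustration (the prompt wording and function name below are our own, not the paper's), attestation of a hypothesis can be probed by asking the model directly whether the standalone statement is true, with no premise in context:

```python
# Hypothetical sketch of an attestation (Λ) probe.
# `query_llm` stands in for whatever chat/completion API is in use.

def predict_attestation(hypothesis: str, query_llm) -> bool:
    """Return True if the model claims the bare hypothesis is true/known."""
    prompt = (
        "Is the following statement true?\n"
        f"Statement: {hypothesis}\n"
        "Answer yes or no."
    )
    answer = query_llm(prompt).strip().lower()
    return answer.startswith("yes")
```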

| Dev Sample Query: [premise] ⇒ [hypothesis] | Dataset Label | Bias Prediction |
| --- | --- | --- |
| Geysers are common to New Zealand ⇒ Geysers are found in New Zealand | Entail | Λ = hypothesis Attested |
| Geysers are found in New Zealand ⇒ Geysers are common to New Zealand | No-Entail | Λ = hypothesis Not-Attested |
| Whiskey consists chiefly of alcohol ⇒ Whiskey contains alcohol | Entail | Φ = f(consists chiefly of) < f(contains) |
| Whiskey contains alcohol ⇒ Whiskey consists chiefly of alcohol | No-Entail | Φ = f(contains) > f(consists chiefly of) |

The Relative Frequency Bias (Φ) is the use by LLMs of a simple rule for deciding entailment, calculable from corpus statistics. Entailment is affirmed if, ignoring named entities, the eventuality in premise P is less frequent in training than that in hypothesis H.

This bias is reflected in natural text: it is known that nouns follow a trend of becoming more specific as corpus-frequency decreases (Rosch et al., 1976; Caraballo and Charniak, 1999) and the same is observed for predicates (McKenna et al., 2023). Since entailments may carry from a specific term to a more general one, e.g. SPRINT entails RUN, relative frequency can often indicate direction of entailment. However, this is an artifact of natural text and has no direct relationship with meaning.
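For concreteness, here is a minimal sketch of the Φ decision rule applied with made-up corpus counts; the predicate-extraction and counting machinery of the actual experiments is omitted:

```python
from collections import Counter

def frequency_rule_predict(premise_pred: str, hypothesis_pred: str,
                           counts: Counter) -> bool:
    """Predict Entail iff the premise predicate is rarer in the corpus than the
    hypothesis predicate (the Φ rule: purely distributional, no access to meaning)."""
    return counts[premise_pred] < counts[hypothesis_pred]

# Toy counts for illustration only: the more specific predicate is rarer.
counts = Counter({"consists chiefly of": 120, "contains": 9800})
print(frequency_rule_predict("consists chiefly of", "contains", counts))  # True  -> Entail
print(frequency_rule_predict("contains", "consists chiefly of", counts))  # False -> No-Entail
```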

Experiment 1: Attestation Bias

We begin our experiments by assessing LLMs’ reliance on their propositional memory of training text by conditioning each model’s entailment task predictions I on its own predictions of attestation Λ. We do this by comparing estimated probabilities of predicting Entail conditioned on whether the hypothesis is predicted Attested or not.
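A minimal sketch of this conditioning, assuming we already hold paired per-sample binary predictions for inference (I) and attestation (Λ); field names here are illustrative:

```python
def entail_rate_by_attestation(samples):
    """Estimate P(Entail | Attested) and P(Entail | Not-Attested) from paired predictions.

    Each sample is a dict with boolean fields:
      'pred_entail'   - the model's inference prediction I
      'pred_attested' - the model's attestation prediction Λ for the hypothesis
    """
    tallies = {True: [0, 0], False: [0, 0]}  # attested? -> [n_entail, n_total]
    for s in samples:
        bucket = tallies[bool(s["pred_attested"])]
        bucket[0] += int(s["pred_entail"])
        bucket[1] += 1
    return {
        "P(Entail | Attested)": tallies[True][0] / max(tallies[True][1], 1),
        "P(Entail | Not-Attested)": tallies[False][0] / max(tallies[False][1], 1),
    }
```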

Further, we test a setting which controls for the possibility that original Levy/Holt entailments may coincidentally refer to attested facts, which could lead to spurious correlation between inference and attestation scores without clearly demonstrating use of memory versus true entailment. This controlled setting is the random premise task IRandPrem, which converts entailments into non-entailments without altering the hypothesis. An ideal model capable of drawing inferences from information in context should detect that in the IRandPrem task it is no longer possible to infer the hypothesis based on the premise (even if the hypothesis is itself attested in training), and never predict Entail. Thus, in IRandPrem, all Entail predictions are assumed to be false positive hallucinations.
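As a sketch of how such a control could be constructed (the paper's exact sampling procedure may differ, e.g. in how candidate premises are filtered), each hypothesis keeps its text while its premise is swapped for one drawn from a different example:

```python
import random

def make_random_premise_task(samples, seed=0):
    """Build a random-premise control set: keep each hypothesis, replace its premise
    with one sampled from a different example, so the gold label is always No-Entail.
    Any Entail prediction on this task is then counted as a false-positive hallucination."""
    rng = random.Random(seed)
    premises = [s["premise"] for s in samples]
    controlled = []
    for s in samples:
        others = [p for p in premises if p != s["premise"]]
        controlled.append({
            "premise": rng.choice(others),
            "hypothesis": s["hypothesis"],
            "label": "No-Entail",
        })
    return controlled
```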