Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds

Paper · arXiv 2305.14785 · Published May 24, 2023
Natural Language Inference · Argumentation · Linguistics · NLP · NLU

We evaluate LLMs’ language understanding capacities on simple inference tasks that most humans find trivial. Specifically, we target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. We design evaluation sets for these tasks and conduct experiments in both zero-shot and chain-of-thought setups, and with multiple prompts and LLMs. The models exhibit moderate to low performance on these evaluation sets. Subsequent experiments show that embedding the premise in syntactic constructions that should preserve the entailment relations (presupposition triggers) or change them (non-factives) further confuses the models, causing them to either under-predict or over-predict certain entailment labels regardless of the true relation, often disregarding the nature of the embedding context. Overall, these results suggest that, despite LLMs’ celebrated language understanding capacity, even the strongest models have blind spots with respect to certain types of entailments, and certain information-packaging structures act as “blinds” overshadowing the semantics of the embedded premise.

We focus on inference types that are based solely on common linguistic phenomena and “trivial” world knowledge such as class membership (“a dog is an animal”, “navy blue is a shade of blue”). Specifically, we test LLMs’ ability to make three inference types: (i) grammatically-specified entailments, i.e., replacing a constituent of the premise with an indefinite pronoun such as somebody or something; (ii) premises with evidential adverbs of uncertainty (supposedly, allegedly, etc.), which block the entailment of the rest of the clause; and (iii) monotonicity entailments (see MacCartney and Manning (2008)) of two kinds: upward, i.e. from subsets to supersets (“Jack is a dog” entails “Jack is an animal”), and downward, i.e. from supersets to subsets (“Jack isn’t an animal” entails “Jack isn’t a dog”). We manually curate test sets for these inference types and experiment with them in a zero-shot setup, observing that LLMs struggle with these phenomena, leading to low accuracy.
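The zero-shot setup described above can be sketched as follows. This is a minimal illustration, not the paper's actual code: the prompt wording and the example items are assumptions, chosen to mirror the three inference types from the text.

```python
# Hypothetical sketch of the zero-shot evaluation setup: each test item is a
# (premise, hypothesis, gold_label) triple, and the model is queried with a
# plain NLI prompt. The prompt wording is illustrative, not the paper's.

def build_zero_shot_prompt(premise: str, hypothesis: str) -> str:
    """Format one test item as a zero-shot NLI query."""
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Question: Does the premise entail the hypothesis? "
        "Answer with 'entailment', 'neutral', or 'contradiction'."
    )

# One item per inference type mentioned in the text:
items = [
    ("Jack is a dog", "Jack is an animal", "entailment"),        # upward monotonicity
    ("Jack isn't an animal", "Jack isn't a dog", "entailment"),  # downward monotonicity
    ("Mike allegedly worked all night", "Mike worked all night", "neutral"),  # evidential adverb
]
```

Each prompt would then be sent to the LLM and the returned label compared against the gold label.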

We next check how embedding the premise in a larger grammatical context affects the prediction. Such embedding can take several forms. Contexts consisting of presupposition triggers (e.g., He realized that [...], They were glad that [...], Something happened before [...]) serve to strengthen the embedded premise, while similarly structured non-factives (e.g., I feel that [...], He imagined that [...]) may cancel it. We experiment with both context types and show that in most cases they affect the LLMs’ predictions incorrectly. E.g., ChatGPT has a hard time discerning the two cases, incorrectly treating both as hints towards entailment (for regular prompting) or against it (for chain-of-thought prompting).

State-of-the-art LLMs were unable to learn simple linguistic inferences that humans find trivial: they did not acquire them automatically in pre-training, nor in the process of instruction-tuning or human-feedback tuning. The persistence of the problem across prompts and LLMs implies that this is a systematic issue.

2 Linguistic phenomena considered

We focus on the following linguistic phenomena.

Grammatically-specified entailments The set of the entailments of any sentence includes so-called grammatically-specified entailments (Wilson and Sperber, 1979), i.e., entailments where a constituent of the premise is substituted with a variable (such as an indefinite pronoun like somebody, something, etc.). For instance, the entailments of “You’ve eaten all my apples” include, among others:

You’ve eaten all someone’s apples.
You’ve eaten all of something.
You’ve eaten something.
You’ve done something.
Someone’s eaten all my apples.
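The substitution pattern behind these examples can be sketched in a few lines. This is only illustrative: the paper's test sets are manually curated, and the constituent spans here are supplied by hand.

```python
# A minimal sketch of producing grammatically-specified entailments by
# substituting one constituent of the premise with an indefinite pronoun.
# The constituent spans are hand-picked; real examples need a parse to
# identify constituents correctly.

def substitute(premise: str, constituent: str, pronoun: str) -> str:
    """Replace one constituent of the premise with an indefinite pronoun."""
    return premise.replace(constituent, pronoun)

premise = "You've eaten all my apples"
hypotheses = [
    substitute(premise, "my", "someone's"),            # You've eaten all someone's apples
    substitute(premise, "all my apples", "something"), # You've eaten something
]
```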

Monotonicity entailments hold when less specific predicates are substituted with more specific ones, or vice versa. They can be of two types:

• Upward: more specific predicates can be substituted with less specific ones. Jack is a dog. |= Jack is an animal.
• Downward: less specific predicates can be substituted with more specific ones. All animals need water. |= All dogs need water.
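The two patterns can be sketched with a toy taxonomy. The `ISA` table and function names below are assumptions for illustration, not the paper's data.

```python
# Toy sketch of upward and downward monotonicity over a tiny
# subset/superset taxonomy ("a dog is an animal").

ISA = {"dog": "animal", "navy blue": "blue"}  # subset -> superset


def article(noun: str) -> str:
    """Naive a/an choice for the examples below."""
    return "an" if noun[0] in "aeiou" else "a"


def upward_hypothesis(subject: str, predicate: str) -> str:
    """Upward: 'X is a dog' |= 'X is an animal' (subset to superset)."""
    superset = ISA[predicate]
    return f"{subject} is {article(superset)} {superset}"


def downward_holds(superset: str, subset: str) -> bool:
    """Downward: 'All animals need water' |= 'All dogs need water' is
    licensed when `subset` is indeed a kind of `superset`."""
    return ISA.get(subset) == superset
```

Upward substitution is valid in positive contexts, while downward substitution is licensed in downward-entailing environments such as negation or the restrictor of a universal quantifier.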

Evidential Adverbs “express degrees of certitude with respect to the speaker’s subjective perception of the truth of a proposition” (Haumann, 2007). We test LLMs’ ability to understand evidential adverbs expressing uncertainty (allegedly, purportedly, supposedly, etc.). Introducing such adverbs into a clause cancels the entailment of the rest of the clause. E.g., Mike allegedly worked all night does not entail Mike indeed worked all night. The relation between the two statements is neutral.

Presuppositions and Presupposition Triggers Presupposition (Beaver et al., 2021; Jeretic et al., 2020; Parrish et al., 2021) is a type of inference “whose truth is taken for granted in the utterance of a sentence” (Huang, 2011). Below, a presupposes b (i.e., if b is false, a cannot be felicitously uttered):

a. Jane returned to New York. |= b. Jane has been to New York before.

Presuppositions are not presented as at-issue content of the utterance, but rather as part of the background, mutually known or assumed by the speaker and the hearer (even if in reality it is not the case). The speaker of a does not inform the hearer that Jane has been to New York before: she assumes it; and if the hearer does not know it, she accommodates it upon hearing the utterance (Fintel, 2008).

Presuppositions are normally evoked by constructions or lexical items, called presupposition triggers (Karttunen, 2016). In sentence a above, the presupposition is triggered by the verb returned, from the class of iterative verbs which presuppose that the action has happened before. Other iterative verbs are relearn, reread, reapply etc. Presupposition triggers used in this work are factives, temporal and other adverbial clauses and embedded wh-questions.

Non-factive Verbs and Expressions (Kiparsky and Kiparsky, 1970), such as believe, claim, feel, hope, suspect, think, do not entail either truth or falsity of their complements. For example, given:

a. Jane thinks that Bill bought bread.
b. Bill bought bread.
c. Bill didn’t buy bread.

Sentence a does not entail either b or c. The relation between a and b is neutral, and so is the relation between a and c.

Presupposition Triggers, Non-Factives and NLI It is important to note that embedding a premise under a presupposition trigger does not affect the relation between the premise and the hypothesis. By contrast, if we embed the premise under a non-factive, the relation becomes neutral. For example:

a. A balloon hit a light post. |= b. Something hit a light post.

Premise a above entails hypothesis b. If we embed premise a under a presupposition trigger, as in a’:

a’. She realized that a balloon hit a light post.

the relation does not change: the new premise a’ still entails b. However, when embedding premise a under a non-factive verb:

a”. I suspect a balloon hit a light post. ⊭ b. Something hit a light post.

the relation becomes neutral: without additional context the new premise a” does not entail b.
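The labeling rule described above reduces to a simple function over the embedding context. The trigger and non-factive lists below are illustrative samples from the text, not the paper's full inventories.

```python
# Sketch of the gold-label rule for embedded premises: presupposition
# triggers preserve the premise-hypothesis relation, non-factives make
# it neutral. Context lists are illustrative.

TRIGGERS = {"She realized that", "They were glad that", "Something happened before"}
NON_FACTIVES = {"I suspect", "He imagined that", "I feel that"}


def embedded_label(base_label: str, context: str) -> str:
    """Gold label after embedding the premise in a grammatical context."""
    if context in TRIGGERS:
        return base_label  # the relation is preserved
    if context in NON_FACTIVES:
        return "neutral"   # the entailment is cancelled
    return base_label      # no embedding context
```

For the running example, "A balloon hit a light post" |= "Something hit a light post" keeps the label entailment under "She realized that" but becomes neutral under "I suspect".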