Language Understanding and Pragmatics

Why do embedding contexts confuse LLM entailment predictions?

Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs.

Note · 2026-02-21 · sourced from Natural Language Inference
Where exactly does language competence break down in LLMs? How should researchers navigate LLM reasoning research?

"Simple Linguistic Inferences of LLMs" targets inferences humans find trivial: grammatically specified entailments ("You've eaten all my apples" entails "Someone ate something"), evidential adverbs of uncertainty ("allegedly" cancels the entailment of the clause), and monotonicity entailments (specific→general). LLMs show moderate-to-low performance on all three.
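The three inference types can be illustrated as premise-hypothesis pairs in a zero-shot NLI format. The pairs and prompt template below are illustrative sketches, not items from the paper's dataset (only the apples example appears in the note above):

```python
# Illustrative premise-hypothesis pairs for the three trivial inference
# types. Sentences other than the apples example are hypothetical.
PAIRS = [
    # Grammatically specified entailment: existential generalization.
    ("You've eaten all my apples.", "Someone ate something.", "entailment"),
    # Evidential adverb of uncertainty: "allegedly" cancels the entailment.
    ("The CEO allegedly shredded the documents.",
     "The CEO shredded the documents.", "neutral"),
    # Monotonicity: specific -> general (upward-entailing position).
    ("A dog barked in the yard.", "An animal barked in the yard.", "entailment"),
]

def to_nli_prompt(premise: str, hypothesis: str) -> str:
    """Format one pair as a zero-shot NLI query (hypothetical template)."""
    return (f"Premise: {premise}\nHypothesis: {hypothesis}\n"
            "Does the premise entail the hypothesis? Answer yes or no.")

for premise, hypothesis, gold in PAIRS:
    print(to_nli_prompt(premise, hypothesis), "| gold:", gold)
```

A human answers all three gold labels without effort; the paper's point is that LLMs do not.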

But the more revealing finding is what happens when the premise is embedded in grammatical contexts. Two types of embedding contexts should have opposite effects:

- Factive presupposition triggers (e.g., embedding under "know" or "regret") preserve the entailments of the embedded clause.
- Non-factive verbs (e.g., "believe", "claim") cancel them: the embedded clause is no longer entailed.

LLMs cannot make this discrimination. ChatGPT in regular prompting mode treats both presupposition triggers and non-factives as hints toward entailment. In chain-of-thought mode, it treats both as hints against entailment. The embedding context overwhelms the semantics of the embedded content, acting as a "blind" that masks the relevant inferential relationships.
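The discrimination the models fail to make can be sketched as a toy expected-label function; the verb lists and example clause are illustrative assumptions, not the paper's materials:

```python
# A minimal sketch of the semantic rule the models fail to apply: the
# same embedded clause under a factive trigger vs. a non-factive verb.
FACTIVES = {"know", "regret", "realize"}      # presuppose their complement
NON_FACTIVES = {"believe", "think", "claim"}  # do not

def expected_label(embedding_verb: str) -> str:
    """Factives preserve the embedded entailment; non-factives cancel it."""
    if embedding_verb in FACTIVES:
        return "entailment"   # "Mary knows that P" still entails P
    if embedding_verb in NON_FACTIVES:
        return "neutral"      # "Mary believes that P" does not entail P
    raise ValueError(f"unclassified verb: {embedding_verb}")

clause = "the senator resigned"
for verb in ["know", "believe"]:
    premise = f"Mary {verb}s that {clause}."
    print(premise, "->", expected_label(verb))
```

A model that treats the embedding verb as a uniform surface cue assigns the same label to both premises, in one direction or the other, which is exactly the failure described above.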

This is a different kind of failure from general reasoning difficulty: these are structural failures where syntactic packaging overrides semantic content. The model responds to the embedding verb (factive vs. non-factive) as a surface cue rather than computing its effect on the entailment relation. This is precisely the pattern "Can models pass tests while missing the actual grammar?" predicts: surface cues substituting for structural analysis.

The persistence across multiple prompts and LLMs confirms this is systematic, not incidental — "a systematic issue" in the paper's words.



presupposition triggers and non-factive verbs are embedding blinds that systematically miscalibrate llm entailment predictions