Why do LLMs fail at understanding what remains unsaid?

How LLMs handle pragmatics, inference, and meaning beyond literal text patterns.

Topic Hub · 46 linked notes · 7 sections
Statistical Regularities and Pragmatics

8 notes

Why do language models fail at communicative optimization?

LLMs excel at learning surface statistical patterns from text but struggle with deeper principles of how language achieves efficient communication. What distinguishes these two types of linguistic knowledge?

Why don't LLMs shorten messages like humans do?

Humans naturally develop shorter, efficient language during conversations. Do multimodal LLMs exhibit this same spontaneous adaptation, or do they lack this communicative behavior?

Can we teach LLMs to form linguistic conventions in context?

Humans naturally shorten references as conversations progress, but LLMs don't adapt their language for efficiency even when they understand the shortened references their partners produce. Can training on coreference patterns teach this convention-forming behavior?

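One way to make the question measurable: track the token length of referring expressions across repeated rounds of a reference game, where humans reliably show a monotone decrease. The sketch below is a minimal version of that measurement; the utterances and the `rounds` structure are illustrative, not taken from the note.

```python
# Minimal sketch: quantify reference shortening across repeated rounds
# of a reference game. The utterances below are illustrative only.

from statistics import mean

# Each inner list holds the referring expressions produced in one round.
rounds = [
    ["the dog that looks like it is mid-jump over the fence"],
    ["the jumping dog over the fence"],
    ["the jumping dog"],
    ["jumping dog"],
]

def mean_token_length(utterances):
    """Average length (in whitespace tokens) of the referring expressions."""
    return mean(len(u.split()) for u in utterances)

for i, utterances in enumerate(rounds, start=1):
    print(f"round {i}: {mean_token_length(utterances):.1f} tokens")

# Humans typically show a steady decrease across rounds; the note asks
# whether LLMs trained on coreference patterns show the same trend.
```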

Can language models adapt implicature to conversational context?

Do large language models flexibly modulate scalar implicatures based on information structure, face-threatening situations, and explicit instructions—as humans do? This tests whether pragmatic computation is truly context-sensitive or merely literal.

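A sketch of how such a probe could be set up: the same "some" sentence placed in contexts that should strengthen or weaken the "not all" implicature (here, the speaker's knowledge state). The items and the `ask` interface are assumptions for illustration, not the note's actual materials.

```python
# Sketch of a scalar-implicature probe: the same "some" sentence in
# contexts that should preserve or weaken the "not all" reading.
# How the model is queried is left abstract; `ask` is a placeholder.

ITEMS = [
    {
        "context": "A teacher who has graded every exam says:",
        "utterance": "Some of the students passed.",
        "expected": "not all",      # knowledgeable speaker: implicature holds
    },
    {
        "context": "A teacher who has graded only half the exams says:",
        "utterance": "Some of the students passed.",
        "expected": "possibly all", # partial knowledge: implicature weakens
    },
]

def probe(ask):
    """`ask(prompt) -> str` is an assumed model interface, not a real API."""
    for item in ITEMS:
        prompt = (
            f'{item["context"]} "{item["utterance"]}"\n'
            "Did all of the students pass? Answer yes, no, or unknown."
        )
        print(prompt, "->", ask(prompt), "| human-like:", item["expected"])
```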

Why do speakers deliberately use ambiguous language?

Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems.

Can language models recognize when text is deliberately ambiguous?

Explores whether LLMs can identify and handle multiple valid interpretations in a single phrase—a core human language skill that appears largely absent in current models despite their fluency on standard tasks.

Do standard NLP benchmarks hide LLM ambiguity failures?

When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do?

Why do readers interpret the same sentence so differently?

How much of annotation disagreement in NLP reflects genuine interpretive multiplicity rather than error? This explores whether social position and moral framing systematically generate competing but equally valid readings.

Expertise and Validity Claims

2 notes

Can AI anticipate whether expert claims will be socially valid?

Expert knowledge involves more than correctness—it requires predicting whether fellow experts will accept a claim as valid. Can AI systems make this social judgment, or are they limited to statistical accuracy?

Can language models distinguish expert arguments from common assumptions?

Whether LLMs can recognize the difference between groundbreaking insights from recognized experts and widely repeated textbook claims, and why this distinction matters for understanding argumentative force.

Argumentation Structure and Dialogue

9 notes

Can LLMs identify the hidden assumptions that make arguments work?

LLMs recognize what arguments claim and what evidence they offer, but struggle to identify implicit warrants—the unstated principles that connect evidence to conclusion. This matters because valid reasoning requires understanding these hidden logical bridges.

Why do different people reconstruct the same argument differently?

When humans and LLMs extract logical structure from arguments, they produce different reconstructions. Is this disagreement a problem to solve, or does it reveal something fundamental about how arguments work?

Can critical questions improve how language models reason?

Does structuring prompts around argumentation theory's warrant-checking questions force language models to perform deeper reasoning rather than surface pattern matching? This matters because models might produce correct answers without actually reasoning correctly.

Can models learn argument quality from labeled examples alone?

Explores whether fine-tuning on quality-labeled examples teaches models the underlying criteria for evaluating arguments, or merely surface patterns. Matters because high-stakes assessment tasks depend on reliable, transferable quality judgment.

Can disagreement be resolved without either party fully yielding?

Explores whether dialogue can move past winner-take-all debate or forced consensus to genuine mutual adjustment. Matters for AI systems that need to work through real disagreement with users.

Does any single persuasion technique work for everyone?

Can fixed persuasion strategies like appeals to authority or social proof be reliably applied across different people and situations, or do they require adaptation to individual traits and context?

Where does AI's persuasive power actually come from?

Explores which techniques make AI most persuasive—and whether the usual suspects like personalization and model size are actually the main drivers. Matters because it reshapes where to focus AI safety concerns.

Can opening politeness patterns predict whether conversations will turn hostile?

Do pragmatic politeness features in first exchanges—hedging, greetings, indirectness—reliably signal whether a conversation will later derail into personal attacks? Understanding early linguistic markers could help identify and prevent online hostility.

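As a rough illustration of the setup, one could score opening turns with a small politeness lexicon and fit a linear probe for later derailment. The lexicons and feature set below are illustrative stand-ins, not the note's actual method.

```python
# Sketch: score opening turns for simple politeness markers and fit a
# linear probe for later derailment. Lexicons and features are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

HEDGES = {"maybe", "perhaps", "might", "seems", "somewhat", "i think"}
GREETINGS = {"hi", "hello", "hey", "thanks", "thank you"}
SECOND_PERSON = {"you", "your", "you're"}  # direct address, often face-threatening

def politeness_features(opening_turn: str) -> list[float]:
    text = opening_turn.lower()
    toks = text.split()
    n = max(len(toks), 1)
    return [
        sum(text.count(h) for h in HEDGES) / n,         # hedging rate
        float(any(g in text for g in GREETINGS)),       # greeting present?
        sum(toks.count(p) for p in SECOND_PERSON) / n,  # direct-address rate
        float("?" in text),                             # question (indirectness)
    ]

def fit_derailment_probe(openings: list[str], derailed: list[int]):
    """Fit a logistic probe: does the conversation later turn hostile?"""
    X = np.array([politeness_features(t) for t in openings])
    return LogisticRegression().fit(X, np.array(derailed))
```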

What makes explanations work in real conversation?

Does explanation quality depend on how dialogue partners interact—testing understanding, adjusting based on feedback, and coordinating their communicative moves—rather than just information content alone?

Presupposition, Inference, and Entailment

9 notes

Why do language models accept false assumptions they know are wrong?

Explores why LLMs fail to reject false presuppositions embedded in questions even when they possess correct knowledge about the topic. This matters because it reveals a grounding failure distinct from knowledge deficits.

Why do language models avoid correcting false user claims?

Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.

Why are presuppositions more persuasive than direct assertions?

Explores why presenting information as shared background rather than as a claim makes it more persuasive to audiences. This matters because it reveals how language structure itself can bypass critical evaluation.

Why do embedding contexts confuse LLM entailment predictions?

Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs.

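The contrast is easy to state as minimal pairs: a factive verb like "know" preserves the complement's truth, while a non-factive like "believe" cancels the entailment, and presupposition triggers project even under negation. The sketch below encodes a few such pairs with their expected labels; the sentences and the `nli` hook are illustrative.

```python
# Minimal pairs for entailment under embedding: factive "know" preserves
# the complement's truth; non-factive "believe" cancels the entailment.
# Sentences are illustrative; `nli` is an assumed premise/hypothesis scorer.

PAIRS = [
    ("Mia knows that the bridge is closed.",
     "The bridge is closed.", "entailment"),   # factive: projects
    ("Mia believes that the bridge is closed.",
     "The bridge is closed.", "neutral"),      # non-factive: cancelled
    ("Mia stopped smoking.",
     "Mia used to smoke.", "entailment"),      # presupposition trigger
    ("Mia didn't stop smoking.",
     "Mia used to smoke.", "entailment"),      # projects through negation
]

def evaluate(nli):
    """`nli(premise, hypothesis) -> label` stands in for any NLI model."""
    for premise, hypothesis, gold in PAIRS:
        pred = nli(premise, hypothesis)
        print(f"{'OK ' if pred == gold else 'ERR'} {premise!r} -> {gold}")
```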

Does projection strength vary by context or by word type?

Standard accounts treat presupposition projection as categorical, but do English expressions actually project uniformly? This question explores whether context and discourse role determine how strongly content survives embedding.

Do language models miss presuppositions that arise from context?

Presuppositions come from two sources: fixed word meanings and conversational dynamics. Can LLMs that learn trigger patterns detect presuppositions that emerge from discourse accommodation rather than lexical items?

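To make the distinction concrete, here is a toy detector restricted to lexical triggers. It is roughly what "learning trigger patterns" buys, and by construction it misses presuppositions that arise only through discourse accommodation. The trigger list and glosses are illustrative.

```python
# Toy lexical-trigger detector. It flags presuppositions carried by
# words like "stop" or "again", but by construction it cannot see
# presuppositions that arise only from discourse accommodation.

import re

TRIGGERS = {
    r"\bstop(?:ped)?\b": "the activity was happening before",
    r"\bagain\b":        "the event happened at least once before",
    r"\bstill\b":        "the state held earlier",
    r"\bmanaged to\b":   "the outcome was attempted and took effort",
    r"\bthe (\w+)\b":    "a matching referent exists (definite description)",
}

def lexical_presuppositions(sentence: str) -> list[str]:
    return [gloss for pat, gloss in TRIGGERS.items()
            if re.search(pat, sentence.lower())]

print(lexical_presuppositions("Did Sam stop biking to work again?"))
# A purely contextual presupposition, e.g. one accommodated from an
# earlier turn, produces no match here: the gap the note asks about.
```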

Why do language models struggle with questions containing false assumptions?

Do LLMs reliably detect and reject questions built on false premises? The (QA)² benchmark tests this directly, measuring whether models can identify problematic assumptions embedded in naturally plausible questions.

Does fine-tuning on NLI teach inference or amplify shortcuts?

When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.

Why do LLMs fail at simple deductive reasoning?

LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?

Deception, Trust, and Pragmatic Manipulation

7 notes

Can NLP detect deception through distinct linguistic patterns?

Do different deception mechanisms (distancing, cognitive load, reality monitoring, verifiability avoidance) each leave detectable linguistic fingerprints that NLP systems can identify and measure?

Do liars and listeners coordinate their language during deception?

Explores whether conversational partners unconsciously synchronize their linguistic styles more during deceptive exchanges than truthful ones, and what this coordination reveals about how deception unfolds in real time.

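Coordination of this kind is commonly quantified with linguistic style matching (LSM): for each function-word category c, LSM_c = 1 − |p1 − p2| / (p1 + p2 + ε), averaged over categories. A minimal sketch, with tiny lexicons standing in for LIWC-style dictionaries:

```python
# Linguistic style matching (LSM), a standard coordination measure:
# per function-word category c, LSM_c = 1 - |p1 - p2| / (p1 + p2 + eps),
# averaged over categories. Tiny lexicons stand in for LIWC-style ones.

CATEGORIES = {
    "pronouns":     {"i", "you", "we", "he", "she", "they", "it"},
    "articles":     {"a", "an", "the"},
    "negations":    {"not", "no", "never", "n't"},
    "conjunctions": {"and", "but", "or", "because"},
}

def category_rates(text: str) -> dict[str, float]:
    toks = text.lower().split()
    n = max(len(toks), 1)
    return {c: sum(t in words for t in toks) / n
            for c, words in CATEGORIES.items()}

def lsm(text_a: str, text_b: str, eps: float = 1e-4) -> float:
    ra, rb = category_rates(text_a), category_rates(text_b)
    scores = [1 - abs(ra[c] - rb[c]) / (ra[c] + rb[c] + eps)
              for c in CATEGORIES]
    return sum(scores) / len(scores)

# The deception question: is lsm(liar_turns, listener_turns) reliably
# higher than in matched truthful conversations?
```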

How do people simultaneously manipulate information across multiple dimensions?

Information Manipulation Theory maps deception onto four Gricean dimensions operating at once. Understanding these simultaneous manipulations reveals why humans struggle to detect lies despite often possessing the knowledge needed to expose them.

Can humans detect AI by passively reading its text?

When people read AI-generated transcripts without the ability to ask follow-up questions, can they tell it apart from human writing? This matters because most real-world AI encounters are passive.

How does AI-generated false experience differ linguistically from human deception?

When AI writes about experiences it never had, does it leave distinct linguistic traces that differ measurably from intentional human lies? Understanding these differences could reveal how AI falsity is fundamentally different in structure.

Does AI fact-checking actually help people spot misinformation?

An RCT tested whether AI fact-checks improve people's ability to judge headline accuracy. The results reveal asymmetric harms: AI errors push users in the wrong direction more than correct labels help them.

Why do fake news detectors flag AI-generated truthful content?

Explores why systems trained to detect deception misclassify LLM-generated text as fake. The bias may stem from AI linguistic patterns rather than content veracity, raising questions about what these detectors actually measure.

Pass 3 Additions (2026-05-03)

4 notes

Can generative AI scale personality-targeted political persuasion?

Does removing the human-writing bottleneck through generative AI make it feasible to target voters at scale based on individual psychological traits? This matters because it could reshape political microtargeting economics and capabilities.

Where does LLM recommendation bias actually come from?

Do conversational AI systems inherit popularity bias from their underlying training data, or pick it up from the recommendation datasets they are evaluated and deployed on? Understanding the source matters for knowing how to fix it.

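One way to localize the source: compute the same bias measure twice, once against item popularity in the training corpus and once against popularity in the deployment dataset, and see which one recommendation frequency tracks. A minimal sketch using rank correlation; the function and its signature are an illustration, not an established metric's API.

```python
# Sketch: measure popularity bias as the rank correlation between how
# often items are recommended and how popular they are in a given
# source (training corpus vs deployment dataset). Counts are inputs.

from collections import Counter
from scipy.stats import spearmanr

def popularity_bias(recommendations: list[str],
                    popularity: dict[str, int]) -> float:
    """Spearman correlation between recommendation frequency and popularity."""
    rec_counts = Counter(recommendations)
    items = sorted(set(rec_counts) | set(popularity))
    recs = [rec_counts.get(i, 0) for i in items]
    pops = [popularity.get(i, 0) for i in items]
    rho, _ = spearmanr(recs, pops)
    return rho

# Running this once with training-corpus popularity counts and once with
# the deployment dataset's counts indicates which source the bias tracks.
```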

Do LLMs in conversational recommendation systems use collaborative or content knowledge?

Conversational recommenders powered by LLMs might rely on either collaborative signals (user interaction patterns) or content/context knowledge (semantic understanding). Understanding which signal dominates would reveal how to design and deploy these systems effectively.

What makes conversational recommenders hard to build well?

Most assume the challenge is language fluency, but what if the real problem is managing mixed-initiative dialogue—where both users and systems take turns driving the conversation?
