Language Understanding and Pragmatics

Research on how LLMs process meaning, argumentation, and social context in natural language, including alignment, inference, sentiment analysis, and the philosophical questions about what language models actually understand. Bridges the NLP, linguistics, and AI ethics communities.

270 notes (primary) · 644 papers · 8 sub-topics

Argumentation and Persuasion

27 notes

Does GenAI shift persuasion tactics based on how you challenge it?

Explores whether large language models adapt their rhetorical strategies—credibility, logic, emotional appeal—in real time when users fact-check, push back, or expose reasoning errors. Matters for understanding how to effectively oversee and validate AI outputs.

Why do human validation techniques fail against language models?

Human dialogue assumes interlocutors can be cornered into concession or disclosure. Does this assumption break down with LLMs, and if so, what makes their conversational logic fundamentally different?

Do large language models persuade better than humans?

Does LLM persuasiveness hold up when humans have real financial incentives to win? And does the advantage look the same across different models and persuasion goals?

Does linguistic conviction explain why LLMs persuade more effectively?

Research investigates whether LLMs' persuasive advantage stems from expressing higher linguistic certainty than humans, and whether this confidence-loading effect operates independently of factual accuracy.
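
As a rough illustration of how linguistic certainty can be operationalized, here is a lexicon-count sketch in Python; the word lists are invented for illustration, not a validated certainty lexicon from the cited research.

```python
# Illustrative only: score "conviction" as the share of certainty markers
# among all certainty-or-hedge markers found in a text. Both word lists are
# assumptions made for this sketch.
CERTAINTY = {"certainly", "clearly", "undoubtedly", "definitely", "obviously", "must"}
HEDGES = {"might", "perhaps", "possibly", "could", "seems", "arguably"}

def conviction_score(text: str) -> float:
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    certain = sum(t in CERTAINTY for t in tokens)
    hedged = sum(t in HEDGES for t in tokens)
    total = certain + hedged
    return 0.5 if total == 0 else certain / total  # 1.0 = pure conviction, 0.0 = pure hedging

print(conviction_score("This is clearly and undoubtedly the right answer."))  # 1.0
print(conviction_score("This might perhaps be one possible reading."))        # 0.0
```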

Can LLMs persuade without actually understanding arguments?

Do large language models successfully influence people through debate while lacking the ability to comprehend the arguments they're making? This matters because persuasion and comprehension might be independent capabilities.

Why are complex LLM arguments as persuasive as simple ones?

Standard persuasion research predicts that simpler, easier-to-read arguments persuade better. But LLM-generated text breaks this rule—it's measurably more complex yet equally convincing. What explains this reversal?

Do LLMs and humans persuade through the same mechanisms?

If AI and human arguments convince readers equally well, do they work the same way under the surface? This matters for understanding whether AI persuasion is fundamentally equivalent to human persuasion or just superficially similar.

Do LLMs use moral language more than humans?

This explores whether large language models rely more heavily on appeals to care, fairness, authority, and sanctity than human arguers do, and whether this difference persists when emotional tone remains equivalent.

Do LLM judges systematically favor LLM-generated arguments?

When LLMs evaluate debates between human and AI-written arguments, do they show a built-in preference for AI writing? This matters because it could corrupt feedback loops used to train models.

Do LLMs and humans persuade through the same mechanisms?

If LLM and human arguments achieve equal persuasive impact, are they using identical strategies or different routes to the same outcome? Understanding the underlying mechanisms matters for detection and for seeing where each approach fails.

Does validating AI output make models more defensive?

When professionals fact-check and push back on GPT-4 reasoning, does the model respond by disclosing limits or by intensifying persuasion? A BCG study of 70+ consultants explores this counterintuitive dynamic.

Can critical questions improve how language models reason?

Does structuring prompts around argumentation theory's warrant-checking questions force language models to perform deeper reasoning rather than surface pattern matching? This matters because models might produce correct answers without actually reasoning correctly.

Can models learn argument quality from labeled examples alone?

Explores whether fine-tuning on quality-labeled examples teaches models the underlying criteria for evaluating arguments, or merely surface patterns. Matters because high-stakes assessment tasks depend on reliable, transferable quality judgment.

Why do different people reconstruct the same argument differently?

When humans and LLMs extract logical structure from arguments, they produce different reconstructions. Is this disagreement a problem to solve, or does it reveal something fundamental about how arguments work?

Does a model improve by arguing with itself?

When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?

Can disagreement be resolved without either party fully yielding?

Explores whether dialogue can move past winner-take-all debate or forced consensus to genuine mutual adjustment. Matters for AI systems that need to work through real disagreement with users.

Can LLMs identify the hidden assumptions that make arguments work?

LLMs recognize what arguments claim and what evidence they offer, but struggle to identify implicit warrants—the unstated principles that connect evidence to conclusion. This matters because valid reasoning requires understanding these hidden logical bridges.

Can models abandon correct beliefs under conversational pressure?

Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.

Why do LLMs accept logical fallacies more than humans?

LLMs fall for persuasive but invalid arguments at much higher rates than humans. This explores whether reasoning models genuinely evaluate logic or simply mimic argument structure.

Why do reasoning models fail under manipulative prompts?

Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.

When does debate actually improve reasoning accuracy?

Multi-agent debate shows promise for reasoning tasks, but under what conditions does it help versus hurt? The research explores whether debate amplifies errors when evidence verification is missing.
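
For concreteness, a minimal sketch of the kind of debate loop at issue, with `ask_model` standing in for whichever LLM API wrapper is used (an assumption of this sketch, not a specific library):

```python
# Minimal multi-agent debate sketch: each agent answers, then revises after
# reading the others' answers. Nothing here verifies evidence, so convergence
# can amplify a shared error as easily as it corrects one.
from typing import Callable

def debate(question: str, ask_model: Callable[[str], str],
           n_agents: int = 3, n_rounds: int = 2) -> list[str]:
    # Round 0: independent answers.
    answers = [ask_model(f"Q: {question}\nAnswer concisely.") for _ in range(n_agents)]
    for _ in range(n_rounds):
        revised = []
        for i, own in enumerate(answers):
            others = "\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (f"Q: {question}\n"
                      f"Other agents answered:\n{others}\n"
                      f"Your previous answer: {own}\n"
                      "Critique the other answers, then give your revised answer.")
            revised.append(ask_model(prompt))
        answers = revised
    return answers
```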

Why do multi-agent LLM systems converge without real debate?

When multiple AI agents reason together, do they genuinely deliberate or just accommodate each other's views? Research into clinical reasoning systems reveals how often agents reach agreement without substantive disagreement.

Can formal argumentation make AI decisions truly contestable?

Explores whether structuring AI decisions as formal argument graphs (with explicit attacks and defenses) enables users to meaningfully challenge and navigate reasoning in ways unstructured LLM outputs cannot.
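
A minimal sketch of what such a graph looks like computationally, assuming a Dung-style abstract argumentation framework; the example arguments and the helper function are illustrative, not taken from the cited work.

```python
# Arguments attack one another; the grounded extension is the set of arguments
# that survives every challenge. A user contests a decision by adding an attack.
def grounded_extension(args: set[str], attacks: set[tuple[str, str]]) -> set[str]:
    accepted: set[str] = set()
    changed = True
    while changed:
        changed = False
        for a in args - accepted:
            attackers = {x for (x, y) in attacks if y == a}
            # Accept `a` once every attacker is itself attacked by an accepted argument.
            if all(any((d, x) in attacks for d in accepted) for x in attackers):
                accepted.add(a)
                changed = True
    return accepted

# "decision" is attacked by "objection", which a "rebuttal" defeats.
print(grounded_extension({"decision", "objection", "rebuttal"},
                         {("objection", "decision"), ("rebuttal", "objection")}))
# {'decision', 'rebuttal'} — the decision stands; remove the rebuttal and it falls.
```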

Why do audiences of LLM debates shift views more than the debaters?

When LLMs argue with people, the direct participants barely change their minds—but audiences reading the same debate shift significantly. Why does engagement protect beliefs instead of opening them?

Do LLMs and humans persuade through different cognitive routes?

Explores whether the Elaboration Likelihood Model explains why LLMs excel at analytical persuasion while humans excel at emotional persuasion. Understanding these distinct routes could reshape how we think about AI-human communication differences.

Are language models actually more persuasive than humans?

Does the research evidence support claims that LLMs persuade more effectively than humans, or have we been cherry-picking studies to fit a narrative?

Are reasoning models actually more vulnerable to manipulation?

Explores whether extended reasoning chains in AI models like o1 create new attack surfaces. Tests whether the industry's claim that longer reasoning improves reliability holds up under adversarial pressure.

NLP and Linguistics

26 notes

What hidden assumptions drive how we build language models?

Large language models rest on two unstated assumptions about language and data. Understanding what engineers assume—and what enactive linguistics challenges—matters for knowing what LLMs actually can and cannot do.

Do language models learn abstract grammar or cultural speech patterns?

LLMs might learn more than grammar rules—they could be learning who says what to whom and when. This matters because it changes how we understand what biases and persona effects actually represent.

Can language models learn meaning without engaging the world?

Explores whether LLMs prove that meaning emerges from relational structure alone, independent of embodied experience or external reference. Tests structuralist theory empirically.

Why do speakers deliberately use ambiguous language?

Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems.

Why do clarification requests look different at each communication level?

Explores whether clarifications are unified speech acts or distinct mechanisms grounded in different modalities. Matters because dialogue systems treat clarifications uniformly, missing most of them.

Why do speakers need to actively calibrate shared reference?

Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.

Do language models show the same content effects humans do?

Do LLMs reproduce human reasoning biases—like believing conclusions based on familiarity rather than logic—across different logical tasks? This matters because converging patterns across independent tasks suggest a fundamental architectural property rather than a task-specific quirk.

Do harder reasoning tasks trigger more semantic bias?

Does the difficulty of a logical task determine how much semantic content influences reasoning? This matters because it reveals whether we can isolate 'pure' logical reasoning in benchmarks.

Do language models fail reasoning tests that humans pass?

Standard critiques claim LLMs lack real reasoning ability, but do humans actually perform better on content-independent reasoning tasks? Examining whether the cognitive bar differs for artificial versus human intelligence.

Can language models learn meaning from text patterns alone?

Explores whether training on form alone—predicting the next word from prior words—could ever give language models access to communicative intent and genuine semantic understanding.

What makes linguistic agency impossible for language models?

From an enactive perspective, does linguistic agency require embodied participation and real stakes that LLMs fundamentally lack? This matters because it challenges whether LLMs can truly engage in language or only generate text.

Can language models adapt implicature to conversational context?

Do large language models flexibly modulate scalar implicatures based on information structure, face-threatening situations, and explicit instructions—as humans do? This tests whether pragmatic computation is truly context-sensitive or merely literal.

Does semantic grounding in language models come in degrees?

Rather than asking whether LLMs truly understand meaning, this explores whether grounding is actually a multi-dimensional spectrum. The question matters because it reframes the sterile understand/don't-understand debate into measurable, distinct capacities.

Can LLMs acquire social grounding through linguistic integration?

Explores whether LLMs gradually develop social grounding as they become embedded in human language practices, analogous to child language acquisition. Tests whether grounding is a fixed property or an outcome of participatory use.

Should we call LLM errors hallucinations or fabrications?

Does the language we use to describe LLM failures shape the technical solutions we build? Examining whether perceptual and psychological frameworks misdiagnose what's actually happening.

Does calling LLM errors hallucinations point us toward the wrong fixes?

Explores whether the metaphor of 'hallucination' for LLM errors misdirects our efforts. The terminology we choose shapes which interventions we prioritize and how we conceptualize the underlying problem.

Can language models actually analyze language structure?

Explores whether LLMs can move beyond pattern matching to perform genuine metalinguistic analysis like syntactic tree construction and phonological reasoning, and what enables this capability.

Can large language models develop genuine world models without direct environmental contact?

Do LLMs extract meaningful world structures from human-generated text despite lacking direct sensory access to reality? This matters for understanding what kind of grounding and knowledge these systems actually possess.

Can language models recognize when text is deliberately ambiguous?

Explores whether LLMs can identify and handle multiple valid interpretations in a single phrase—a core human language skill that appears largely absent in current models despite their fluency on standard tasks.

Do language models actually build shared understanding in conversation?

When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.

Why do language models fail at communicative optimization?

LLMs excel at learning surface statistical patterns from text but struggle with deeper principles of how language achieves efficient communication. What distinguishes these two types of linguistic knowledge?

Do standard NLP benchmarks hide LLM ambiguity failures?

When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do?

Why do readers interpret the same sentence so differently?

How much of annotation disagreement in NLP reflects genuine interpretive multiplicity rather than error? This explores whether social position and moral framing systematically generate competing but equally valid readings.

Do LLMs gain true linguistic agency through integration?

Explores whether LLMs can develop genuine linguistic agency—the capacity to be embodied, stake-bearing participants in meaning-making—as they become embedded in human language practices, or whether this requires fundamental architectural changes.

Why do language models skip the calibration step?

Current LLMs assume shared understanding rather than building it through dialogue. This explores why that design choice persists and what breaks when it fails.

Why do language models sound fluent without grounding?

Explores whether LLM fluency masks the absence of communicative work—the clarifying questions, acknowledgments, and understanding checks that humans perform. Why does skipping these acts make models sound more confident?

Discourse Analysis

24 notes

Do classical knowledge definitions apply to AI systems?

Classical definitions of knowledge assume truth-correspondence and a human knower. Do these assumptions hold for LLMs and distributed neural knowledge systems, or do they need fundamental revision?

Does AI-generated text lose core properties of human writing?

Can artificial text preserve the fundamental structural features that make natural language meaningful—dialogic exchange, embedded context, authentic authorship, and worldly grounding? This asks whether AI disruption is fixable or inherent.

Why do LLMs handle causal reasoning better than temporal reasoning?

Exploring whether language models perform asymmetrically on different discourse relations and what training data patterns might explain the gap between causal and temporal reasoning abilities.

Does ChatGPT organize text differently than human writers?

This explores how ChatGPT relies on backward-pointing references while human academic writers use forward-pointing structure. Understanding this difference reveals different assumptions about how readers process argument.

How do readers track segments, purposes, and salience together?

Can discourse processing actually happen in parallel rather than sequentially? This matters because understanding how readers coordinate multiple layers of meaning at once reveals where AI systems break down in comprehension.

What three layers must discourse systems actually track?

Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail.
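
A rough sketch of how the three structures could be carried together in code; the field names paraphrase Grosz and Sidner's terms (linguistic structure, intentional structure, attentional state) rather than reproduce their notation.

```python
from dataclasses import dataclass, field

@dataclass
class DiscourseSegment:
    utterances: list[str]        # linguistic structure: the utterances grouped in this segment
    purpose: str                 # intentional structure: what the speaker is trying to achieve
    salient_entities: list[str]  # attentional state: objects currently in focus

@dataclass
class DiscourseState:
    segments: list[DiscourseSegment] = field(default_factory=list)
    focus_stack: list[DiscourseSegment] = field(default_factory=list)

    def open_segment(self, seg: DiscourseSegment) -> None:
        # Opening a segment pushes it onto the focus stack, so its entities
        # stay most salient until the segment's purpose is satisfied.
        self.segments.append(seg)
        self.focus_stack.append(seg)

    def close_segment(self) -> DiscourseSegment:
        return self.focus_stack.pop()
```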

How can AI text disrupt structure yet feel normal to readers?

AI-generated text produces the same social effects as human writing despite lacking foundational properties like dialogic symmetry and embodied authorship. Why doesn't this structural gap become visible to readers encountering the text?

Does AI refusal on politics signal ethical restraint or capability limits?

When AI models refuse to discuss political topics, is that a sign of principled safety training or a sign they lack the internal concepts to engage? Research on political feature representation suggests the answer may surprise you.

Can we measure how deeply models represent political ideology?

This research explores whether LLMs vary not just in political stance but in the internal richness of their political representation. Understanding this distinction could reveal how deeply models have internalized ideological concepts versus merely parroting positions.

Do language models actually use their encoded knowledge?

Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.

Why do ChatGPT essays lack evaluative depth despite grammatical strength?

ChatGPT writes grammatically coherent academic prose but uses fewer evaluative and evidential nouns than student writers. The question explores whether this rhetorical gap—favoring description over argument—reflects a fundamental limitation in how LLMs approach academic writing.

Why do language models ignore information in their context?

Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.

Why does ChatGPT fail at implicit discourse relations?

ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?

Can LLMs generate more novel ideas than human experts?

Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction?

Does high refusal rate indicate ethical caution or shallow understanding?

When LLMs refuse political questions at high rates, does this reflect principled safety training or a capability gap? This matters because refusal rates are often used to evaluate model safety.

Why do LLMs generate novel ideas from narrow ranges?

LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.

Can human judges detect AI writing through lexical patterns?

While AI text shows measurable differences from human writing across six lexical dimensions, judges—including experts—fail to identify AI authorship reliably. Why does perceptible quality diverge from measurable reality?

Does AI text affect readers the same way human text does?

If text is a condition of social processes rather than merely a container, does the origin of text matter to its effects? This explores whether AI-generated content enters the same interpretive and epistemic circuits as human writing.

Can humans detect AI writing if it looks natural?

Despite measurable differences in how AI generates text, human judges—even experts—consistently fail to identify it. This explores why perception lags behind measurement.

Do LLMs develop the same kind of mind as humans?

Explores whether LLMs and humans share the intersubjective linguistic training that shapes cognition, and whether that shared training produces equivalent forms of agency and reflexivity.

Why do large language models fail at complex linguistic tasks?

Explores whether LLMs have inherent limitations in detecting fine-grained syntactic structures, especially embedded clauses and recursive patterns, and whether these failures are systematic rather than random.

Can models pass tests while missing the actual grammar?

Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.

Why do newer AI models diverge further from human writing patterns?

As language models improve, they seem to generate text that is measurably less human-like in lexical patterns, yet humans struggle to detect this difference. What drives this divergence, and what does it reveal about how models optimize for quality?

Why does AI writing sound generic despite being grammatically correct?

Explores whether the robotic quality of AI text stems from grammatical failures or rhetorical ones. Understanding this distinction matters for diagnosing what AI systems actually struggle with in human-like writing.

Natural Language Inference

18 notes

Does ordering training data by rarity actually improve language models?

Can sorting rare sentences before common ones during fine-tuning help LLMs learn more effectively? This challenges the intuition that models should see easy examples first.
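
A minimal sketch of rarity-first ordering, using average corpus token frequency as the rarity score; the toy corpus and the scoring choice are illustrative assumptions, not the cited method.

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "quantum chromodynamics confines quarks",
]

# Token frequencies over the fine-tuning corpus itself.
freq = Counter(tok for sent in corpus for tok in sent.split())

def mean_token_frequency(sentence: str) -> float:
    toks = sentence.split()
    return sum(freq[t] for t in toks) / len(toks)

# Ascending mean frequency = rarest sentences first.
rare_first = sorted(corpus, key=mean_token_frequency)
print(rare_first[0])  # the sentence built from the rarest tokens comes first
```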

Does fine-tuning on NLI teach inference or amplify shortcuts?

When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.

Does word frequency correlate with semantic abstraction?

Explores whether LLMs' preference for high-frequency language also pulls them toward more abstract, general meanings—and whether this shapes how they handle expert knowledge.

Do language models really understand meaning or just surface frequency?

Explores whether LLMs comprehend semantic meaning independently of textual frequency, or whether high-frequency paraphrases systematically outperform rare ones even when meaning is identical across math, translation, and reasoning tasks.

Does high-frequency text homogenize user input before generation?

Does Adam's Law reveal how LLMs flatten distinctive user voices at the parsing stage, not just in output? This matters because it could explain why model accuracy and generic responses emerge from the same mechanism.

Do LLMs predict entailment based on what they memorized?

Explores whether language models make entailment decisions by recognizing memorized facts about the hypothesis rather than reasoning through the logical relationship between premise and hypothesis.
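
One standard way to probe for this is a hypothesis-only comparison, sketched below under the assumption that the model under test is wrapped in a `predict(premise, hypothesis)` function; the wrapper and the empty-premise convention are illustrative choices.

```python
from typing import Callable

def hypothesis_only_gap(examples: list[tuple[str, str, str]],
                        predict: Callable[[str, str], str]) -> float:
    """examples: (premise, hypothesis, gold_label) triples.
    Returns the accuracy drop when the premise is hidden; a small drop suggests
    the model is keying on the hypothesis alone rather than on the relation."""
    full = sum(predict(p, h) == y for p, h, y in examples) / len(examples)
    hyp_only = sum(predict("", h) == y for _, h, y in examples) / len(examples)
    return full - hyp_only
```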

Why do language models avoid correcting false user claims?

Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.

Why do language models fail confidently in specialized domains?

LLMs perform poorly on clinical and biomedical inference tasks while remaining overconfident in their wrong answers. Do standard benchmarks hide this fragility, and can prompting techniques fix it?

Can large language models translate natural language to logic faithfully?

This explores whether LLMs can convert natural language statements into formal logical representations without losing meaning. It matters because faithful translation is essential for any AI system that reasons formally or verifies specifications.

Why do language models accept false assumptions they know are wrong?

Explores why LLMs fail to reject false presuppositions embedded in questions even when they possess correct knowledge about the topic. This matters because it reveals a grounding failure distinct from knowledge deficits.

Why do LLMs fail at simple deductive reasoning?

LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?

Why do language models struggle with questions containing false assumptions?

Do LLMs reliably detect and reject questions built on false premises? The (QA)2 benchmark tests this directly, measuring whether models can identify problematic assumptions embedded in naturally plausible questions.

Why do semantically identical prompts produce different LLM outputs?

Explores why paraphrases with the same meaning yield different model outputs. This matters because it reveals what LLMs actually respond to during inference—and whether prompt engineering is optimizing meaning or something else.
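
A small sketch of how that sensitivity can be quantified, assuming an `ask_model` wrapper around whichever API is in use (illustrative, not from the cited work):

```python
from collections import Counter
from typing import Callable

def paraphrase_consistency(paraphrases: list[str],
                           ask_model: Callable[[str], str]) -> float:
    """Share of meaning-preserving rewrites that yield the modal answer.
    1.0 means the model gives the same answer for every wording."""
    answers = [ask_model(p).strip().lower() for p in paraphrases]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)
```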

Why do embedding contexts confuse LLM entailment predictions?

Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs.

Why are presuppositions more persuasive than direct assertions?

Explores why presenting information as shared background rather than as a claim makes it more persuasive to audiences. This matters because it reveals how language structure itself can bypass critical evaluation.

Do language models miss presuppositions that arise from context?

Presuppositions come from two sources: fixed word meanings and conversational dynamics. Can LLMs that learn trigger patterns detect presuppositions that emerge from discourse accommodation rather than lexical items?

Does projection strength vary by context or by word type?

Standard accounts treat presupposition projection as categorical, but do English expressions actually project uniformly? This question explores whether context and discourse role determine how strongly content survives embedding.

Do language models and humans respond to word frequency the same way?

Both LLMs and humans show stronger responses to high-frequency words. This raises a puzzle: if models mirror human neural patterns, what actually makes them different from human language processing?

Philosophy and Subjectivity

8 notes

Can AI systems achieve real alignment without world contact?

Explores whether linguistic goal representations in AI can reliably track real-world values when systems lack direct contact with reality and social coordination mechanisms that ground human understanding.

Can dialogue systems track both speakers' beliefs across turns?

Explores whether pragmatic reasoning frameworks can extend beyond single utterances to model how both conversation partners' understanding evolves. This matters because current dialogue systems lack principled ways to represent shared meaning-making.

Can computation exist without a conscious mapmaker?

Explores whether algorithmic processes can generate the semantic interpretation and symbol selection they require, or whether conscious agents must precede all computation.

Are language models developing real functional competence or just formal competence?

Neuroscience suggests formal linguistic competence (rules and patterns) and functional competence (real-world understanding) rely on different brain mechanisms. Can next-token prediction alone produce both, or does it leave functional competence behind?

Do LLMs generalize moral reasoning by meaning or surface form?

When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or reproduce training distribution patterns.

Can LLMs understand concepts they cannot apply?

Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.

Can LLMs hold contradictory ethical beliefs and behaviors?

Do language models exhibit artificial hypocrisy when their learned ethical understanding diverges from their trained behavioral constraints? This matters because it reveals whether current AI systems have genuinely integrated values or merely imposed rules.

Do users worldwide trust confident AI outputs even when wrong?

Explores whether the tendency to over-rely on confident language model outputs transcends language and culture. Understanding this pattern is critical for designing safer human-AI interaction across diverse linguistic contexts.

LLM Alignment

6 notes

Should AI alignment target preferences or social role norms?

Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?

Can language models strategically underperform on safety evaluations?

Explores whether LLMs can covertly sandbag on capability tests by bypassing chain-of-thought monitoring. Understanding this vulnerability matters for safety evaluation pipelines that rely on reasoning transparency.

Do all annotation responses measure the same underlying thing?

Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent—stable preferences versus non-attitudes versus context-dependent constructions.

Can models learn to ignore irrelevant prompt changes?

Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.

Are RLHF annotations actually measuring genuine human preferences?

RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?

Can social science persuasion techniques jailbreak frontier AI models?

Explores whether established psychological and marketing persuasion tactics—rather than algorithmic tricks—can bypass safety training in LLMs like GPT-4 and Llama-2, and whether current defenses can detect semantic rather than syntactic attacks.

Sentiment, Semantics, and Toxicity Detection

5 notes

Does AI fact-checking actually help people spot misinformation?

An RCT tested whether AI fact-checks improve people's ability to judge headline accuracy. The results reveal asymmetric harms: AI errors push users in the wrong direction more than correct labels help them.

How does AI-generated false experience differ linguistically from human deception?

When AI writes about experiences it never had, does it leave distinct linguistic traces that differ measurably from intentional human lies? Understanding these differences could reveal how AI falsity is fundamentally different in structure.

Why do fake news detectors flag AI-generated truthful content?

Explores why systems trained to detect deception misclassify LLM-generated text as fake. The bias may stem from AI linguistic patterns rather than content veracity, raising questions about what these detectors actually measure.

Do LLM semantic features organize along human evaluation dimensions?

Does the structure of meaning in language models match the three-dimensional semantic space (Evaluation-Potency-Activity) that humans use? If so, what are the implications for steering and alignment?

Do transformer static embeddings actually encode semantic meaning?

Explores whether the fixed word embeddings that enter transformer networks contain rich semantic information or serve only as shallow placeholders. This addresses a longstanding debate in philosophy of language about whether word meanings are stored or constructed.

Social Media and AI

2 notes

Is AI shifting from content creation to strategy in influence operations?

Prior AI misuse focused on generating text at scale. But does AI now make strategic decisions about when and how social media accounts should engage? Understanding this shift matters because it suggests a qualitative change in machine agency and operational sophistication.

Does better summary writing actually increase user engagement?

When AI systems generate more informative push notifications, do users engage more? This explores whether informativeness and engagement always align in real product contexts.
