Language, Text, and Discourse

Research on how language models process natural language and text — linguistics and inference, argumentation, sentiment, discourse, reading and summarization, and social-media and web text.

132 notes (primary) · 508 papers · 8 sub-topics

Argumentation and Persuasion

18 notes

Do LLM arguments actually argue better than humans?

LLM counter-arguments score higher on textbook quality markers like logical soundness and respectful tone, while human arguments show more creativity and emotional intensity. What does this gap reveal about how we measure argumentative quality?

Do LLM counter-arguments mirror writing style more than humans?

When language models generate arguments against social media posts, do they unconsciously adopt the stylistic features of what they're arguing against? This matters because it could reveal a detectable pattern that distinguishes LLM-written rebuttals from human-written ones.

Can LLMs persuade without actually understanding arguments?

Do large language models successfully influence people through debate while lacking the ability to comprehend the arguments they're making? This matters because persuasion and comprehension might be independent capabilities.

Why are complex LLM arguments as persuasive as simple ones?

Standard persuasion research predicts that simpler, easier-to-read arguments persuade better. But LLM-generated text breaks this rule—it's measurably more complex yet equally convincing. What explains this reversal?

Do LLMs and humans persuade through the same mechanisms?

If AI and human arguments convince readers equally well, do they work the same way under the surface? This matters for understanding whether AI persuasion is fundamentally equivalent to human persuasion or just superficially similar.

Can large language models classify argument schemes reliably?

Explores whether LLMs can recognize Walton's 60+ argument schemes—abstract patterns of reasoning rather than surface features—and what conditions enable accurate classification.

Do LLMs use moral language more than humans?

This explores whether large language models rely more heavily on appeals to care, fairness, authority, and sanctity than human arguers do, and whether this difference persists when emotional tone remains equivalent.

Do LLM judges systematically favor LLM-generated arguments?

When LLMs evaluate debates between human and AI-written arguments, do they show a built-in preference for AI writing? This matters because it could corrupt feedback loops used to train models.

Do LLMs and humans persuade through the same mechanisms?

If LLM and human arguments achieve equal persuasive force, does that mean they work the same way? This explores whether equivalent outcomes hide fundamentally different rhetorical strategies.

Can models learn argument quality from labeled examples alone?

Explores whether fine-tuning on quality-labeled examples teaches models the underlying criteria for evaluating arguments, or merely surface patterns. Matters because high-stakes assessment tasks depend on reliable, transferable quality judgment.

Why do different people reconstruct the same argument differently?

When humans and LLMs extract logical structure from arguments, they produce different reconstructions. Is this disagreement a problem to solve, or does it reveal something fundamental about how arguments work?

Why does argument scheme classification stumble where other NLP tasks succeed?

Explores whether the abstract, relational nature of argument schemes makes them harder to classify than concrete argument components or stance. Matters because understanding this difficulty gap could improve scheme recognition systems.

Can LLMs identify the hidden assumptions that make arguments work?

LLMs recognize what arguments claim and what evidence they offer, but struggle to identify implicit warrants—the unstated principles that connect evidence to conclusion. This matters because valid reasoning requires understanding these hidden logical bridges.

Can simple linguistic features detect AI-written arguments?

Can interpretable linguistic patterns reliably distinguish LLM-generated counter-arguments from human-written ones in persuasive contexts? This matters because simple, auditable detection might outperform expensive neural approaches.

Does what readers believe matter more than what debaters say?

Do audience prior beliefs predict persuasion outcomes better than the linguistic features of debate arguments? This explores whether persuasion is fundamentally shaped by reader ideology rather than speaker language.

Can formal argumentation make AI decisions truly contestable?

Explores whether structuring AI decisions as formal argument graphs (with explicit attacks and defenses) enables users to meaningfully challenge and navigate reasoning in ways unstructured LLM outputs cannot.

Why do LLM audiences shift views more than debaters?

When LLMs argue with people, the direct participants barely change their minds—but audiences reading the same debate shift significantly. Why does engagement protect beliefs instead of opening them?

Do linguistic features of persuasion stay the same across audiences?

When researchers study what language makes arguments persuasive, do they account for who is listening? Without controlling for reader beliefs, do findings about persuasive language actually reflect audience effects instead?

NLP and Linguistics

14 notes

What hidden assumptions drive how we build language models?

Large language models rest on two unstated assumptions about language and data. Understanding what engineers assume—and what enactive linguistics challenges—matters for knowing what LLMs actually can and cannot do.

Do language models learn abstract grammar or cultural speech patterns?

LLMs might learn more than grammar rules—they could be learning who says what to whom and when. This matters because it changes how we understand what biases and persona effects actually represent.

Can language models learn meaning without engaging the world?

Explores whether LLMs prove that meaning emerges from relational structure alone, independent of embodied experience or external reference. Tests structuralist theory empirically.

Why do speakers deliberately use ambiguous language?

Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems.

Why do speakers need to actively calibrate shared reference?

Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.

Can language models adapt implicature to conversational context?

Do large language models flexibly modulate scalar implicatures based on information structure, face-threatening situations, and explicit instructions—as humans do? This tests whether pragmatic computation is truly context-sensitive or merely literal.

Should we call LLM errors hallucinations or fabrications?

Does the language we use to describe LLM failures shape the technical solutions we build? Examining whether perceptual and psychological frameworks misdiagnose what's actually happening.

Can language models actually analyze language structure?

Explores whether LLMs can move beyond pattern matching to perform genuine metalinguistic analysis like syntactic tree construction and phonological reasoning, and what enables this capability.

Can large language models develop genuine world models without direct environmental contact?

Do LLMs extract meaningful world structures from human-generated text despite lacking direct sensory access to reality? This matters for understanding what kind of grounding and knowledge these systems actually possess.

Can language models recognize when text is deliberately ambiguous?

Explores whether LLMs can identify and handle multiple valid interpretations in a single phrase—a core human language skill that appears largely absent in current models despite their fluency on standard tasks.

Do language models actually build shared understanding in conversation?

When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.

Why do language models fail at communicative optimization?

LLMs excel at learning surface statistical patterns from text but struggle with deeper principles of how language achieves efficient communication. What distinguishes these two types of linguistic knowledge?

Do standard NLP benchmarks hide LLM ambiguity failures?

When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do?

Why do readers interpret the same sentence so differently?

How much of annotation disagreement in NLP reflects genuine interpretive multiplicity rather than error? This explores whether social position and moral framing systematically generate competing but equally valid readings.

Discourse Analysis

14 notes

Why do LLMs generate ideas the research community already explores?

LLMs inherit the distribution of published literature, concentrating ideation where researchers have already invested conceptual effort. This raises a core question: can AI ideation complement rather than duplicate human research directions?

Does AI-generated text lose core properties of human writing?

Can artificial text preserve the fundamental structural features that make natural language meaningful—dialogic exchange, embedded context, authentic authorship, and worldly grounding? This asks whether AI disruption is fixable or inherent.

Does ChatGPT organize text differently than human writers?

This explores how ChatGPT relies on backward-pointing references while human academic writers use forward-pointing structure. Understanding this difference reveals different assumptions about how readers process argument.

How do readers track segments, purposes, and salience together?

Can discourse processing actually happen in parallel rather than sequentially? This matters because understanding how readers coordinate multiple layers of meaning at once reveals where AI systems break down in comprehension.

What three layers must discourse systems actually track?

Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail.

How can AI text disrupt structure yet feel normal to readers?

AI-generated text produces the same social effects as human writing despite lacking foundational properties like dialogic symmetry and embodied authorship. Why doesn't this structural gap become visible to readers encountering the text?

Can we measure how deeply models represent political ideology?

This research explores whether LLMs vary not just in political stance but in the internal richness of their political representation. Understanding this distinction could reveal how deeply models have internalized ideological concepts versus merely parroting positions.

Why do ChatGPT essays lack evaluative depth despite grammatical strength?

ChatGPT writes grammatically coherent academic prose but uses fewer evaluative and evidential nouns than student writers. The question explores whether this rhetorical gap—favoring description over argument—reflects a fundamental limitation in how LLMs approach academic writing.

Why does ChatGPT fail at implicit discourse relations?

ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?

Can human judges detect measurable differences in AI text?

Research shows LLM text differs statistically across six lexical dimensions, but human readers—even experts—cannot reliably identify which texts are AI-generated. Why does measurement succeed where human perception fails?

Does AI text affect readers the same way human text does?

If text is a condition of social processes rather than merely a container, does the origin of text matter to its effects? This explores whether AI-generated content enters the same interpretive and epistemic circuits as human writing.

Can humans detect AI text if machines can measure it?

AI-generated text shows measurable differences from human writing across multiple linguistic dimensions, yet human judges consistently fail to identify it. Why does the gap between what is measurable and what is perceptible exist?

Why do newer AI models diverge further from human writing patterns?

As language models improve, they seem to generate text that is measurably less human-like in lexical patterns, yet humans struggle to detect this difference. What drives this divergence, and what does it reveal about how models optimize for quality?

Why does AI writing sound generic despite being grammatically correct?

Explores whether the robotic quality of AI text stems from grammatical failures or rhetorical ones. Understanding this distinction matters for diagnosing what AI systems actually struggle with in human-like writing.

Natural Language Inference

13 notes

Does ordering training data by rarity actually improve language models?

Can sorting rare sentences before common ones during fine-tuning help LLMs learn more effectively? This challenges the intuition that models should see easy examples first.

Does word frequency correlate with semantic abstraction?

Explores whether LLMs' preference for high-frequency language also pulls them toward more abstract, general meanings—and whether this shapes how they handle expert knowledge.

Do language models really understand meaning or just surface frequency?

Explores whether LLMs comprehend semantic meaning independently of textual frequency, or whether high-frequency paraphrases systematically outperform rare ones even when meaning is identical across math, translation, and reasoning tasks.

Does high-frequency text homogenize user input before generation?

Does Adam's Law reveal how LLMs flatten distinctive user voices at the parsing stage, not just in output? This matters because it could explain why model accuracy and generic responses emerge from the same mechanism.

Why do language models fail confidently in specialized domains?

LLMs perform poorly on clinical and biomedical inference tasks while remaining overconfident in their wrong answers. Do standard benchmarks hide this fragility, and can prompting techniques fix it?

Can large language models translate natural language to logic faithfully?

This explores whether LLMs can convert natural language statements into formal logical representations without losing meaning. It matters because faithful translation is essential for any AI system that reasons formally or verifies specifications.

Why do language models accept false assumptions they know are wrong?

Explores why LLMs fail to reject false presuppositions embedded in questions even when they possess correct knowledge about the topic. This matters because it reveals a grounding failure distinct from knowledge deficits.

Why do semantically identical prompts produce different LLM outputs?

Explores why paraphrases with the same meaning yield different model outputs. This matters because it reveals what LLMs actually respond to during inference—and whether prompt engineering is optimizing meaning or something else.

Why do embedding contexts confuse LLM entailment predictions?

Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs.

Why are presuppositions more persuasive than direct assertions?

Explores why presenting information as shared background rather than as a claim makes it more persuasive to audiences. This matters because it reveals how language structure itself can bypass critical evaluation.

Do language models miss presuppositions that arise from context?

Presuppositions come from two sources: fixed word meanings and conversational dynamics. Can LLMs that learn trigger patterns detect presuppositions that emerge from discourse accommodation rather than lexical items?

Does projection strength vary by context or by word type?

Standard accounts treat presupposition projection as categorical, but do English expressions actually project uniformly? This question explores whether context and discourse role determine how strongly content survives embedding.

Do language models and humans respond to word frequency the same way?

Both LLMs and humans show stronger responses to high-frequency words. This raises a puzzle: if models mirror human neural patterns, what actually makes them different from human language processing?

Sentiment, Semantics, and Toxicity Detection

5 notes

Does AI fact-checking actually help people spot misinformation?

An RCT tested whether AI fact-checks improve people's ability to judge headline accuracy. The results reveal asymmetric harms: AI errors push users in the wrong direction more than correct labels help them.

How does AI-generated false experience differ linguistically from human deception?

When AI writes about experiences it never had, does it leave distinct linguistic traces that differ measurably from intentional human lies? Understanding these differences could reveal how AI falsity is fundamentally different in structure.

Why do fake news detectors flag AI-generated truthful content?

Fake news detectors may systematically misclassify LLM-generated text as deceptive. We explore whether this bias stems from detecting AI style rather than actual falsehood, and what that means for detection accuracy.

Do LLM semantic features organize along human evaluation dimensions?

Does the structure of meaning in language models match the three-dimensional semantic space (Evaluation-Potency-Activity) that humans use? If so, what are the implications for steering and alignment?

Do transformer static embeddings actually encode semantic meaning?

Explores whether the fixed word embeddings that enter transformer networks contain rich semantic information or serve only as shallow placeholders. This addresses a longstanding debate in philosophy of language about whether word meanings are stored or constructed.

Browser Agents

1 note

Can LLMs predict demographics from social media usernames alone?

This explores whether web-browsing language models can infer personal attributes like gender, age, and political orientation from just a username and public profile. The finding matters because it reveals a privacy vulnerability that traditional API-based assumptions didn't anticipate.

Does better summary writing actually increase user engagement?
