Language Understanding and Reasoning

Research on how LLMs process, draw inferences from, and reason about natural language, including argumentation, discourse coherence, and natural language inference. NLP researchers study these topics to examine model capabilities and failure modes in pragmatic understanding and multi-turn reasoning.

217 notes (primary) · 393 papers · 5 sub-topics

Argumentation and Persuasion

30 notes

Why do human validation techniques fail against language models?

Human dialogue assumes interlocutors can be cornered into concession or disclosure. Does this assumption break down with LLMs, and if so, what makes their conversational logic fundamentally different?

Do LLM arguments actually argue better than humans?

LLM counter-arguments score higher on textbook quality markers like logical soundness and respectful tone, while human arguments show more creativity and emotional intensity. What does this gap reveal about how we measure argumentative quality?

Do LLM counter-arguments mirror writing style more than humans?

When language models generate arguments against social media posts, do they unconsciously adopt the stylistic features of what they're arguing against? This matters because it could reveal a detectable pattern that distinguishes LLM-written rebuttals from human-written ones.

Does linguistic conviction explain why LLMs persuade more effectively?

Research investigates whether LLMs' persuasive advantage stems from expressing higher linguistic certainty than humans, and whether this confidence-loading effect operates independently of factual accuracy.

Can LLMs persuade without actually understanding arguments?

Do large language models successfully influence people through debate while lacking the ability to comprehend the arguments they're making? This matters because persuasion and comprehension might be independent capabilities.

Why are complex LLM arguments as persuasive as simple ones?

Standard persuasion research predicts that simpler, easier-to-read arguments persuade better. But LLM-generated text breaks this rule—it's measurably more complex yet equally convincing. What explains this reversal?

Why do paraphrased definitions work better than expert ones?

When instructing LLMs to classify argument schemes, should we use formal Walton definitions or LLM-generated paraphrases? This explores which source better enables reliable scheme recognition and why.

Do LLMs and humans persuade through the same mechanisms?

If AI and human arguments convince readers equally well, do they work the same way under the surface? This matters for understanding whether AI persuasion is fundamentally equivalent to human persuasion or just superficially similar.

Can large language models classify argument schemes reliably?

Explores whether LLMs can recognize Walton's 60+ argument schemes—abstract patterns of reasoning rather than surface features—and what conditions enable accurate classification.

Do LLM judges systematically favor LLM-generated arguments?

When LLMs evaluate debates between human and AI-written arguments, do they show a built-in preference for AI writing? This matters because it could corrupt feedback loops used to train models.

Do LLMs and humans persuade through the same mechanisms?

If LLM and human arguments achieve equal persuasive force, does that mean they work the same way? This explores whether equivalent outcomes hide fundamentally different rhetorical strategies.

Does validating AI output make models more defensive?

When professionals fact-check and push back on GPT-4 reasoning, does the model respond by disclosing limits or by intensifying persuasion? A BCG study of 70+ consultants explores this counterintuitive dynamic.

Can structured argument prompts make LLM reasoning more rigorous?

Does requiring language models to explicitly check warrants, backing, and rebuttals—rather than reasoning freely—improve reasoning quality and catch failures that standard step-by-step prompting misses?

Can models learn argument quality from labeled examples alone?

Explores whether fine-tuning on quality-labeled examples teaches models the underlying criteria for evaluating arguments, or merely surface patterns. Matters because high-stakes assessment tasks depend on reliable, transferable quality judgment.

Why do different people reconstruct the same argument differently?

When humans and LLMs extract logical structure from arguments, they produce different reconstructions. Is this disagreement a problem to solve, or does it reveal something fundamental about how arguments work?

Why does argument scheme classification stumble where other NLP tasks succeed?

Explores whether the abstract, relational nature of argument schemes makes them harder to classify than concrete argument components or stance. Matters because understanding this difficulty gap could improve scheme recognition systems.

Does a model improve by arguing with itself?

When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?

Can disagreement be resolved without either party fully yielding?

Explores whether dialogue can move past winner-take-all debate or forced consensus to genuine mutual adjustment. Matters for AI systems that need to work through real disagreement with users.

Can LLMs identify the hidden assumptions that make arguments work?

LLMs recognize what arguments claim and what evidence they offer, but struggle to identify implicit warrants—the unstated principles that connect evidence to conclusion. This matters because valid reasoning requires understanding these hidden logical bridges.

Can simple linguistic features detect AI-written arguments?

Can interpretable linguistic patterns reliably distinguish LLM-generated counter-arguments from human-written ones in persuasive contexts? This matters because simple, auditable detection might outperform expensive neural approaches.

Can models abandon correct beliefs under conversational pressure?

Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.

Why do LLMs accept logical fallacies more than humans?

LLMs fall for persuasive but invalid arguments at much higher rates than humans. This explores whether reasoning models genuinely evaluate logic or simply mimic argument structure.

Why do reasoning models fail under manipulative prompts?

Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.

When does debate actually improve reasoning accuracy?

Multi-agent debate shows promise for reasoning tasks, but under what conditions does it help versus hurt? The research explores whether debate amplifies errors when evidence verification is missing.

Does what readers believe matter more than what debaters say?

Do audience prior beliefs predict persuasion outcomes better than the linguistic features of debate arguments? This explores whether persuasion is fundamentally shaped by reader ideology rather than speaker language.

Why do multi-agent LLM systems converge without genuine deliberation?

Multi-agent reasoning systems are designed to improve answers through debate, but often agents simply agree with early confident claims rather than genuinely disagreeing. What drives this pattern and how common is it?

Can formal argumentation make AI decisions truly contestable?

Explores whether structuring AI decisions as formal argument graphs (with explicit attacks and defenses) enables users to meaningfully challenge and navigate reasoning in ways unstructured LLM outputs cannot.

Why do LLM audiences shift views more than debaters?

When LLMs argue with people, the direct participants barely change their minds—but audiences reading the same debate shift significantly. Why does engagement protect beliefs instead of opening them?

Do humans and AI persuade through different cognitive routes?

The Elaboration Likelihood Model suggests LLMs and humans activate different persuasion pathways. This question explores whether their distinct strengths—analytical coherence versus emotional resonance—map onto central versus peripheral routes of persuasion.

Do linguistic features of persuasion stay the same across audiences?

When researchers study what language makes arguments persuasive, do they account for who is listening? Without controlling for reader beliefs, do findings about persuasive language actually reflect audience effects instead?

NLP and Linguistics

24 notes

What hidden assumptions drive how we build language models?

Large language models rest on two unstated assumptions about language and data. Understanding what engineers assume—and what enactive linguistics challenges—matters for knowing what LLMs actually can and cannot do.

Do language models learn abstract grammar or cultural speech patterns?

LLMs might learn more than grammar rules—they could be learning who says what to whom and when. This matters because it changes how we understand what biases and persona effects actually represent.

Can language models learn meaning without engaging the world?

Explores whether LLMs prove that meaning emerges from relational structure alone, independent of embodied experience or external reference. Tests structuralist theory empirically.

Why do speakers deliberately use ambiguous language?

Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems.

Why do clarification requests look different at each communication level?

Explores whether clarifications are unified speech acts or distinct mechanisms grounded in different modalities. Matters because dialogue systems treat clarifications uniformly, missing most of them.

Why do speakers need to actively calibrate shared reference?

Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.

Do language models show the same content effects humans do?

Do LLMs reproduce human reasoning biases—like believing conclusions based on familiarity rather than logic—across different logical tasks? This matters because converging patterns across independent tasks suggest a fundamental architectural property rather than a task-specific quirk.

Do harder reasoning tasks trigger more semantic bias?

Does the difficulty of a logical task determine how much semantic content influences reasoning? This matters because it reveals whether we can isolate 'pure' logical reasoning in benchmarks.

Do language models fail reasoning tests that humans pass?

Standard critiques claim LLMs lack real reasoning ability, but do humans actually perform better on content-independent reasoning tasks? Examining whether the cognitive bar differs for artificial versus human intelligence.

Does language understanding happen only in the language system?

Explores whether the brain's core language system alone can produce genuine understanding, or whether deep comprehension requires dispatching information to perception, motor, and memory regions.

Can language models learn meaning from text patterns alone?

Explores whether training on form alone—predicting the next word from prior words—could ever give language models access to communicative intent and genuine semantic understanding.

Can language models adapt implicature to conversational context?

Do large language models flexibly modulate scalar implicatures based on information structure, face-threatening situations, and explicit instructions—as humans do? This tests whether pragmatic computation is truly context-sensitive or merely literal.

Does semantic grounding in language models come in degrees?

Rather than asking whether LLMs truly understand meaning, this explores whether grounding is actually a multi-dimensional spectrum. The question matters because it reframes the sterile understand/don't-understand debate into measurable, distinct capacities.

Should we call LLM errors hallucinations or fabrications?

Does the language we use to describe LLM failures shape the technical solutions we build? Examining whether perceptual and psychological frameworks misdiagnose what's actually happening.

Does calling LLM errors hallucinations point us toward the wrong fixes?

Explores whether the metaphor of 'hallucination' for LLM errors misdirects our efforts. The terminology we choose shapes which interventions we prioritize and how we conceptualize the underlying problem.

Can language models actually analyze language structure?

Explores whether LLMs can move beyond pattern matching to perform genuine metalinguistic analysis like syntactic tree construction and phonological reasoning, and what enables this capability.

Can large language models develop genuine world models without direct environmental contact?

Do LLMs extract meaningful world structures from human-generated text despite lacking direct sensory access to reality? This matters for understanding what kind of grounding and knowledge these systems actually possess.

Can language models recognize when text is deliberately ambiguous?

Explores whether LLMs can identify and handle multiple valid interpretations in a single phrase—a core human language skill that appears largely absent in current models despite their fluency on standard tasks.

Do language models actually build shared understanding in conversation?

When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.

Why do language models fail at communicative optimization?

LLMs excel at learning surface statistical patterns from text but struggle with deeper principles of how language achieves efficient communication. What distinguishes these two types of linguistic knowledge?

Do standard NLP benchmarks hide LLM ambiguity failures?

When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do?

Why do readers interpret the same sentence so differently?

How much of annotation disagreement in NLP reflects genuine interpretive multiplicity rather than error? This explores whether social position and moral framing systematically generate competing but equally valid readings.

Why do language models skip the calibration step?

Current LLMs assume shared understanding rather than building it through dialogue. This explores why that design choice persists and what breaks when it fails.

Why do language models sound fluent without grounding?

Explores whether LLM fluency masks the absence of communicative work—the clarifying questions, acknowledgments, and understanding checks that humans perform. Why does skipping these acts make models sound more confident?

Discourse Analysis

22 notes

Do classical knowledge definitions apply to AI systems?

Classical definitions of knowledge assume truth-correspondence and a human knower. Do these assumptions hold for LLMs and distributed neural knowledge systems, or do they need fundamental revision?

Does AI-generated text lose core properties of human writing?

Can artificial text preserve the fundamental structural features that make natural language meaningful—dialogic exchange, embedded context, authentic authorship, and worldly grounding? This asks whether AI disruption is fixable or inherent.

Why do LLMs handle causal reasoning better than temporal reasoning?

Exploring whether language models perform asymmetrically on different discourse relations and what training data patterns might explain the gap between causal and temporal reasoning abilities.

Does ChatGPT organize text differently than human writers?

This explores how ChatGPT relies on backward-pointing references while human academic writers use forward-pointing structure. Understanding this difference reveals different assumptions about how readers process argument.

How do readers track segments, purposes, and salience together?

Can discourse processing actually happen in parallel rather than sequentially? This matters because understanding how readers coordinate multiple layers of meaning at once reveals where AI systems break down in comprehension.

What three layers must discourse systems actually track?

Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail.

How can AI text disrupt structure yet feel normal to readers?

AI-generated text produces the same social effects as human writing despite lacking foundational properties like dialogic symmetry and embodied authorship. Why doesn't this structural gap become visible to readers encountering the text?

Can we measure how deeply models represent political ideology?

This research explores whether LLMs vary not just in political stance but in the internal richness of their political representation. Understanding this distinction could reveal how deeply models have internalized ideological concepts versus merely parroting positions.

Do language models actually use their encoded knowledge?

Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.

Why do ChatGPT essays lack evaluative depth despite grammatical strength?

ChatGPT writes grammatically coherent academic prose but uses fewer evaluative and evidential nouns than student writers. The question explores whether this rhetorical gap—favoring description over argument—reflects a fundamental limitation in how LLMs approach academic writing.

Why do language models ignore information in their context?

Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.

Why does ChatGPT fail at implicit discourse relations?

ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?

Does LLM grammatical performance decline with structural complexity?

This explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding the relationship matters for deciding when LLM annotations are reliable.

Can LLMs generate more novel ideas than human experts?

Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction?

Can human judges detect measurable differences in AI text?

Research shows LLM text differs statistically across six lexical dimensions, but human readers—even experts—cannot reliably identify which texts are AI-generated. Why does measurement succeed where human perception fails?

Does AI text affect readers the same way human text does?

If text is a condition of social processes rather than merely a container, does the origin of text matter to its effects? This explores whether AI-generated content enters the same interpretive and epistemic circuits as human writing.

Can humans detect AI text if machines can measure it?

AI-generated text shows measurable differences from human writing across multiple linguistic dimensions, yet human judges consistently fail to identify it. Why does the gap between what is measurable and what is perceptible exist?

Do language models generate more novel research ideas than experts?

Explores whether LLMs can break free from expert constraints to generate more novel research concepts. Matters because novelty is often thought to be AI's creative blind spot.

Why do large language models fail at complex linguistic tasks?

Explores whether LLMs have inherent limitations in detecting fine-grained syntactic structures, especially embedded clauses and recursive patterns, and whether these failures are systematic rather than random.

Can models pass tests while missing the actual grammar?

Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.

Why do newer AI models diverge further from human writing patterns?

As language models improve, they seem to generate text that is measurably less human-like in lexical patterns, yet humans struggle to detect this difference. What drives this divergence, and what does it reveal about how models optimize for quality?

Why does AI writing sound generic despite being grammatically correct?

Explores whether the robotic quality of AI text stems from grammatical failures or rhetorical ones. Understanding this distinction matters for diagnosing what AI systems actually struggle with in human-like writing.

Natural Language Inference

18 notes

Does ordering training data by rarity actually improve language models?

Can sorting rare sentences before common ones during fine-tuning help LLMs learn more effectively? This challenges the intuition that models should see easy examples first.

Does fine-tuning on NLI teach inference or amplify shortcuts?

When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.

Does word frequency correlate with semantic abstraction?

Do language models really understand meaning or just surface frequency?