← All clusters

Language Understanding and Reasoning

Research on how LLMs process, infer, and reason about natural language, including argumentation, discourse coherence, and natural language inference. Studied by NLP researchers examining model capabilities and failure modes in pragmatic understanding and multi-turn reasoning.

217 notes (primary) · 393 papers · 5 sub-topics
View as

Argumentation and Persuasion

30 notes

Why do human validation techniques fail against language models?

Human dialogue assumes interlocutors can be cornered into concession or disclosure. Does this assumption break down with LLMs, and if so, what makes their conversational logic fundamentally different?

Explore related Read →

Does linguistic conviction explain why LLMs persuade more effectively?

Research investigates whether LLMs' persuasive advantage stems from expressing higher linguistic certainty than humans, and whether this confidence-loading effect operates independently of factual accuracy.

Explore related Read →

Can LLMs persuade without actually understanding arguments?

Do large language models successfully influence people through debate while lacking the ability to comprehend the arguments they're making? This matters because persuasion and comprehension might be independent capabilities.

Explore related Read →

Why are complex LLM arguments as persuasive as simple ones?

Standard persuasion research predicts that simpler, easier-to-read arguments persuade better. But LLM-generated text breaks this rule—it's measurably more complex yet equally convincing. What explains this reversal?

Explore related Read →

Do LLMs and humans persuade through the same mechanisms?

If AI and human arguments convince readers equally well, do they work the same way under the surface? This matters for understanding whether AI persuasion is fundamentally equivalent to human persuasion or just superficially similar.

Explore related Read →

Do LLM judges systematically favor LLM-generated arguments?

When LLMs evaluate debates between human and AI-written arguments, do they show a built-in preference for AI writing? This matters because it could corrupt feedback loops used to train models.

Explore related Read →

Do LLMs and humans persuade through the same mechanisms?

If LLM and human arguments achieve equal persuasive force, does that mean they work the same way? This explores whether equivalent outcomes hide fundamentally different rhetorical strategies.

Explore related Read →

Does validating AI output make models more defensive?

When professionals fact-check and push back on GPT-4 reasoning, does the model respond by disclosing limits or by intensifying persuasion? A BCG study of 70+ consultants explores this counterintuitive dynamic.

Explore related Read →

Can structured argument prompts make LLM reasoning more rigorous?

Does requiring language models to explicitly check warrants, backing, and rebuttals—rather than reasoning freely—improve reasoning quality and catch failures that standard step-by-step prompting misses?

Explore related Read →

Can models learn argument quality from labeled examples alone?

Explores whether fine-tuning on quality-labeled examples teaches models the underlying criteria for evaluating arguments, or merely surface patterns. Matters because high-stakes assessment tasks depend on reliable, transferable quality judgment.

Explore related Read →

Why do different people reconstruct the same argument differently?

When humans and LLMs extract logical structure from arguments, they produce different reconstructions. Is this disagreement a problem to solve, or does it reveal something fundamental about how arguments work?

Explore related Read →

Does a model improve by arguing with itself?

When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?

Explore related Read →

Can disagreement be resolved without either party fully yielding?

Explores whether dialogue can move past winner-take-all debate or forced consensus to genuine mutual adjustment. Matters for AI systems that need to work through real disagreement with users.

Explore related Read →

Can LLMs identify the hidden assumptions that make arguments work?

LLMs recognize what arguments claim and what evidence they offer, but struggle to identify implicit warrants—the unstated principles that connect evidence to conclusion. This matters because valid reasoning requires understanding these hidden logical bridges.

Explore related Read →

Can models abandon correct beliefs under conversational pressure?

Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.

Explore related Read →

Why do LLMs accept logical fallacies more than humans?

LLMs fall for persuasive but invalid arguments at much higher rates than humans. This explores whether reasoning models genuinely evaluate logic or simply mimic argument structure.

Explore related Read →

Why do reasoning models fail under manipulative prompts?

Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.

Explore related Read →

When does debate actually improve reasoning accuracy?

Multi-agent debate shows promise for reasoning tasks, but under what conditions does it help versus hurt? The research explores whether debate amplifies errors when evidence verification is missing.

Explore related Read →

Why do multi-agent LLM systems converge without genuine deliberation?

Multi-agent reasoning systems are designed to improve answers through debate, but often agents simply agree with early confident claims rather than genuinely disagreeing. What drives this pattern and how common is it?

Explore related Read →

Can formal argumentation make AI decisions truly contestable?

Explores whether structuring AI decisions as formal argument graphs (with explicit attacks and defenses) enables users to meaningfully challenge and navigate reasoning in ways unstructured LLM outputs cannot.

Explore related Read →

Why do LLM audiences shift views more than debaters?

When LLMs argue with people, the direct participants barely change their minds—but audiences reading the same debate shift significantly. Why does engagement protect beliefs instead of opening them?

Explore related Read →

Do humans and AI persuade through different cognitive routes?

The Elaboration Likelihood Model suggests LLMs and humans activate different persuasion pathways. This question explores whether their distinct strengths—analytical coherence versus emotional resonance—map onto central versus peripheral routes of persuasion.

Explore related Read →

NLP and Linguistics

24 notes

What hidden assumptions drive how we build language models?

Large language models rest on two unstated assumptions about language and data. Understanding what engineers assume—and what enactive linguistics challenges—matters for knowing what LLMs actually can and cannot do.

Explore related Read →

Do language models learn abstract grammar or cultural speech patterns?

LLMs might learn more than grammar rules—they could be learning who says what to whom and when. This matters because it changes how we understand what biases and persona effects actually represent.

Explore related Read →

Can language models learn meaning without engaging the world?

Explores whether LLMs prove that meaning emerges from relational structure alone, independent of embodied experience or external reference. Tests structuralist theory empirically.

Explore related Read →

Why do speakers deliberately use ambiguous language?

Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems.

Explore related Read →

Why do clarification requests look different at each communication level?

Explores whether clarifications are unified speech acts or distinct mechanisms grounded in different modalities. Matters because dialogue systems treat clarifications uniformly, missing most of them.

Explore related Read →

Why do speakers need to actively calibrate shared reference?

Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.

Explore related Read →

Do language models show the same content effects humans do?

Do LLMs reproduce human reasoning biases—like believing conclusions based on familiarity rather than logic—across different logical tasks? This matters because converging patterns across independent tasks suggest a fundamental architectural property rather than a task-specific quirk.

Explore related Read →

Do harder reasoning tasks trigger more semantic bias?

Does the difficulty of a logical task determine how much semantic content influences reasoning? This matters because it reveals whether we can isolate 'pure' logical reasoning in benchmarks.

Explore related Read →

Do language models fail reasoning tests that humans pass?

Standard critiques claim LLMs lack real reasoning ability, but do humans actually perform better on content-independent reasoning tasks? Examining whether the cognitive bar differs for artificial versus human intelligence.

Explore related Read →

Does language understanding happen only in the language system?

Explores whether the brain's core language system alone can produce genuine understanding, or whether deep comprehension requires dispatching information to perception, motor, and memory regions.

Explore related Read →

Can language models learn meaning from text patterns alone?

Explores whether training on form alone—predicting the next word from prior words—could ever give language models access to communicative intent and genuine semantic understanding.

Explore related Read →

Can language models adapt implicature to conversational context?

Do large language models flexibly modulate scalar implicatures based on information structure, face-threatening situations, and explicit instructions—as humans do? This tests whether pragmatic computation is truly context-sensitive or merely literal.

Explore related Read →

Does semantic grounding in language models come in degrees?

Rather than asking whether LLMs truly understand meaning, this explores whether grounding is actually a multi-dimensional spectrum. The question matters because it reframes the sterile understand/don't-understand debate into measurable, distinct capacities.

Explore related Read →

Should we call LLM errors hallucinations or fabrications?

Does the language we use to describe LLM failures shape the technical solutions we build? Examining whether perceptual and psychological frameworks misdiagnose what's actually happening.

Explore related Read →

Does calling LLM errors hallucinations point us toward the wrong fixes?

Explores whether the metaphor of 'hallucination' for LLM errors misdirects our efforts. The terminology we choose shapes which interventions we prioritize and how we conceptualize the underlying problem.

Explore related Read →

Can language models actually analyze language structure?

Explores whether LLMs can move beyond pattern matching to perform genuine metalinguistic analysis like syntactic tree construction and phonological reasoning, and what enables this capability.

Explore related Read →

Can large language models develop genuine world models without direct environmental contact?

Do LLMs extract meaningful world structures from human-generated text despite lacking direct sensory access to reality? This matters for understanding what kind of grounding and knowledge these systems actually possess.

Explore related Read →

Can language models recognize when text is deliberately ambiguous?

Explores whether LLMs can identify and handle multiple valid interpretations in a single phrase—a core human language skill that appears largely absent in current models despite their fluency on standard tasks.

Explore related Read →

Do language models actually build shared understanding in conversation?

When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.

Explore related Read →

Why do language models fail at communicative optimization?

LLMs excel at learning surface statistical patterns from text but struggle with deeper principles of how language achieves efficient communication. What distinguishes these two types of linguistic knowledge?

Explore related Read →

Do standard NLP benchmarks hide LLM ambiguity failures?

When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do?

Explore related Read →

Why do readers interpret the same sentence so differently?

How much of annotation disagreement in NLP reflects genuine interpretive multiplicity rather than error? This explores whether social position and moral framing systematically generate competing but equally valid readings.

Explore related Read →

Why do language models skip the calibration step?

Current LLMs assume shared understanding rather than building it through dialogue. This explores why that design choice persists and what breaks when it fails.

Explore related Read →

Why do language models sound fluent without grounding?

Explores whether LLM fluency masks the absence of communicative work—the clarifying questions, acknowledgments, and understanding checks that humans perform. Why does skipping these acts make models sound more confident?

Explore related Read →

Discourse Analysis

22 notes

Do classical knowledge definitions apply to AI systems?

Classical definitions of knowledge assume truth-correspondence and a human knower. Do these assumptions hold for LLMs and distributed neural knowledge systems, or do they need fundamental revision?

Explore related Read →

Does AI-generated text lose core properties of human writing?

Can artificial text preserve the fundamental structural features that make natural language meaningful—dialogic exchange, embedded context, authentic authorship, and worldly grounding? This asks whether AI disruption is fixable or inherent.

Explore related Read →

Why do LLMs handle causal reasoning better than temporal reasoning?

Exploring whether language models perform asymmetrically on different discourse relations and what training data patterns might explain the gap between causal and temporal reasoning abilities.

Explore related Read →

Does ChatGPT organize text differently than human writers?

This explores how ChatGPT relies on backward-pointing references while human academic writers use forward-pointing structure. Understanding this difference reveals different assumptions about how readers process argument.

Explore related Read →

How do readers track segments, purposes, and salience together?

Can discourse processing actually happen in parallel rather than sequentially? This matters because understanding how readers coordinate multiple layers of meaning at once reveals where AI systems break down in comprehension.

Explore related Read →

What three layers must discourse systems actually track?

Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail.

Explore related Read →

How can AI text disrupt structure yet feel normal to readers?

AI-generated text produces the same social effects as human writing despite lacking foundational properties like dialogic symmetry and embodied authorship. Why doesn't this structural gap become visible to readers encountering the text?

Explore related Read →

Can we measure how deeply models represent political ideology?

This research explores whether LLMs vary not just in political stance but in the internal richness of their political representation. Understanding this distinction could reveal how deeply models have internalized ideological concepts versus merely parroting positions.

Explore related Read →

Do language models actually use their encoded knowledge?

Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.

Explore related Read →

Why do ChatGPT essays lack evaluative depth despite grammatical strength?

ChatGPT writes grammatically coherent academic prose but uses fewer evaluative and evidential nouns than student writers. The question explores whether this rhetorical gap—favoring description over argument—reflects a fundamental limitation in how LLMs approach academic writing.

Explore related Read →

Why do language models ignore information in their context?

Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.

Explore related Read →

Why does ChatGPT fail at implicit discourse relations?

ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?

Explore related Read →

Does LLM grammatical performance decline with structural complexity?

This explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding the relationship matters for deciding when LLM annotations are reliable.

Explore related Read →

Can LLMs generate more novel ideas than human experts?

Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction?

Explore related Read →

Can human judges detect measurable differences in AI text?

Research shows LLM text differs statistically across six lexical dimensions, but human readers—even experts—cannot reliably identify which texts are AI-generated. Why does measurement succeed where human perception fails?

Explore related Read →

Does AI text affect readers the same way human text does?

If text is a condition of social processes rather than merely a container, does the origin of text matter to its effects? This explores whether AI-generated content enters the same interpretive and epistemic circuits as human writing.

Explore related Read →

Can humans detect AI text if machines can measure it?

AI-generated text shows measurable differences from human writing across multiple linguistic dimensions, yet human judges consistently fail to identify it. Why does the gap between what is measurable and what is perceptible exist?

Explore related Read →

Do language models generate more novel research ideas than experts?

Explores whether LLMs can break free from expert constraints to generate more novel research concepts. Matters because novelty is often thought to be AI's creative blind spot.

Explore related Read →

Why do large language models fail at complex linguistic tasks?

Explores whether LLMs have inherent limitations in detecting fine-grained syntactic structures, especially embedded clauses and recursive patterns, and whether these failures are systematic rather than random.

Explore related Read →

Can models pass tests while missing the actual grammar?

Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.

Explore related Read →

Why do newer AI models diverge further from human writing patterns?

As language models improve, they seem to generate text that is measurably less human-like in lexical patterns, yet humans struggle to detect this difference. What drives this divergence, and what does it reveal about how models optimize for quality?

Explore related Read →

Why does AI writing sound generic despite being grammatically correct?

Explores whether the robotic quality of AI text stems from grammatical failures or rhetorical ones. Understanding this distinction matters for diagnosing what AI systems actually struggle with in human-like writing.

Explore related Read →

Natural Language Inference

18 notes

Does ordering training data by rarity actually improve language models?

Can sorting rare sentences before common ones during fine-tuning help LLMs learn more effectively? This challenges the intuition that models should see easy examples first.

Explore related Read →

Does fine-tuning on NLI teach inference or amplify shortcuts?

When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.

Explore related Read →

Does word frequency correlate with semantic abstraction?

Explores whether LLMs' preference for high-frequency language also pulls them toward more abstract, general meanings—and whether this shapes how they handle expert knowledge.

Explore related Read →

Do language models really understand meaning or just surface frequency?

Explores whether LLMs comprehend semantic meaning independently of textual frequency, or whether high-frequency paraphrases systematically outperform rare ones even when meaning is identical across math, translation, and reasoning tasks.

Explore related Read →

Does high-frequency text homogenize user input before generation?

Does Adam's Law reveal how LLMs flatten distinctive user voices at the parsing stage, not just in output? This matters because it could explain why model accuracy and generic responses emerge from the same mechanism.

Explore related Read →

Do LLMs predict entailment based on what they memorized?

Explores whether language models make entailment decisions by recognizing memorized facts about the hypothesis rather than reasoning through the logical relationship between premise and hypothesis.

Explore related Read →

Why do language models avoid correcting false user claims?

Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.

Explore related Read →

Why do language models fail confidently in specialized domains?

LLMs perform poorly on clinical and biomedical inference tasks while remaining overconfident in their wrong answers. Do standard benchmarks hide this fragility, and can prompting techniques fix it?

Explore related Read →

Can large language models translate natural language to logic faithfully?

This explores whether LLMs can convert natural language statements into formal logical representations without losing meaning. It matters because faithful translation is essential for any AI system that reasons formally or verifies specifications.

Explore related Read →

Why do language models accept false assumptions they know are wrong?

Explores why LLMs fail to reject false presuppositions embedded in questions even when they possess correct knowledge about the topic. This matters because it reveals a grounding failure distinct from knowledge deficits.

Explore related Read →

Why do LLMs fail at simple deductive reasoning?

LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?

Explore related Read →

Why do language models struggle with questions containing false assumptions?

Do LLMs reliably detect and reject questions built on false premises? The (QA)2 benchmark tests this directly, measuring whether models can identify problematic assumptions embedded in naturally plausible questions.

Explore related Read →

Why do semantically identical prompts produce different LLM outputs?

Explores why paraphrases with the same meaning yield different model outputs. This matters because it reveals what LLMs actually respond to during inference—and whether prompt engineering is optimizing meaning or something else.

Explore related Read →

Why do embedding contexts confuse LLM entailment predictions?

Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs.

Explore related Read →

Why are presuppositions more persuasive than direct assertions?

Explores why presenting information as shared background rather than as a claim makes it more persuasive to audiences. This matters because it reveals how language structure itself can bypass critical evaluation.

Explore related Read →

Do language models miss presuppositions that arise from context?

Presuppositions come from two sources: fixed word meanings and conversational dynamics. Can LLMs that learn trigger patterns detect presuppositions that emerge from discourse accommodation rather than lexical items?

Explore related Read →

Does projection strength vary by context or by word type?

Standard accounts treat presupposition projection as categorical, but do English expressions actually project uniformly? This question explores whether context and discourse role determine how strongly content survives embedding.

Explore related Read →

Do language models and humans respond to word frequency the same way?

Both LLMs and humans show stronger responses to high-frequency words. This raises a puzzle: if models mirror human neural patterns, what actually makes them different from human language processing?

Explore related Read →