Where exactly do language models fail at structural language tasks?

How language models handle linguistic structure, discourse coherence, and grounding—and where they systematically fail.

Topic Hub · 43 linked notes · 10 sections
Discourse Structure Theory

6 notes

What three layers must discourse systems actually track?

Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail.
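A minimal sketch of what tracking all three layers as explicit state might look like, assuming simple illustrative representations; the names DiscourseSegment, FocusSpace, and DiscourseState are hypothetical, not taken from Grosz and Sidner or any cited system:

```python
# Hypothetical sketch of the three discourse layers as explicit state.
# All class and method names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class DiscourseSegment:
    utterances: list[str]       # linguistic structure: the segment's utterances
    purpose: str                # intentional structure: the segment's purpose

@dataclass
class FocusSpace:
    segment: DiscourseSegment
    salient_entities: set[str]  # attentional state: objects currently in focus

@dataclass
class DiscourseState:
    segments: list[DiscourseSegment] = field(default_factory=list)
    focus_stack: list[FocusSpace] = field(default_factory=list)

    def push_segment(self, seg: DiscourseSegment, entities: set[str]) -> None:
        """Opening a segment records its purpose and makes its entities salient."""
        self.segments.append(seg)
        self.focus_stack.append(FocusSpace(seg, entities))

    def pop_segment(self) -> None:
        """Closing a segment removes its focus space; its entities lose salience."""
        self.focus_stack.pop()

    def salient(self) -> set[str]:
        """Entities currently available for reference, across open focus spaces."""
        entities: set[str] = set()
        for fs in self.focus_stack:
            entities |= fs.salient_entities
        return entities
```

The point of the sketch is only that the three layers are distinct kinds of state: dropping any one of them loses information the other two cannot reconstruct.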

How do readers track segments, purposes, and salience together?

Can discourse processing actually happen in parallel rather than sequentially? This matters because understanding how readers coordinate multiple layers of meaning at once reveals where AI systems break down in comprehension.

Can tracking dialogue dimensions simultaneously reveal hidden conversation patterns?

Does encoding linguistic complexity, emotion, topics, and relevance as parallel temporal streams expose emergent patterns that traditional statistical analysis misses? This matters because conversation success may depend on interactions between dimensions, not individual features alone.

What semantic failures break dialogue coherence most realistically?

Can we distinguish different types of incoherence by manipulating semantic structure rather than surface text? This matters because text-level evaluations miss the semantic failures that actually occur in dialogue systems.

What six problems must every conversation solve?

Schegloff's Conversation Analysis identifies six universal organizational challenges that speakers navigate in all talk-in-interaction. Understanding these helps explain why current AI dialogue systems fall short of human fluency.

Why do dialogue systems lose context when topics return?

Stack-based dialogue management removes topics after they're resolved, making it hard for systems to reference them later. Does this structural rigidity explain why conversational AI struggles with topic revisitation?
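A toy sketch of that rigidity, assuming a bare topic stack; all names are illustrative, not drawn from any particular dialogue system. Once a topic is resolved and popped, a later mention has nothing left to attach to.

```python
# Toy illustration of stack-based dialogue management; names are hypothetical.
class StackDialogueManager:
    def __init__(self) -> None:
        self.topic_stack: list[str] = []

    def open_topic(self, topic: str) -> None:
        self.topic_stack.append(topic)

    def resolve_topic(self) -> str:
        # Resolving a topic pops it; the manager keeps no record of it.
        return self.topic_stack.pop()

    def current_topic(self) -> str | None:
        return self.topic_stack[-1] if self.topic_stack else None

dm = StackDialogueManager()
dm.open_topic("booking a flight")
dm.open_topic("seat preference")
dm.resolve_topic()              # seat preference handled and discarded
dm.resolve_topic()              # flight booked and discarded
# User: "Actually, about that seat preference..."
print(dm.current_topic())       # None: the earlier topic can no longer be referenced
```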

LLM Linguistic Limitations

3 notes

Why do large language models fail at complex linguistic tasks?

Explores whether LLMs have inherent limitations in detecting fine-grained syntactic structures, especially embedded clauses and recursive patterns, and whether these failures are systematic rather than random.

Does LLM grammatical performance decline with structural complexity?

This explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding the relationship matters for deciding when LLM annotations are reliable.

Can models pass tests while missing the actual grammar?

Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.

Communicative Grounding

7 notes

Why do speakers need to actively calibrate shared reference?

Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.

Why do language models skip the calibration step?

Current LLMs assume shared understanding rather than building it through dialogue. This explores why that design choice persists and what breaks when the assumption fails.

Do language models actually build shared understanding in conversation?

When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.

Does preference optimization damage conversational grounding in large language models?

Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.

Why don't conversational AI systems mirror their users' word choices?

Explores whether current dialogue models exhibit lexical entrainment—the human tendency to align vocabulary with conversation partners—and what's needed to bridge this gap in AI communication.

Why do language models sound fluent without grounding?

Explores whether LLM fluency masks the absence of communicative work—the clarifying questions, acknowledgments, and understanding checks that humans perform. Why does skipping these acts make models sound more confident?

Why do clarification requests look different at each communication level?

Explores whether clarifications are unified speech acts or distinct mechanisms grounded in different modalities. This matters because dialogue systems treat clarifications uniformly and miss most of them.

Discourse Relation Asymmetry

3 notes

Why does ChatGPT fail at implicit discourse relations?

ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?

Why do LLMs handle causal reasoning better than temporal reasoning?

Exploring whether language models perform asymmetrically on different discourse relations and what training data patterns might explain the gap between causal and temporal reasoning abilities.

Why do discourse patterns predict anxiety better than single words?

Explores whether anxiety detection requires understanding how statements relate to each other rather than analyzing individual words. This matters because it reveals what computational methods need to capture cognitive distortions.

LLM Text Quality vs Human Writing

4 notes

Why do ChatGPT essays lack evaluative depth despite grammatical strength?

ChatGPT writes grammatically coherent academic prose but uses fewer evaluative and evidential nouns than student writers. The question explores whether this rhetorical gap—favoring description over argument—reflects a fundamental limitation in how LLMs approach academic writing.

Does ChatGPT organize text differently than human writers?

This explores how ChatGPT relies on backward-pointing references while human academic writers use forward-pointing structure. Understanding this difference reveals different assumptions about how readers process argument.

Can we measure reading efficiency as a quality metric?

How can we quantify whether generated text delivers novel information efficiently or wastes reader attention through redundancy? This matters because standard coherence and fluency scores miss texts that are well written but informationally redundant.

Why does AI writing sound generic despite being grammatically correct?

Explores whether the robotic quality of AI text stems from grammatical failures or rhetorical ones. Understanding this distinction matters for diagnosing what AI systems actually struggle with in human-like writing.

Human-Scale Modeling and Pre-Pretraining

5 notes

Can language models learn grammar from child-scale data?

If models trained on ~100 million words—roughly what children experience—can match human syntactic performance, what does that tell us about what data volume is actually necessary for learning grammar?

Can models pass tests while missing the actual grammar?

Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.

Can formal language pretraining make language models more efficient?

Does training language models on hierarchical formal languages before natural language improve how efficiently they learn syntax? This explores whether structural inductive biases in training data matter more than raw data volume.

What formal languages actually help transformers learn natural language?

Not all formal languages are equally useful for pre-pretraining. This explores which formal languages transfer well to natural language and why—combining structural requirements with what transformers can actually learn.

Can language models actually analyze language structure?

Explores whether LLMs can move beyond pattern matching to perform genuine metalinguistic analysis like syntactic tree construction and phonological reasoning, and what enables this capability.

Literary Language Analysis

5 notes

Can LLMs truly understand literary meaning or just mechanics?

LLMs excel at extracting metaphors, detecting style, and analyzing structure. But can they access the deeper meaning that emerges through implication, ambiguity, and evaluative judgment—the dimensions where literature actually lives?

Can one model handle all types of figurative language?

Does treating metaphor, idioms, and irony as a single pragmatic reasoning task—rather than separate classification problems—offer a more unified and effective approach to figurative language understanding in LLMs?

Do language models overestimate how often irony appears?

This explores whether LLMs systematically misread ironic intent in text, assigning higher irony scores than humans do. The gap suggests models learn irony patterns from training data without understanding their actual frequency in real communication.

Can language models truly understand literary style?

LLMs detect stylistic patterns with high accuracy, but can they grasp why those patterns matter? This explores the gap between surface-level pattern recognition and meaningful interpretation.

Where does LLM metaphor comprehension actually break down?

Literary metaphors range from conventional (dead metaphors) to novel conceptual mappings. This research asks whether LLMs fail predictably as metaphors become more abstract and creative, and what that tells us about their semantic reasoning limits.

Model Imitation and Text Fidelity

1 note

Lexical Diversity and Detection

3 notes

Can human judges detect AI writing through lexical patterns?

While AI text shows measurable differences from human writing across six lexical dimensions, judges—including experts—fail to identify AI authorship reliably. Why does perceived quality diverge from measurable reality?

Why do newer AI models diverge further from human writing patterns?

As language models improve, they seem to generate text that is measurably less human-like in lexical patterns, yet humans struggle to detect this difference. What drives this divergence, and what does it reveal about how models optimize for quality?

Can humans detect AI writing if it looks natural?

Despite measurable differences in how AI generates text, human judges—even experts—consistently fail to identify it. This explores why perception lags behind measurement.
