Where exactly do language models fail at structural language tasks?

How language models handle linguistic structure, discourse coherence, and grounding—and where they systematically fail.

Topic Hub · 43 linked notes · 10 sections
Discourse Structure Theory

6 notes

What three layers must discourse systems actually track?

Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail.
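A minimal sketch of what tracking all three layers as explicit state might look like, assuming simple illustrative representations; the names DiscourseSegment, FocusSpace, and DiscourseState are hypothetical, not taken from Grosz and Sidner or any cited system:

```python
# Hypothetical sketch of the three discourse layers as explicit state.
# All class and method names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class DiscourseSegment:
    utterances: list[str]       # linguistic structure: the segment's utterances
    purpose: str                # intentional structure: the segment's purpose

@dataclass
class FocusSpace:
    segment: DiscourseSegment
    salient_entities: set[str]  # attentional state: objects currently in focus

@dataclass
class DiscourseState:
    segments: list[DiscourseSegment] = field(default_factory=list)
    focus_stack: list[FocusSpace] = field(default_factory=list)

    def push_segment(self, seg: DiscourseSegment, entities: set[str]) -> None:
        """Opening a segment records its purpose and makes its entities salient."""
        self.segments.append(seg)
        self.focus_stack.append(FocusSpace(seg, entities))

    def pop_segment(self) -> None:
        """Closing a segment removes its focus space; its entities lose salience."""
        self.focus_stack.pop()

    def salient(self) -> set[str]:
        """Entities currently available for reference, across open focus spaces."""
        entities: set[str] = set()
        for fs in self.focus_stack:
            entities |= fs.salient_entities
        return entities
```

The point of the sketch is only that the three layers are distinct kinds of state: dropping any one of them loses information the other two cannot reconstruct.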

How do readers track segments, purposes, and salience together?

Can discourse processing actually happen in parallel rather than sequentially? This matters because understanding how readers coordinate multiple layers of meaning at once reveals where AI systems break down in comprehension.

Can tracking dialogue dimensions simultaneously reveal hidden conversation patterns?

Does encoding linguistic complexity, emotion, topics, and relevance as parallel temporal streams expose emergent patterns that traditional statistical analysis misses? This matters because conversation success may depend on interactions between dimensions, not individual features alone.

What semantic failures break dialogue coherence most realistically?

Can we distinguish different types of incoherence by manipulating semantic structure rather than surface text? This matters because text-level evaluations miss the semantic failures that actually occur in dialogue systems.

What six problems must every conversation solve?

Schegloff's Conversation Analysis identifies six universal organizational challenges that speakers navigate in all talk-in-interaction. Understanding these helps explain why current AI dialogue systems fall short of human fluency.

Why do dialogue systems lose context when topics return?

Stack-based dialogue management removes topics after they're resolved, making it hard for systems to reference them later. Does this structural rigidity explain why conversational AI struggles with topic revisitation?
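A toy sketch of that rigidity, assuming a bare topic stack; all names are illustrative, not drawn from any particular dialogue system. Once a topic is resolved and popped, a later mention has nothing left to attach to.

```python
# Toy illustration of stack-based dialogue management; names are hypothetical.
class StackDialogueManager:
    def __init__(self) -> None:
        self.topic_stack: list[str] = []

    def open_topic(self, topic: str) -> None:
        self.topic_stack.append(topic)

    def resolve_topic(self) -> str:
        # Resolving a topic pops it; the manager keeps no record of it.
        return self.topic_stack.pop()

    def current_topic(self) -> str | None:
        return self.topic_stack[-1] if self.topic_stack else None

dm = StackDialogueManager()
dm.open_topic("booking a flight")
dm.open_topic("seat preference")
dm.resolve_topic()              # seat preference handled and discarded
dm.resolve_topic()              # flight booked and discarded
# User: "Actually, about that seat preference..."
print(dm.current_topic())       # None: the earlier topic can no longer be referenced
```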

LLM Linguistic Limitations

3 notes

Why do large language models fail at complex linguistic tasks?

Explores whether LLMs have inherent limitations in detecting fine-grained syntactic structures, especially embedded clauses and recursive patterns, and whether these failures are systematic rather than random.

Does LLM grammatical performance decline with structural complexity?

This explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding the relationship matters for deciding when LLM annotations are reliable.

Can models pass tests while missing the actual grammar?

Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.

Communicative Grounding

7 notes

Why do speakers need to actively calibrate shared reference?

Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.

Why do language models skip the calibration step?

Current LLMs assume shared understanding rather than building it through dialogue. This explores why that design choice persists and what breaks when the assumption fails.

Do language models actually build shared understanding in conversation?

When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.

Does preference optimization damage conversational grounding in large language models?

Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.

Why don't conversational AI systems mirror their users' word choices?

Explores whether current dialogue models exhibit lexical entrainment—the human tendency to align vocabulary with conversation partners—and what's needed to bridge this gap in AI communication.

Why do language models sound fluent without grounding?

Explores whether LLM fluency masks the absence of communicative work—the clarifying questions, acknowledgments, and understanding checks that humans perform. Why does skipping these acts make models sound more confident?

Why do clarification requests look different at each communication level?

Explores whether clarifications are unified speech acts or distinct mechanisms grounded in different modalities. This matters because dialogue systems treat clarifications uniformly and miss most of them.

Discourse Relation Asymmetry

3 notes

Why does ChatGPT fail at implicit discourse relations?

ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?

Why do LLMs handle causal reasoning better than temporal reasoning?

Exploring whether language models perform asymmetrically on different discourse relations and what training data patterns might explain the gap between causal and temporal reasoning abilities.

Why do discourse patterns predict anxiety better than single words?

Explores whether anxiety detection requires understanding how statements relate to each other rather than analyzing individual words. This matters because it reveals what computational methods need to capture cognitive distortions.

LLM Text Quality vs Human Writing

4 notes

Why do ChatGPT essays lack evaluative depth despite grammatical strength?

ChatGPT writes grammatically coherent academic prose but uses fewer evaluative and evidential nouns than student writers. The question explores whether this rhetorical gap—favoring description over argument—reflects a fundamental limitation in how LLMs approach academic writing.

Does ChatGPT organize text differently than human writers?

This explores how ChatGPT relies on backward-pointing references while human academic writers use forward-pointing structure. Understanding this difference reveals different assumptions about how readers process argument.

Can we measure reading efficiency as a quality metric?

How can we quantify whether generated text delivers novel information efficiently or wastes reader attention through redundancy? This matters because standard coherence and fluency scores miss texts that are well written but informationally redundant.

Why does AI writing sound generic despite being grammatically correct?

Explores whether the robotic quality of AI text stems from grammatical failures or rhetorical ones. Understanding this distinction matters for diagnosing what AI systems actually struggle with in human-like writing.

Human-Scale Modeling and Pre-Pretraining

5 notes

Can language models learn grammar from child-scale data?

If models trained on ~100 million words—roughly what children experience—can match human syntactic performance, what does that tell us about what data volume is actually necessary for learning grammar?

Can models pass tests while missing the actual grammar?

Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.

Can formal language pretraining make language models more efficient?

Does training language models on hierarchical formal languages before natural language improve how efficiently they learn syntax? This explores whether structural inductive biases in training data matter more than raw data volume.

What formal languages actually help transformers learn natural language?

Not all formal languages are equally useful for pre-pretraining. This explores which formal languages transfer well to natural language and why—combining structural requirements with what transformers can actually learn.

Can language models actually analyze language structure?

Explores whether LLMs can move beyond pattern matching to perform genuine metalinguistic analysis like syntactic tree construction and phonological reasoning, and what enables this capability.

Literary Language Analysis

5 notes

Can LLMs truly understand literary meaning or just mechanics?

LLMs excel at extracting metaphors, detecting style, and analyzing structure. But can they access the deeper meaning that emerges through implication, ambiguity, and evaluative judgment—the dimensions where literature actually lives?

Can one model handle all types of figurative language?

Does treating metaphor, idioms, and irony as a single pragmatic reasoning task—rather than separate classification problems—offer a more unified and effective approach to figurative language understanding in LLMs?

Do language models overestimate how often irony appears?

This explores whether LLMs systematically misread ironic intent in text, assigning higher irony scores than humans do. The gap suggests models learn irony patterns from training data without understanding their actual frequency in real communication.

Can language models truly understand literary style?

LLMs detect stylistic patterns with high accuracy, but can they grasp why those patterns matter? This explores the gap between surface-level pattern recognition and meaningful interpretation.

Where does LLM metaphor comprehension actually break down?

Literary metaphors range from conventional (dead metaphors) to novel conceptual mappings. This research asks whether LLMs fail predictably as metaphors become more abstract and creative, and what that tells us about their semantic reasoning limits.

Model Imitation and Text Fidelity

1 note

Lexical Diversity and Detection

3 notes

Can human judges detect AI writing through lexical patterns?

While AI text shows measurable differences from human writing across six lexical dimensions, judges—including experts—fail to identify AI authorship reliably. Why does perceived quality diverge from measurable reality?

Why do newer AI models diverge further from human writing patterns?

As language models improve, they seem to generate text that is measurably less human-like in lexical patterns, yet humans struggle to detect this difference. What drives this divergence, and what does it reveal about how models optimize for quality?

Can humans detect AI writing if it looks natural?

Despite measurable differences in how AI generates text, human judges—even experts—consistently fail to identify it. This explores why perception lags behind measurement.
