Does LLM grammatical performance decline with structural complexity?
This note explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding this relationship matters for deciding when LLM annotations can be trusted.
The finding from the LLM linguistic blind spots study is not simply "LLMs are bad at grammar." It is more precise: performance degrades as a function of structural complexity. Simple cases (single-clause sentences, surface noun identification) may be handled well. Complex cases (embedded clauses, recursive structures, complex nominals that look like clauses) fail systematically.
This is a useful calibration for practitioners because it makes failures predictable. You can audit task complexity before deciding whether to trust LLM annotation output. If the task involves syntactically simple inputs with explicit structural markers, LLM performance may be acceptable. If inputs contain embedded clauses, recursive modification, or other depth-increasing structures, expect systematic errors.
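One way to operationalize that audit is to estimate dependency-tree depth before sending inputs for annotation. The sketch below is a minimal illustration, assuming spaCy with its en_core_web_sm model installed; the depth threshold of 5 is an illustrative cutoff, not a value from the study.

```python
# Rough complexity audit: estimate dependency-tree depth before
# trusting LLM annotation output. Assumes spaCy with the
# en_core_web_sm model installed; the threshold is an illustrative
# cutoff, not a value from the study.
import spacy

nlp = spacy.load("en_core_web_sm")

def tree_depth(token) -> int:
    """Depth of the dependency subtree rooted at `token`."""
    children = list(token.children)
    if not children:
        return 1
    return 1 + max(tree_depth(child) for child in children)

def looks_risky(sentence: str, max_depth: int = 5) -> bool:
    """Flag inputs whose structural depth predicts systematic
    LLM errors (embedded clauses, recursive modification)."""
    doc = nlp(sentence)
    return any(tree_depth(sent.root) > max_depth for sent in doc.sents)

print(looks_risky("The cat sat on the mat."))  # expect False
print(looks_risky("The claim that the report the committee "
                  "commissioned was flawed surprised no one."))  # expect True
```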
The inverse correlation between structural complexity and performance also has theoretical significance: it suggests that what LLMs learned from training data is more like a frequency-weighted surface heuristic than a recursive structural grammar. Complex structures are rare in training corpora, so the heuristics generalize poorly to them. The model can get the easy cases right without having internalized the underlying rule.
The practical design implication: for any application where structural correctness matters, build complexity-stratified evaluation sets. Testing only on typical (simple) inputs overestimates competence. The failure mode is in the structural tail.
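A minimal sketch of what complexity stratification can look like in an evaluation harness; the two-bucket split, the depth cutoff, and the `predict`/`depth_of` callables are illustrative assumptions, not the study's protocol.

```python
# Complexity-stratified evaluation: report accuracy per structural
# bucket instead of one aggregate number. The two-bucket split and
# the cutoff are illustrative; `predict` and `depth_of` stand in
# for your annotation pipeline and complexity measure.
from collections import defaultdict

def stratified_accuracy(examples, predict, depth_of, cutoff=4):
    """examples: iterable of (text, gold_label) pairs.
    predict: text -> predicted label. depth_of: text -> int."""
    tallies = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for text, gold in examples:
        bucket = "simple" if depth_of(text) <= cutoff else "complex"
        tallies[bucket][0] += int(predict(text) == gold)
        tallies[bucket][1] += 1
    return {bucket: correct / total
            for bucket, (correct, total) in tallies.items()}

# An aggregate score over mostly-simple inputs hides the structural
# tail; the per-bucket numbers expose the predicted degradation.
```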
Entailment reasoning extends this pattern to a new domain. The note "Why do embedding contexts confuse LLM entailment predictions?" identifies a specific structural complexity type: when premises are embedded under presupposition triggers (factive verbs, temporal clauses) or non-factive verbs, LLMs cannot discriminate the opposing effects these contexts should produce. The structural packaging overwhelms the semantic content. This is a direct instantiation of the complexity-degradation pattern: embedding contexts add structural depth, and LLMs respond to the embedding verb as a surface cue rather than computing its effect on the embedded content's entailment relations.
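To make the factive/non-factive contrast concrete, the sketch below constructs minimal premise-hypothesis pairs with the entailment labels standard semantics assigns; the verb lists and template are illustrative, not the study's materials.

```python
# Minimal pairs for embedding contexts. A factive verb ("knows")
# presupposes its complement, so the complement is entailed; a
# non-factive verb ("believes") leaves it undetermined. Verb lists
# and the template are illustrative, not the study's materials.
FACTIVE = ["knows", "regrets", "realizes"]
NON_FACTIVE = ["believes", "claims", "suspects"]

def make_pairs(complement="the results were falsified"):
    pairs = []
    for verb in FACTIVE:
        # "X knows that P" entails P.
        pairs.append((f"The editor {verb} that {complement}.",
                      complement, "entailment"))
    for verb in NON_FACTIVE:
        # "X believes that P" does not commit the speaker to P.
        pairs.append((f"The editor {verb} that {complement}.",
                      complement, "neutral"))
    return pairs

# A model driven by surface cues will treat both verb classes alike;
# these pairs test whether it computes the verb's semantic effect.
for premise, hypothesis, label in make_pairs():
    print(f"{label:10s} | {premise} -> {hypothesis}")
```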
Source: Discourses
Related concepts in this collection
- Why do large language models fail at complex linguistic tasks? Explores whether LLMs have inherent limitations in detecting fine-grained syntactic structures, especially embedded clauses and recursive patterns, and whether these failures are systematic rather than random. (Relation: the broader finding this note belongs to.)
- Can models pass tests while missing the actual grammar? Asks whether language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules; correct outputs may hide reliance on shallow heuristics that fail on novel structures. (Relation: the BabyLM parallel, where surface heuristics pass easy tests while deeper rules are absent.)
- Why do embedding contexts confuse LLM entailment predictions? Asks whether language models can distinguish contexts that preserve entailments from those that cancel them; the study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs. (Relation: embedding contexts as a specific structural complexity type in entailment; surface-cue response substitutes for semantic computation.)
Original note title
LLM grammatical competence degrades predictably as input structural complexity increases