Why do large language models fail at complex linguistic tasks?
Explores whether LLMs have inherent limitations in detecting fine-grained syntactic structures, especially embedded clauses and recursive patterns, and whether these failures are systematic rather than random.
LLMs demonstrate "limited efficacy" on fine-grained linguistic annotation tasks, and the failures are not random: they are systematic, and they worsen as the structural complexity of the input increases.
The specific errors documented for Llama3-70b (one of the most capable models tested):
- Misidentifying embedded clauses
- Failing to recognize verb phrases
- Confusing complex nominals with clauses
The research examined three questions: (1) accuracy on complex linguistic structure detection, (2) which structures are LLM blind spots, (3) how performance varies with linguistic complexity. The answers: accuracy is notably limited, complex syntactic structures (especially embedded/recursive ones) are the consistent blind spots, and performance degrades predictably with structural depth.
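To make "structural depth" concrete, here is a minimal sketch of one way it can be operationalized: the maximum depth of a sentence's dependency parse, computed with spaCy. This metric and the example sentences are assumptions for illustration, not the measure used in the study.

```python
import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

# An assumed proxy for "structural depth": maximum depth of the dependency tree.
# Deeper trees roughly correspond to more embedded, recursive structure.

def dependency_depth(token) -> int:
    """Depth of the subtree rooted at `token`; a token with no children has depth 1."""
    children = list(token.children)
    if not children:
        return 1
    return 1 + max(dependency_depth(child) for child in children)

nlp = spacy.load("en_core_web_sm")

for text in [
    "The cat slept.",
    "The editor who hired the critic the author distrusted resigned.",
]:
    doc = nlp(text)
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    print(f"depth={dependency_depth(root)}  {text}")
```

The second sentence, with its stacked relative clauses, yields a noticeably deeper tree than the first, which is the kind of gradient along which the reported degradation runs.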
This matters because it reveals where statistical language learning diverges from grammatical competence. LLMs trained on vast corpora learn strong surface-level patterns, but the patterns do not reliably encode the deep structural rules that govern syntax. The model knows that a sentence has a verb, but cannot reliably identify the verb phrase when the structural context is complex.
The implication for LLM deployment in NLP pipelines: any application relying on fine-grained linguistic annotation — parsing, dependency analysis, argument structure detection — cannot treat LLMs as structurally reliable without auditing their performance on complex inputs. The failures are not edge cases; they are structurally determined by input complexity.
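One practical form such an audit could take is measuring annotation accuracy per complexity bucket before deployment. The sketch below is illustrative only: the `llm_annotate` function, the labels, and the toy records with precomputed parse depths are assumptions, not artifacts of the cited study.

```python
from collections import defaultdict

# Hypothetical audit harness: stratify an annotator's accuracy by structural-depth
# bucket so complexity-linked degradation becomes visible before deployment.

def llm_annotate(sentence: str) -> str:
    """Stand-in for an LLM call that labels whether a sentence contains an embedded clause."""
    return "no_embedded_clause"  # dummy prediction

def accuracy_by_depth(records, predict, bucket_size=2):
    """records: iterable of (sentence, gold_label, parse_depth) tuples.
    Returns {depth_bucket: accuracy}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for sentence, gold, depth in records:
        bucket = depth // bucket_size
        totals[bucket] += 1
        hits[bucket] += int(predict(sentence) == gold)
    return {b: hits[b] / totals[b] for b in sorted(totals)}

sample = [
    ("The cat slept.", "no_embedded_clause", 2),
    ("The claim that the report the committee rejected was flawed surprised no one.",
     "has_embedded_clause", 6),
]
print(accuracy_by_depth(sample, llm_annotate))  # e.g. {1: 1.0, 3: 0.0}
```

If accuracy holds up in shallow buckets but collapses in deep ones, that is exactly the structurally determined failure pattern described above, and the deep buckets should not be delegated to the model unaudited.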
Source: Discourses
Related concepts in this collection

- Does LLM grammatical performance decline with structural complexity?
  This explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding the relationship matters for deciding when LLM annotations are reliable. (Connection: the specific inverse relationship.)
- What three layers must discourse systems actually track?
  Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail. (Connection: the structural competence that LLMs' annotation failures suggest is missing.)
- Why does ChatGPT fail at implicit discourse relations?
  ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships? (Connection: parallel finding that LLMs rely on surface cues rather than structural understanding.)
Original note title: "llms have systematic linguistic blind spots that worsen predictably with structural complexity"