Linguistic Blind Spots of Large Language Models
Despite the remarkable success of large language models (LLMs) across a wide range of NLP tasks, questions linger regarding their ability to perform fine-grained linguistic annotation tasks, such as detecting nouns or verbs, or identifying more complex syntactic structures like clauses in input texts. These tasks require precise syntactic and semantic understanding of input text, and when LLMs underperform on specific linguistic structures, it raises concerns about their reliability for detailed linguistic analysis and about whether their (even correct) outputs truly reflect an understanding of the inputs. In this paper, we empirically study the performance of recent LLMs on fine-grained linguistic annotation tasks. Through a series of experiments, we find that recent LLMs show limited efficacy in addressing linguistic queries and often struggle with linguistically complex inputs. We show that even the most capable LLM in our study (Llama3-70b) makes notable errors in detecting linguistic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses.
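To make this evaluation setup concrete, the sketch below shows one way an annotation query of this kind can be scored against gold spans. The sentence, gold clauses, and `query_llm` stand-in are illustrative assumptions, not the paper's actual data, prompts, or models.

```python
# Minimal sketch of span-level scoring for a linguistic annotation task
# (e.g., clause detection). The sentence, gold spans, and query_llm are
# illustrative stand-ins, not the actual evaluation data or models.

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (e.g., to Llama3-70b).
    Assumed to return one predicted span per line."""
    return "because the rain stopped\nwe went outside"

def parse_spans(raw: str) -> set[str]:
    """Normalize the model's line-separated spans for comparison."""
    return {line.strip().lower() for line in raw.splitlines() if line.strip()}

def span_f1(gold: set[str], pred: set[str]) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over annotation spans."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

sentence = "Because the rain stopped, we went outside."
gold = {"because the rain stopped", "we went outside"}  # illustrative gold clauses

pred = parse_spans(query_llm(f"List every clause in: {sentence}"))
print(span_f1(gold, pred))  # (1.0, 1.0, 1.0) on this toy example
```

Exact-match scoring of this form penalizes the error types noted above (e.g., a misidentified embedded clause yields both a false positive and a false negative).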
We investigate the following research questions: (1) How accurately can recent LLMs detect complex linguistic structures in input text? (2) Which linguistic structures represent the blind spots of recent LLMs, i.e., are the most challenging for them? (3) How does the performance of LLMs vary across different levels of linguistic complexity in the inputs? We answer these questions through an empirical study of recent LLMs. The contributions of this paper lie in examining recent LLMs' ability to detect specific linguistic structures across varying levels of linguistic complexity.
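As a hedged illustration of how the third research question can be analyzed, the sketch below buckets per-example annotation scores by a simple complexity proxy (number of clauses per input). The records are invented placeholders, not results from our experiments.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-example records: (number of clauses in the input, F1 of
# the LLM's annotation). Values are illustrative placeholders only.
records = [(1, 0.95), (1, 0.90), (2, 0.80), (2, 0.75), (3, 0.55), (3, 0.60)]

by_complexity = defaultdict(list)
for n_clauses, f1 in records:
    by_complexity[n_clauses].append(f1)

# The mean score per complexity bucket shows whether performance degrades
# as inputs become more linguistically complex.
for n_clauses in sorted(by_complexity):
    print(f"{n_clauses} clause(s): mean F1 = {mean(by_complexity[n_clauses]):.2f}")
```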