Large Linguistic Models: Investigating LLMs' metalinguistic abilities

Paper · arXiv 2305.00948 · Published May 1, 2023
Linguistics, NLP, NLU, Natural Language Inference

Abstract—The performance of large language models (LLMs) has recently improved to the point where models can perform well on many language tasks. We show here that—for the first time—the models can also generate valid metalinguistic analyses of language data. We outline a research program in which these metalinguistic abilities of LLMs are tested via prompting, as a form of behavioral interpretability. LLMs are trained primarily on text—as such, evaluating their metalinguistic abilities improves our understanding of their general capabilities and sheds new light on theoretical models in linguistics. We show that OpenAI’s (2024) o1 vastly outperforms other models on tasks involving drawing syntactic trees and phonological generalization. We speculate that OpenAI o1’s unique advantage over other models may result from the model’s chain-of-thought mechanism, which mimics the structure of human reasoning used in complex cognitive tasks, such as linguistic analysis.

In this paper, we advocate for exploring large language models’ metalinguistic competence, including their ability to construct linguistic analyses. While our present study focuses on syntax and phonology, one can test LLMs’ performance on any skill in theoretical linguistics.

Why is this line of work important? The majority of studies thus far perform behavioral tests of LLMs. Behavioral tests include tasks such as asking a model whether a sentence is (un)grammatical, or testing whether a model can correctly perform a syntactic operation such as agreement, movement, or embedding (Haider, 2023). In other words, behavioral tasks test language performance.
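To make the distinction concrete, the sketch below shows what such a behavioral (performance-level) test might look like when run through a prompting interface. It is a minimal illustration rather than the evaluation protocol used in this paper; the minimal pairs are our own examples, and query_model is a hypothetical stand-in for whatever API call returns a model's text response.

```python
# Hypothetical behavioral test: minimal pairs for subject-verb agreement,
# scored by whichever LLM backend `query_model` wraps (prompt string in, text out).

MINIMAL_PAIRS = [
    ("The keys to the cabinet are on the table.",            # grammatical
     "The keys to the cabinet is on the table."),            # ungrammatical
    ("The authors that the critic praised were delighted.",  # grammatical
     "The authors that the critic praised was delighted."),  # ungrammatical
]

PROMPT = "Is the following sentence grammatical in English? Answer Yes or No.\n\n{sentence}"

def behavioral_accuracy(query_model):
    """Fraction of minimal pairs where the model accepts the grammatical
    sentence and rejects the ungrammatical one."""
    correct = 0
    for good, bad in MINIMAL_PAIRS:
        accepts_good = query_model(PROMPT.format(sentence=good)).strip().lower().startswith("yes")
        accepts_bad = query_model(PROMPT.format(sentence=bad)).strip().lower().startswith("yes")
        correct += int(accepts_good and not accepts_bad)
    return correct / len(MINIMAL_PAIRS)
```

A test of this kind measures whether the model uses language correctly; it says nothing about whether the model can analyze language, which is the question pursued here.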

Here, we outline a research program where large language models are tested on higher-level metalinguistic abilities. The term metalinguistic has several interpretations (for a detailed discussion, see Bialystok et al., 1985). We use the term metalinguistic ability to refer to the ability to analyze language itself and to generate formal, theoretical analyses of linguistic phenomena—simply put, to refer to the work that linguists do. Metalinguistic ability is cognitively more complex than language use (Tunmer et al., 1984); it is acquired later, and linguistic competence is its precondition. Applying a linguistic formalism acquired from the training data to the model’s own linguistic competence in order to construct an analysis is a complex metacognitive task.

Linguistic formalism presents the perfect testing ground for assessing the metacognitive abilities of large language models. We argue that this new research frontier can give us deeper insight into LLMs’ general capabilities and provide a useful metric for cross-model comparison. This line of inquiry can be understood as behavioral interpretability of deep learning, where the model’s performance is evaluated through explicit metacognitive prompts rather than internal representations. Many previous studies have attempted to test whether linguistic structures are learnable from surface statistics (i.e., the relative poverty of a human learner’s input notwithstanding, given a sufficient quantity of input data, can the target grammar be acquired from statistical regularities alone?). Large language models acquire linguistic competence from the surface statistics of their training data. Our goal is to understand whether this is a sufficient basis to analyze language itself.
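The metalinguistic probes themselves can be run in the same prompting setup. The sketch below is again a hypothetical illustration rather than this paper's protocol: it asks a model for a labeled bracketing of a sentence and applies only a shallow well-formedness check, whereas a real evaluation would compare the output against a gold-standard analysis. query_model is the same hypothetical prompt-in, text-out wrapper as above.

```python
import re

# Hypothetical metalinguistic probe: prompt the model for a bracketed syntactic
# tree, then run a shallow sanity check on the returned analysis.

TREE_PROMPT = (
    "Give a labeled bracketing (Penn Treebank style) for the sentence:\n"
    '"{sentence}"\n'
    "Return only the bracketed tree."
)

def is_plausible_tree(tree: str, sentence: str) -> bool:
    """Shallow check: brackets balance and every word of the input sentence
    reappears somewhere in the model's tree."""
    depth = 0
    for ch in tree:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    if depth != 0:
        return False
    tokens = [t.lower() for t in re.findall(r"[A-Za-z']+", tree)]
    words = [w.strip(".,!?;:").lower() for w in sentence.split()]
    return all(w in tokens for w in words if w)

def probe_tree_drawing(query_model, sentence="The cat chased the mouse."):
    # `query_model` is a hypothetical callable: prompt string in, model text out.
    tree = query_model(TREE_PROMPT.format(sentence=sentence))
    return tree, is_plausible_tree(tree, sentence)
```

The contrast with the behavioral test above is the point: here the model is not asked to use the sentence but to produce a formal analysis of it, which is the kind of task the rest of the paper evaluates.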