Bigger is not always better: The importance of human-scale language modeling for psycholinguistics


Scaling has several downsides for both computational psycholinguistics and natural language processing research. We discuss the scientific challenges presented by the scaling paradigm, as well as the benefits that would result from language models that can learn from human-scale data. In the second half of this paper, we report takeaways from a recent effort to bring about human-scale language model pretraining: the first iteration of the BabyLM Challenge, a shared task organized by the authors that invited participants to train a language model on 100 million words or fewer.
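To make the data constraint concrete, here is a minimal sketch of checking that a candidate pretraining corpus stays within the 100-million-word budget. The directory name, file layout, and whitespace-based word counting are assumptions for illustration, not part of the challenge tooling:

```python
# Minimal sketch (hypothetical helper): verify that a candidate corpus
# stays within the BabyLM budget of 100 million words, counting
# whitespace-delimited tokens in plain-text files.
from pathlib import Path

WORD_BUDGET = 100_000_000  # the challenge's 100M-word cap

def count_words(corpus_dir: str) -> int:
    total = 0
    for path in Path(corpus_dir).glob("*.txt"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                total += len(line.split())
    return total

if __name__ == "__main__":
    n = count_words("babylm_corpus")  # hypothetical directory name
    print(f"{n:,} words; within budget: {n <= WORD_BUDGET}")
```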

First, the challenge demonstrated that robust syntactic and semantic generalizations can be learned by domain-flexible (though not necessarily cognitively plausible) learning algorithms trained on a human-scale dataset; indeed, some of our best-performing models were just a few percentage points shy of human performance on grammatical acceptability tasks. Second, the challenge established a population of models that are all effective data-efficient language learners. Studying these “BabyLMs” can help us identify hypotheses for the computational mechanisms that underlie human language learning.
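To make the grammatical acceptability evaluation concrete, here is a minimal sketch of the standard minimal-pair scoring method: a model is credited when it assigns higher probability to the grammatical member of a pair. The model name and sentence pair below are illustrative assumptions, not materials from the challenge itself:

```python
# Minimal-pair acceptability scoring (BLiMP-style sketch): the model
# "passes" an item when it assigns higher total log-probability to the
# grammatical sentence than to its ungrammatical counterpart.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token,
    # so multiply by the number of predicted tokens to get the total.
    return -out.loss.item() * (ids.shape[1] - 1)

good = "The cats on the mat were sleeping."  # invented example pair
bad = "The cats on the mat was sleeping."
print(sentence_logprob(good) > sentence_logprob(bad))  # expected: True
```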

A major contribution of the BabyLM Challenge was the training dataset, which we refer to as the BabyLM Corpus. Ideally, of course, our data would exactly reproduce the input received by a child. Because such datasets are not currently available, our aim in this project was to take a step toward that ambitious goal. One compromise we made, for example, is that our corpus consisted only of written texts or transcriptions of spoken language, whereas children’s language exposure comes primarily from auditory or visual input (the latter in the case of signed languages). We reasoned that a conventional textual training corpus, despite this limitation, would attract a larger number of participants to the challenge.

The majority (≈ 56%) of the pretraining corpus was sourced from transcribed or scripted speech. This choice was made because much of the input to the typical child comes from face-to-face interaction, either through speech or sign. Transcribed speech may be particularly relevant when it comes to grammar learning, as some grammatical constructions, such as nominalizations and passives, are far more frequent in writing, while others, such as first- and second-person pronouns, are more frequent in speech (Biber, 1991).

An additional fine-tuning task we included was the Mixed Signals Generalization Set (MSGS; Warstadt et al., 2020b). For this task, models were fine-tuned on an ambiguous training set whose labels were consistent with both a “linguistic generalization” and a “surface generalization.” They were then evaluated on examples that disambiguate which generalization the model converged on (if any). Surface generalizations were based on features such as sentence length, orthography, or whether the sentence contained a particular word; linguistic generalizations were based on properties such as whether the sentence contained an irregular past-tense form or a control construction.
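To illustrate the design with invented toy examples (not actual MSGS items), consider a training set in which a surface feature and a linguistic feature always co-occur, and a test set in which they conflict:

```python
# Toy sketch of the ambiguous-training / disambiguating-test design.
# In training, a surface feature (the word "the") and a linguistic
# feature (an irregular past-tense verb) always co-occur, so either
# one predicts the label. At test time the features conflict, so the
# model's predictions reveal which generalization it learned.
train = [
    ("The dog ate quickly.", 1),   # has "the" AND an irregular past
    ("A dog walks quickly.", 0),   # has neither feature
]
test = [
    # Irregular past but no "the": predicting 1 here indicates the
    # linguistic generalization.
    "A dog ate quickly.",
    # "The" but no irregular past: predicting 1 here indicates the
    # surface generalization.
    "The dog walks quickly.",
]
print(train, test)
```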

Our fine-tuning evaluations included a subset of the tasks in GLUE and SuperGLUE (Wang et al., 2018, 2019), which consist of various NLP tasks. Most of these tasks involve fine-tuning the model to perform classification: given an input, the model must sort it into one of a small set of classes. An example of such a classification task is natural language inference (NLI), where a model is given a premise sentence and a hypothesis sentence and has to categorize the relationship between them as entailment, contradiction, or neutral. An example premise is Three tall boys are playing soccer, and a hypothesis is Some boys play sports. Other tasks used similar techniques to investigate related aspects of meaning.
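As a sketch of what such classification fine-tuning looks like in practice, the following sets up a three-way NLI classifier in the HuggingFace transformers style. The model name is an illustrative assumption, and this is not the challenge's actual evaluation pipeline; the example pair is the one from the paragraph above:

```python
# Sketch of an NLI classification setup: a pretrained encoder with a
# three-way classification head (entailment / contradiction / neutral).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # illustrative
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=3
)

premise = "Three tall boys are playing soccer."
hypothesis = "Some boys play sports."
# Premise and hypothesis are encoded together as a sentence pair.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
# Before fine-tuning, the freshly initialized head's prediction is
# arbitrary; fine-tuning on labeled pairs trains the head and encoder.
print(logits.softmax(dim=-1))
```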

The results of the challenge demonstrate that neural network learning algorithms are capable of learning linguistic generalizations, even when trained on human-scale datasets.

The most significant finding from the challenge itself is that, even at smaller data scales, current neural network architectures come very close to human-level performance on many linguistic tasks. The best-performing models from the challenge showed sensitivity to syntactic constraints on par with models several orders of magnitude larger, and were just a few percentage points shy of human-level performance on grammatical acceptability judgments. This is a significant achievement. Given the rate at which language modeling performance has improved recently, it is likely that computational models, even ones trained on human-scale datasets, will show sensitivities to some syntactic constraints that are on par with humans.