Can language models learn grammar from child-scale data?
If models trained on ~100 million words (roughly what a child experiences) can match human syntactic performance, what does that tell us about how much data is actually necessary for learning grammar?
The BabyLM Challenge trained language models on at most 100 million words (roughly the amount of language a child experiences by age 13) and compared them to human performance on linguistic tasks. The result: the best-performing models were "a few percentage points shy of human performance" on grammatical acceptability, and showed sensitivity to syntactic constraints comparable to models several orders of magnitude larger.
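How the acceptability comparison is typically scored: benchmarks in the BLiMP family, which the BabyLM evaluation pipeline uses, present minimal pairs and credit the model when it assigns higher probability to the grammatical sentence. Below is a minimal sketch of that scoring, assuming a Hugging Face causal LM; the model name (gpt2) and the example pair are illustrative placeholders, not details from the source.

```python
# Minimal-pair acceptability scoring: a model "prefers" whichever
# sentence it assigns higher total log-probability. Benchmark accuracy
# is the fraction of pairs where the grammatical sentence wins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability of the sentence under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL per predicted token
    return -loss.item() * (ids.shape[1] - 1)  # undo the mean over predictions

# A subject-verb agreement minimal pair, in the style of BLiMP items.
grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."

print(sentence_logprob(grammatical) > sentence_logprob(ungrammatical))
```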
This challenges the strong version of the scaling hypothesis for syntax: that grammatical competence is primarily a function of data volume. If human-scale models can approach human-level syntactic performance, then syntax is learnable from human-scale exposure — which is not surprising given that children do exactly this, but which has implications for what massive-scale LLM training is actually buying.
Two important qualifications: (1) the BabyLM corpus was designed to approximate child language input, with ~56% transcribed or scripted speech; composition mattered as well as volume. (2) The finding is for syntactic tasks, not for the full range of knowledge and reasoning capabilities where scale clearly helps.
The Lil-Bevo entry from BabyLM reinforces the composition-over-volume principle: the best-performing models used curated data mixtures emphasizing child-directed speech quality, with careful attention to the distribution of linguistic structures in the training data. Data quality and curation strategy drove performance gains, not scale. This aligns with Can we train better models on less data?, which shows that selecting the right 5% of data outperforms training on everything — the same quality-over-quantity principle operating at a different level of the training stack.
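To make the composition-over-volume idea concrete, here is a rough sketch of a composition-first sampling budget: fix the total word count and fill each source's share of it, rather than taking whatever data is most plentiful. The source names and weights are hypothetical (only the roughly 56% speech share echoes the corpus description above), and this is a sketch under those assumptions, not the Lil-Bevo pipeline.

```python
import random

WORD_BUDGET = 100_000_000  # total word budget, as in the 100M-word track

# Hypothetical source weights; the speech-like sources sum to ~0.56,
# echoing the corpus composition described above.
MIXTURE = {
    "child_directed_speech": 0.40,
    "scripted_dialogue":     0.16,
    "childrens_books":       0.24,
    "simple_wikipedia":      0.20,
}

def sample_corpus(docs_by_source, mixture=MIXTURE, word_budget=WORD_BUDGET, seed=0):
    """Fill each source's share of the word budget so the final corpus
    matches the target composition, not the raw availability of data."""
    rng = random.Random(seed)
    selected = []
    for source, weight in mixture.items():
        target_words = int(weight * word_budget)
        used = 0
        docs = list(docs_by_source.get(source, []))
        rng.shuffle(docs)
        for doc in docs:
            if used >= target_words:
                break
            selected.append(doc)
            used += len(doc.split())
    return selected
```

Sampling by target share means a small, high-quality source such as transcribed child-directed speech is never crowded out by a larger written corpus.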
The more precise claim: for linguistic generalization (grammar, syntax, structural rules), human-scale data is sufficient. For knowledge acquisition, reasoning, and factual coverage, it is not. This suggests these are different learning regimes — not just more or less of the same thing.
Implication for AI alignment and interpretability: the structural linguistic behavior of LLMs may be achievable in smaller, more interpretable models. The capabilities that genuinely require scale may be those that are hardest to study in smaller models.
Source: Discourses; enriched from Training Fine Tuning
Related concepts in this collection
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  The complication: approaching human performance doesn't mean acquiring the same grammar.
- Why do large language models fail at complex linguistic tasks?
  Explores whether LLMs have inherent limitations in detecting fine-grained syntactic structures, especially embedded clauses and recursive patterns, and whether these failures are systematic rather than random.
  The tension: even large-scale models have structural blind spots.
- Can non-reasoning models catch up with more compute?
  Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
  Contrast case: reasoning capability requires specialized training regardless of compute, while syntactic competence does not; together these show that training-regime requirements are capability-specific.
Original note title: human-scale language models can approach human syntactic performance with minimal data