Which linguistic abilities are learnable from human-sized data exposure?

This note explores which language abilities models can pick up from a child-sized diet of data (roughly the 100 million words a human encounters growing up) rather than from internet-scale training.

The corpus has a surprisingly sharp answer for one ability in particular: grammar. Models trained on 100 million words or fewer land within a few percentage points of human performance on grammatical acceptability judgments, which suggests that core syntactic competence, knowing what sounds well-formed, doesn't need oceans of text (Can language models learn grammar from child-scale data?). The interesting twist is that *what* goes into the data mattered more than how much of it there is: composition and curation beat raw volume. So the honest framing isn't "big data teaches grammar" but "a well-chosen small corpus is enough."
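To make "within a few points of human performance on acceptability judgments" concrete: one common way to score such judgments is with minimal pairs, where a model counts as correct when it prefers the well-formed sentence over its ill-formed twin. The source does not say which benchmark or scoring rule was used, so the sketch below is an assumption. The `minimal_pair_accuracy` helper, the example pairs, and the `TOY_SCORES` table are all hypothetical stand-ins for a real model's sentence log-probabilities.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs in which the acceptable
    sentence receives the strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical scores standing in for a model's sentence log-probabilities.
# These numbers are invented for illustration and are not measured results.
TOY_SCORES = {
    "The cats sleep.": -4.0,
    "The cats sleeps.": -7.5,
    "She has eaten.": -3.2,
    "She has ate.": -3.0,  # the toy scorer gets this pair wrong
}

pairs = [
    ("The cats sleep.", "The cats sleeps."),
    ("She has eaten.", "She has ate."),
]

accuracy = minimal_pair_accuracy(pairs, TOY_SCORES.__getitem__)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

Under this kind of metric, chance is 0.5, so a model that is "a few points" below a human baseline is being compared on the same pairwise preference task.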

Architecture turns out to be part of the same story. At small scale, the shape of the model changes what it can squeeze from limited data: in the sub-billion-parameter range, deep-and-thin networks outperform more balanced designs (MobileLLM reports 2.7–4.3% accuracy gains at the 125M–350M scale) because stacking layers lets the model compose abstract structure rather than just memorize more surface patterns (Does depth matter more than width for tiny language models?). Read alongside the syntax result, the lesson is that human-scale learnability is as much about the learner's design as about the volume of exposure.
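The depth-versus-width trade-off above is a choice about how to spend a fixed parameter budget. As a rough sketch, the snippet below uses the usual back-of-the-envelope estimate of about 12·d² non-embedding parameters per transformer layer (4·d² for the attention projections plus 8·d² for a feed-forward block with 4x expansion). That estimate, the `approx_block_params` helper, and both configurations are illustrative assumptions; they are not MobileLLM's actual layer counts or widths.

```python
def approx_block_params(n_layers: int, d_model: int) -> int:
    """Rough non-embedding parameter count for a standard transformer stack.

    Assumes 4*d^2 for attention (Q, K, V, output projections) and 8*d^2 for
    a feed-forward block with 4x expansion. Biases, layer norms, and
    embeddings are ignored.
    """
    return n_layers * 12 * d_model ** 2


# Two hypothetical shapes that spend the same budget differently.
deep_thin = approx_block_params(n_layers=32, d_model=512)
shallow_wide = approx_block_params(n_layers=8, d_model=1024)

print(f"deep and thin:    {deep_thin:,} parameters")
print(f"shallow and wide: {shallow_wide:,} parameters")
```

Both shapes come out at roughly 100M parameters under this estimate: halving the width frees up enough budget for four times as many layers, which is the kind of trade the deep-and-thin result favors at this scale.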

But syntax is the easy case, and the corpus quietly warns you not to generalize from it. Higher abilities behave differently. Surprisingly, social and cultural knowledge seems *learnable without embodiment*: GPT-4.5 beat every individual human at judging social appropriateness across 555 scenarios, even though it never lived in a culture (Can AI learn social norms better than humans?). That cuts against the intuition that you need lived experience to absorb norms. Yet all the AI models tested share identical systematic errors on unwritten norms, hinting that something about pattern exposure tops out where the rules were never written down.

And the thing that *doesn't* come for free is the gap between knowing and doing. Models can state a concept correctly, then fail to apply it, then even recognize their own failure. That triple pattern has no human analogue and points to explanation and execution running on disconnected pathways (Can LLMs understand concepts they cannot apply?). So "learnable from human-sized data" splits cleanly: the formal machinery of language (grammar, acceptability) arrives early and cheaply; functional understanding that holds up under application does not.

If you want to push the boundary further, the corpus offers two adjacent angles. One line shows models becoming strong predictors of *human* behavior and decision-making after fine-tuning on psychology data: language ability bending toward modeling people rather than just producing sentences (Can language models learn to model human decision making?). Another reframes the whole question philosophically: from the outside, humans and LLMs are categorically different systems, but inside shared discourse they draw on the same symbolic substrate, which is why a model trained on text alone can sound so fluently human (Do humans and LLMs differ fundamentally or just superficially?). The takeaway you didn't know you wanted: the abilities that scale down to human-sized data are precisely the ones encoded in the structure of language itself, and the ones that don't are the ones that live in use.

Sources 6 notes

Can language models learn grammar from child-scale data?

Models trained on ≤100 million words performed within a few percentage points of humans on grammatical acceptability tasks, suggesting syntactic competence doesn't require massive datasets. Data composition and curation mattered more than raw volume.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can AI learn social norms better than humans?

GPT-4.5 outperformed every individual human at judging social appropriateness across 555 scenarios, challenging the theory that embodied cultural experience is necessary. However, all AI models share identical systematic errors on unwritten norms.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure: a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can language models learn to model human decision making?

LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Do humans and LLMs differ fundamentally or just superficially?

Applies Habermas's observer/participant distinction to AI: from outside, humans and LLMs are utterly different; from within shared discourse, both draw on the same symbolic substrate, making the difference structural rather than absolute.
