Can models pass tests while missing the actual grammar?
Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
The BabyLM Challenge included an evaluation specifically designed to distinguish two kinds of generalization:
- Surface generalization: based on sentence length, orthography, whether the sentence contains a particular word — patterns a model could use without knowing grammar
- Linguistic generalization: based on actual grammatical structure — irregular past-tense forms, control constructions, embedded clause structure
Models were fine-tuned on an ambiguous training set where labels were consistent with either generalization, then evaluated on a test set that disambiguated which one the model converged on.
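A minimal sketch of this design, assuming hypothetical toy feature functions in place of the actual BabyLM items: labels in the ambiguous split are consistent with both a surface cue and a structural cue, while the disambiguating split pits them against each other, so performance there reveals which generalization a fine-tuned model converged on.

```python
# Sketch of the ambiguous-training / disambiguating-test design (toy features,
# not the real BabyLM evaluation items).

def surface_feature(sentence: str) -> bool:
    """Surface cue: e.g., the sentence contains the word 'the'."""
    return "the" in sentence.lower().split()

def linguistic_feature(sentence: str) -> bool:
    """Structural cue, stubbed here as 'contains an irregular past-tense verb'."""
    return any(w in {"went", "sang", "ate"} for w in sentence.lower().split())

def build_splits(sentences):
    """Ambiguous split: both cues agree, so either generalization fits the labels.
    Disambiguating split: cues conflict; labels follow the structural cue."""
    ambiguous, disambiguating = [], []
    for s in sentences:
        surf, ling = surface_feature(s), linguistic_feature(s)
        (ambiguous if surf == ling else disambiguating).append((s, int(ling)))
    return ambiguous, disambiguating

def linguistic_alignment(predict, disambiguating):
    """Fraction of conflict items where the model's prediction tracks the structural
    label; near 0 means it converged on the surface cue instead."""
    return sum(predict(s) == y for s, y in disambiguating) / len(disambiguating)
```

A model fine-tuned only on the ambiguous split can score perfectly there with either rule; only the disambiguating split separates the two.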
The key insight: a model can produce correct outputs on typical evaluation tasks while relying on surface generalizations rather than structural ones. If test sets are not specifically designed to rule out surface heuristics, you cannot tell which kind of generalization the model is using.
This has wide implications for how we evaluate LLMs. When a model answers a grammaticality judgment task correctly, we tend to assume it has learned the relevant grammar. But it may have learned that short sentences with common words tend to be grammatical, that sentences with complex embeddings tend to be flagged as ungrammatical, or some other surface regularity that happens to correlate with the training labels.
Instruction tuning provides a striking parallel: Does instruction tuning teach task understanding or output format? shows that instruction-tuned models achieve comparable accuracy even when instructions are replaced with simplified or deliberately wrong ("delusive") instructions. The models learn the output format distribution (what kind of response is expected) rather than the task semantics the instructions describe. The "instruction-following" that benchmarks measure is largely format compliance, which correlates with task understanding but doesn't require it, precisely paralleling how syntactic benchmark performance correlates with grammatical knowledge without requiring it.
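A sketch of that comparison, with a hypothetical `model_answer` callable and placeholder instruction strings standing in for the real setup: hold the inputs fixed, swap only the instruction text, and check whether accuracy moves.

```python
# Compare accuracy under the original instruction vs. a simplified or deliberately
# wrong ("delusive") one. A near-zero gap suggests the model keys on output format,
# not on what the instruction actually says.

def accuracy(model_answer, examples, instruction):
    """examples: list of (input_text, gold_answer); model_answer: prompt -> string."""
    correct = sum(
        model_answer(f"{instruction}\n\nInput: {x}\nOutput:").strip() == gold
        for x, gold in examples
    )
    return correct / len(examples)

def delusive_gap(model_answer, examples, original_instruction, delusive_instruction):
    """Accuracy drop when the instruction is replaced; a small drop = format compliance."""
    return (accuracy(model_answer, examples, original_instruction)
            - accuracy(model_answer, examples, delusive_instruction))
```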
The distinction matters for robustness: surface generalizations fail on unusual structures. Linguistic generalizations are rule-governed and extend systematically to novel forms. If deployment involves unusual syntactic structures, a model relying on surface heuristics will fail — and the failure won't be predictable from standard benchmark performance.
A behavioral counterpart exists in moral reasoning: Do LLMs generalize moral reasoning by meaning or surface form? Minimal wording changes that reverse the moral meaning of a scenario (e.g., "wrongfully convicted" → "rightfully convicted") leave LLM moral ratings nearly unchanged (r = .99) while human ratings shift substantially (r = .54). This extends the surface-generalization finding from grammatical structure into behavioral and moral reasoning: the same failure mode operating at a higher cognitive level. Humans track the semantic reversal; LLMs track the token distribution.
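A minimal sketch of that sensitivity check, assuming a hypothetical `rate` callable that returns a scalar moral-acceptability score for a scenario text: correlate ratings of the originals with ratings of the meaning-reversed rewordings. A correlation near 1 means the judgments barely registered the semantic reversal.

```python
# Correlate ratings of original scenarios with ratings of their minimally reworded,
# meaning-reversed counterparts (plain Pearson r, no external dependencies).
from statistics import mean

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

def reversal_correlation(rate, scenario_pairs):
    """scenario_pairs: list of (original_text, meaning_reversed_text).
    Returns r between ratings before and after the reversal."""
    original_ratings = [rate(original) for original, _ in scenario_pairs]
    reversed_ratings = [rate(reversed_text) for _, reversed_text in scenario_pairs]
    return pearson_r(original_ratings, reversed_ratings)
```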
Source: Discourses; enriched from Training Fine Tuning
Related concepts in this collection
- Can language models learn grammar from child-scale data? If models trained on ~100 million words (roughly what children experience) can match human syntactic performance, what does that tell us about how much data is actually necessary for learning grammar? The qualification: approaching human performance doesn't mean using the same underlying rules.
- Does LLM grammatical performance decline with structural complexity? This explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity, which matters for deciding when LLM annotations are reliable. The practical consequence: complex structures break surface heuristics.
- Do hedging markers actually signal careful thinking in AI? Explores whether linguistic markers like "alternatively" and "however" in model outputs correlate with accuracy or uncertainty; users often read such language as a sign of trustworthy reasoning. The inference-time parallel: surface markers (hedging, explicit connectives) are unreliable proxies for underlying competence, just as surface learning heuristics are unreliable proxies for grammatical rules.
- Why do language models fail at communicative optimization? LLMs excel at learning surface statistical patterns from text but struggle with deeper principles of how language achieves efficient communication; what distinguishes these two types of linguistic knowledge? The cross-linguistic taxonomy: "Do LLMs Resemble Humans" maps exactly which regularities transfer (sound symbolism, structural priming) and which fail (word economy, syntactic ambiguity avoidance); the surface/structural distinction runs through all of them.
- Do LLMs generalize moral reasoning by meaning or surface form? When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or merely reproduce training-distribution patterns. Behavioral evidence in the moral domain: the same surface-over-structure failure appears in moral judgment.
- Can language models solve ToM benchmarks without real reasoning? Do current theory-of-mind benchmarks actually measure mental-state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This determines whether benchmark performance indicates genuine understanding. ToM benchmarks are another domain where correct outputs do not prove structural learning: SFT matches RL on ToM without reasoning training, suggesting models exploit distributional patterns in benchmark structure rather than performing genuine mental-state inference.
Original note title: lms may learn surface generalizations rather than linguistic generalizations despite correct outputs