Language Understanding and Pragmatics · LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Can models pass tests while missing the actual grammar?

Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.

Note · 2026-02-21 · sourced from Discourses
Where exactly does language competence break down in LLMs? How should researchers navigate LLM reasoning research?

The BabyLM Challenge included an evaluation specifically designed to distinguish two kinds of generalization: surface generalizations, which track shallow regularities of the input, and linguistic generalizations, which track structural rules.

Models were fine-tuned on an ambiguous training set whose labels were consistent with either generalization, then evaluated on a test set that disambiguated which one the model had converged on.
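
A minimal sketch of that protocol, assuming a binary grammaticality-style classification task: the features, field names, and `model_predict` function below are hypothetical placeholders, not the BabyLM implementation.

```python
# Minimal sketch of the ambiguous-train / disambiguating-test protocol.
# `model_predict`, the features, and the field names are hypothetical.

def surface_feature(example):
    # A shallow property, e.g. "the sentence is short".
    return len(example["text"].split()) <= 6

def linguistic_feature(example):
    # A structural property, e.g. "the subject and verb agree".
    return example["subject_verb_agree"]

def build_ambiguous_train(pool):
    # Keep only examples where both features point to the same label,
    # so the training data cannot distinguish the two generalizations.
    return [ex for ex in pool if surface_feature(ex) == linguistic_feature(ex)]

def build_disambiguating_test(pool):
    # Keep only examples where the features conflict; predictions here
    # reveal which generalization the model converged on.
    return [ex for ex in pool if surface_feature(ex) != linguistic_feature(ex)]

def diagnose(model_predict, test_set):
    n = len(test_set)
    return {
        "agrees_with_surface": sum(model_predict(ex) == surface_feature(ex) for ex in test_set) / n,
        "agrees_with_structure": sum(model_predict(ex) == linguistic_feature(ex) for ex in test_set) / n,
    }
```

The diagnostic is the test-set agreement: a model whose predictions track `surface_feature` has converged on the surface generalization, even if its training accuracy was perfect.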

The key insight: a model can produce correct outputs on typical evaluation tasks while relying on surface generalizations rather than structural ones. If test sets are not specifically designed to rule out surface heuristics, you cannot tell which kind of generalization the model is using.

This has wide implications for how we evaluate LLMs. When a model answers a grammaticality judgment task correctly, we tend to assume it has learned the relevant grammar. But it may have learned that short sentences with common words tend to be grammatical, that sentences with complex embeddings tend to be flagged as ungrammatical, or some other surface regularity that happens to correlate with the training labels.
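
A toy illustration of how far a surface rule can go when it happens to correlate with the gold labels; the sentences and labels below are fabricated purely for the example.

```python
# Toy illustration with fabricated data: a heuristic that knows no syntax
# still matches the labels whenever surface properties correlate with them.

examples = [
    ("the cat sleeps", True),
    ("dogs bark loudly", True),
    ("the idea that the claim that he left was false surprised", False),
    ("cat the sleeps quietly on mats green", False),
]

def length_heuristic(sentence, max_words=5):
    # "Short sentences tend to be grammatical" -- a surface rule only.
    return len(sentence.split()) <= max_words

accuracy = sum(length_heuristic(s) == label for s, label in examples) / len(examples)
print(f"heuristic accuracy: {accuracy:.0%}")  # 100% on this set, yet no grammar was learned
```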

Instruction tuning provides a striking parallel: Does instruction tuning teach task understanding or output format? shows that IT models achieve comparable accuracy even when instructions are replaced with simplified or deliberately wrong ("delusive") instructions. Models learn the output format distribution — what kind of response is expected — rather than the task semantics the instructions describe. The "instruction-following" that benchmarks measure is largely format compliance that correlates with task understanding but doesn't require it, precisely paralleling how syntactic benchmark performance correlates with grammatical knowledge but doesn't require it.
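
A sketch of that comparison, assuming a classification-style evaluation; the instructions, prompt template, and `generate` call are illustrative placeholders rather than the study's setup.

```python
# Sketch of the real-vs-delusive instruction comparison. `generate` stands in
# for a model call; the instructions and prompt format are illustrative.

def accuracy_under_instruction(generate, instruction, examples):
    correct = 0
    for text, gold in examples:
        prediction = generate(f"{instruction}\n\nInput: {text}\nAnswer:").strip()
        correct += prediction == gold
    return correct / len(examples)

def format_vs_semantics_gap(generate, examples):
    real = accuracy_under_instruction(
        generate,
        "Classify the sentiment of the input as positive or negative.",
        examples,
    )
    delusive = accuracy_under_instruction(
        generate,
        "Translate the input into French.",  # describes the wrong task
        examples,
    )
    # A small gap means accuracy survives a wrong task description,
    # which is evidence of format learning rather than instruction semantics.
    return {"real": real, "delusive": delusive, "gap": real - delusive}
```

If the gap is near zero, the measured "instruction following" is consistent with format learning alone.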

The distinction matters for robustness: surface generalizations fail on unusual structures. Linguistic generalizations are rule-governed and extend systematically to novel forms. If deployment involves unusual syntactic structures, a model relying on surface heuristics will fail — and the failure won't be predictable from standard benchmark performance.

A behavioral counterpart exists in moral reasoning: Do LLMs generalize moral reasoning by meaning or surface form?. Minimal wording changes that reverse the moral meaning of a scenario (e.g., "wrongfully convicted" → "rightfully convicted") leave LLM moral ratings nearly unchanged (r=.99) while human ratings shift substantially (r=.54). This extends the surface-generalization finding from grammatical structure into behavioral/moral reasoning — the same failure mode operating at a higher cognitive level. Humans track the semantic reversal; LLMs track the token distribution.
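
A sketch of the correlation check behind those numbers; the ratings below are placeholders that show the computation, not the study's data (which yielded r=.99 for LLMs and r=.54 for humans).

```python
# Sketch of the minimal-pair correlation check. The ratings are placeholders
# to show the computation; the r values in the text come from the study.
from scipy.stats import pearsonr

def reversal_correlation(original_ratings, reversed_ratings):
    # High r: ratings barely move when the moral meaning flips,
    # i.e. the rater tracks surface form rather than semantics.
    r, _ = pearsonr(original_ratings, reversed_ratings)
    return r

llm_original   = [6.8, 2.1, 5.5, 1.9, 6.2]
llm_reversed   = [6.7, 2.2, 5.4, 2.0, 6.1]   # nearly unchanged -> r close to 1
human_original = [6.5, 2.3, 5.8, 1.7, 6.0]
human_reversed = [3.9, 3.1, 3.0, 4.2, 4.5]   # shifted with the meaning -> much lower r

print(reversal_correlation(llm_original, llm_reversed))
print(reversal_correlation(human_original, human_reversed))
```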


Source: Discourses; enriched from Training Fine Tuning

Original note title

LMs may learn surface generalizations rather than linguistic generalizations despite correct outputs