INQUIRING LINE

Can benchmark performance distinguish surface from structural linguistic knowledge?

This explores whether the scores a model earns on standard benchmarks can actually tell us if it learned real grammatical structure, or just surface shortcuts that happen to produce the right answer.


The corpus's blunt answer is: usually not, unless the test was specifically built to rule shortcuts out. BabyLM evaluations showed that models can produce correct outputs by leaning on sentence length, word choice, and spelling rather than grammatical rules, and that ordinary benchmarks cannot separate the two kinds of generalization without targeted adversarial tests (Can models pass tests while missing the actual grammar?). The manipulation that exposes the gap is structural complexity: when syntactic depth increases (embedded clauses, recursion, complex nominals), even top models like Llama3-70b degrade in a smooth, predictable way (Why do large language models fail at complex linguistic tasks?; Does LLM grammatical performance decline with structural complexity?). A model with genuine rules wouldn't fall apart as sentences nest deeper; one running on surface heuristics does exactly that. So benchmark performance *can* distinguish the two, but only when difficulty is dialed along the structural axis, not the easy-sentence axis where shortcuts still pass.
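One way to picture "dialing difficulty along the structural axis" is to bucket benchmark items by syntactic depth and look at how accuracy changes from bucket to bucket. The sketch below is illustrative only: the function names, the `(depth, correct)` record format, and the toy results are assumptions made for this example, not data or methods from the cited notes.

```python
from collections import defaultdict


def accuracy_by_depth(results):
    """Map each syntactic depth to the fraction of items answered correctly.

    `results` is an iterable of (depth, correct) pairs; this record format
    is a hypothetical choice for the sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for depth, correct in results:
        totals[depth] += 1
        hits[depth] += int(correct)
    return {d: hits[d] / totals[d] for d in sorted(totals)}


def depth_slope(acc):
    """Ordinary least-squares slope of accuracy against depth."""
    xs = list(acc)
    ys = [acc[x] for x in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("need at least two distinct depths")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / var_x


# Made-up results: depth 1 all correct, depth 2 three of four, depth 3 two of four.
toy_results = (
    [(1, True)] * 4
    + [(2, True)] * 3 + [(2, False)]
    + [(3, True)] * 2 + [(3, False)] * 2
)
acc = accuracy_by_depth(toy_results)
slope = depth_slope(acc)
print(acc, slope)
```

On the toy data the per-depth accuracies are 1.0, 0.75, and 0.5, giving a slope of -0.25 per level of depth. An aggregate score over the same items (0.75) would hide that trend, which is the sense in which the decline, not the average, carries the signal.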

What's striking is that this same surface-vs-structure problem shows up far beyond grammar, which suggests it's a property of how these models learn rather than a quirk of syntax tests. Models systematically prefer high-frequency phrasings over semantically identical rare ones across math, translation, and reasoning: they're tracking statistical mass from pretraining, not meaning (Do language models really understand meaning or just surface frequency?). Theory-of-mind benchmarks turn out to be solvable by pure pattern matching: supervised fine-tuning matches reinforcement learning, a tell-tale sign that the test rewards templated artifacts rather than real mental-state reasoning (Can language models solve ToM benchmarks without real reasoning?; Do large language models genuinely simulate mental states?). The lesson repeats: a high score on an unguarded benchmark is ambiguous evidence.
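The frequency preference described above can be probed with paired items: the same question asked once in a common phrasing and once in a rare but semantically equivalent one. The following is a minimal sketch under that assumption; `paraphrase_gap`, the pair format, and the toy outcomes are hypothetical and do not reproduce any experiment from the cited notes.

```python
def paraphrase_gap(pairs):
    """Summarize accuracy on frequent vs. rare phrasings of the same items.

    `pairs` is a list of (correct_on_frequent, correct_on_rare) booleans,
    one tuple per underlying item (a hypothetical format for this sketch).
    """
    if not pairs:
        raise ValueError("no pairs given")
    n = len(pairs)
    freq_acc = sum(1 for f, _ in pairs if f) / n
    rare_acc = sum(1 for _, r in pairs if r) / n
    return {
        "freq_acc": freq_acc,
        "rare_acc": rare_acc,
        "gap": freq_acc - rare_acc,
        # Items solved in only one of the two phrasings.
        "freq_only": sum(1 for f, r in pairs if f and not r),
        "rare_only": sum(1 for f, r in pairs if r and not f),
    }


# Made-up outcomes for four paired items.
toy_pairs = [(True, True), (True, False), (True, False), (False, False)]
summary = paraphrase_gap(toy_pairs)
print(summary)
```

Because both phrasings mean the same thing, a model that tracked meaning should show a gap near zero and roughly balanced `freq_only` and `rare_only` counts. A large positive gap with lopsided counts is the pattern the paragraph attributes to tracking surface frequency.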

There's a sharper twist worth knowing. Not every performance collapse means missing structural knowledge; sometimes the benchmark is measuring the wrong thing. Reasoning failures have been traced to *instance-level unfamiliarity* rather than task complexity: models succeed on reasoning chains when they have seen similar instances, regardless of length, which means a 'complexity cliff' can actually be a novelty boundary in disguise (Do language models fail at reasoning due to complexity or novelty?). Some apparent reasoning collapses are really execution-bandwidth limits: text-only models *know* an algorithm but can't run it step by step at scale, and the cliff vanishes once they are given tools (Are reasoning model collapses really failures of reasoning?). Input length alone also degrades reasoning at lengths well below the context window, in a way uncorrelated with language-modeling skill (Does reasoning ability actually degrade with longer inputs?). The takeaway: a benchmark drop can mean 'no structural knowledge,' but it can also mean 'novel instance,' 'can't execute,' or 'too much padding,' so the score alone underdetermines the diagnosis in both directions.
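Since a single drop in score is compatible with several causes, one hedged way to separate them is to change one factor at a time against a shared baseline and compare the resulting drops. The sketch below assumes such controlled conditions exist; the function name, the condition labels, and all accuracy values are invented for illustration.

```python
def rank_accuracy_drops(baseline, perturbed):
    """Return (condition, drop) pairs sorted by largest accuracy drop first.

    `baseline` is accuracy on the unperturbed items; `perturbed` maps a
    condition label to accuracy when only that factor is changed.
    """
    drops = {name: round(baseline - acc, 4) for name, acc in perturbed.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


# Made-up accuracies: each condition changes exactly one factor.
toy_baseline = 0.90
toy_perturbed = {
    "deeper_structure": 0.85,
    "novel_instances": 0.60,
    "padded_input": 0.70,
}
ranking = rank_accuracy_drops(toy_baseline, toy_perturbed)
for condition, drop in ranking:
    print(f"{condition}: -{drop:.2f}")
```

In this toy case novelty and padding account for far more of the loss than structural depth, so reading the overall decline as missing structural knowledge would be a misdiagnosis. The comparison is only as good as the controls: if two factors change together, their drops cannot be told apart this way.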

The deeper point the corpus leaves you with is that no amount of clever prompting closes a structural gap: prompt optimization only reorganizes knowledge already in the training distribution and hits a hard ceiling against knowledge that was never learned (Can prompt optimization teach models knowledge they lack?). So if a model lacks structural grammar, you'll see it precisely where benchmarks stress structure, and you can't paper over it at inference time. Benchmark performance is a usable instrument for telling surface from structure, but only as a *contrast*: easy items reveal nothing, and it's the predictable decline as you scale structural depth that becomes the actual signal.


Sources (10 notes)

Can models pass tests while missing the actual grammar?

BabyLM evaluations showed models can produce correct outputs by relying on sentence length, word choice, and orthography rather than grammatical structure. Standard benchmarks cannot distinguish these two generalization types without tests specifically designed to rule out surface heuristics.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Do language models really understand meaning or just surface frequency?

LLMs show a consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests that the models' primary mechanism is tracking statistical mass from pretraining rather than recognizing meaning.

Can language models solve ToM benchmarks without real reasoning?

Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain succeeds when the model was trained on similar instances, regardless of its length.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
