INQUIRING LINE

Do language models encode deep syntactic structure or only surface-level patterns?

This asks whether LLMs genuinely represent grammar (the hierarchical, rule-governed scaffolding of language) or just exploit statistical shortcuts such as word length, common phrasings, and other surface cues that happen to look like grammatical knowledge.


This explores the gap between models that pass grammar tests and models that actually encode grammar, and the corpus lands on a genuinely split verdict that's more interesting than either extreme. On the skeptical side, there's hard evidence for surface mimicry. Models can produce correct outputs by leaning on sentence length, word choice, and orthography rather than structure, and standard benchmarks can't tell the difference unless they're specifically built to rule out those heuristics [Can models pass tests while missing the actual grammar?]. Worse, the failures aren't random: even top-tier models systematically misidentify embedded clauses and complex nominals, and the error rate climbs *predictably* as syntactic depth increases, which is exactly the signature you'd expect if statistical pattern-matching is standing in for real grammatical rules [Why do large language models fail at complex linguistic tasks?].
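One way to picture what "built to rule out those heuristics" means is to score a benchmark with a baseline that sees nothing but sentence length. The sketch below is only an illustration under stated assumptions: the minimal pairs are invented toy examples, and `length_baseline_accuracy` is a hypothetical helper, not part of BabyLM or any evaluation suite. If a length-only rule scores far from chance on a set of pairs, that set cannot separate grammatical knowledge from a length shortcut.

```python
def length_baseline_accuracy(pairs):
    """Score a 'prefer the shorter sentence' rule on minimal pairs.

    Each pair is (grammatical, ungrammatical). The rule never looks at
    structure, only at character length. Ties count as half credit.
    """
    score = 0.0
    for good, bad in pairs:
        if len(good) < len(bad):
            score += 1.0   # the shortcut happens to pick the grammatical one
        elif len(good) == len(bad):
            score += 0.5   # length gives no signal
    return score / len(pairs)


# Toy pairs, invented for illustration only.
# Confounded set: the grammatical sentence is always the shorter one.
confounded = [
    ("The dogs bark.", "The dogs barks."),
    ("A bird sings.", "A birds sings."),
]

# Balanced set: the shorter sentence is grammatical only half the time.
balanced = [
    ("The dogs bark.", "The dogs barks."),
    ("The cat sleeps.", "The cat sleep."),
]

confounded_acc = length_baseline_accuracy(confounded)
balanced_acc = length_baseline_accuracy(balanced)
print(confounded_acc, balanced_acc)
```

On the confounded set the length rule looks like perfect grammatical competence; on the balanced set it falls to chance. A model's score on the first set therefore says little about structure.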

But the picture flips when you look inside the network instead of at its outputs. The Polar Probe study found that models spontaneously encode syntactic relations as geometry, using both distance *and* angle between embeddings (a polar-coordinate scheme) to capture the type and direction of a grammatical relation, nearly doubling accuracy over methods that read distance alone [How do language models encode syntactic relations geometrically?]. That's not surface bookkeeping; it's a structured representation, compatible with symbolic analysis, that no one designed in. And given the right scaffolding, models can go further still: with step-by-step reasoning, OpenAI's o1 constructs valid syntactic trees and phonological generalizations, meaning the capacity isn't just to *use* grammar but to *analyze* it [Can language models actually analyze language structure?].
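The difference between reading distance alone and reading distance plus angle can be made concrete with a toy example. This is a minimal sketch, not the Polar Probe itself: the 2-D "probe space", the prototype directions in `PROTOTYPES`, the relation labels, and the coordinates are all assumptions chosen for illustration, and a real probe would learn its projection from model activations.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def polar_readout(head, dep, prototypes):
    """Read a head -> dependent pair as (distance, relation label).

    Distance is the length of the difference vector; the label is the
    prototype direction that the difference vector points along most closely.
    Assumes head and dep are distinct points.
    """
    diff = [d - h for h, d in zip(head, dep)]
    distance = norm(diff)
    label = max(prototypes, key=lambda name: cosine(diff, prototypes[name]))
    return distance, label


# Hypothetical prototype directions for two relation types.
PROTOTYPES = {"nsubj": [1.0, 0.0], "obj": [0.0, 1.0]}

# Hypothetical positions of three words in the toy probe space.
verb, subject, obj = [0.0, 0.0], [1.0, 0.1], [0.1, 1.0]

dist_subj, label_subj = polar_readout(verb, subject, PROTOTYPES)
dist_obj, label_obj = polar_readout(verb, obj, PROTOTYPES)

# Swapping head and dependent negates the difference vector, so the
# cosine with the prototype changes sign: the angle carries direction.
flipped = cosine([h - d for h, d in zip(verb, subject)], PROTOTYPES["nsubj"])

print(dist_subj, label_subj, dist_obj, label_obj, flipped)
```

In this toy setup both dependents sit at the same distance from the verb, so a distance-only readout cannot tell subject from object, while the angle separates them and its sign records which word is the head.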

The way to reconcile these is to notice the two camps aren't measuring the same thing: the skeptical results score what a model outputs, while the probing results read what it represents internally. Depth seems to be where structure gets built: deep-and-thin small models beat wider ones precisely because composing abstract concepts across layers is what captures hierarchy [Does depth matter more than width for tiny language models?]. So a model can hold real structural representations internally while still failing behaviorally when the task pushes against its autoregressive grain, since failures track output *probability* rather than logical difficulty [Can we predict where language models will fail?]. Encoding structure and reliably deploying it are different achievements.
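The claim that failures track output probability can be sketched with the chain rule for an autoregressive model, where the log-probability of a whole response is the sum of its per-token log-probabilities. The numbers below are invented for illustration and were not measured from any model; `sequence_logprob` and `predicted_harder` are hypothetical helper names.

```python
import math


def sequence_logprob(token_probs):
    """Chain rule: log P(y) = sum over t of log p(y_t | y_<t)."""
    return sum(math.log(p) for p in token_probs)


def predicted_harder(targets):
    """Return the name of the target with the lowest log-probability.

    Under the 'autoregressive grain' view, this is the response the model
    is expected to get wrong most often, whatever its logical difficulty.
    """
    return min(targets, key=lambda name: sequence_logprob(targets[name]))


# Hypothetical per-token probabilities, invented for illustration only.
targets = {
    "forward": [0.9, 0.9, 0.9, 0.9],    # a familiar continuation
    "backward": [0.3, 0.2, 0.2, 0.2],   # a rarely seen continuation
}

harder = predicted_harder(targets)
print(harder, sequence_logprob(targets["forward"]), sequence_logprob(targets["backward"]))
```

The two targets here are equally simple to describe, yet the low-probability one is the one flagged as harder, which is the pattern the paragraph above attributes to the model's training objective.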

The lateral surprise, the thing you didn't know you wanted to know, is that this whole 'syntax' debate is the well-behaved cousin of a much harder one about *meaning*. Bender and Koller argue form-only training can never recover meaning, because meaning lives in the relation between expressions and communicative intent, which text prediction never sees [Can language models learn meaning from text patterns alone?]. The optimistic counter-reading is that LLMs operationalize Saussure's *langue*, the purely relational system of language, by compressing structure out of text alone, no external referents required [Can language models learn meaning without engaging the world?]. Syntax may be exactly the layer where relational compression succeeds beautifully; meaning may be where it hits a wall. The same models that quietly invent polar-coordinate grammar geometry may be structurally incapable of the grounding that grammar ultimately serves.


Sources (8 notes)

Can models pass tests while missing the actual grammar?

BabyLM evaluations showed models can produce correct outputs by relying on sentence length, word choice, and orthography rather than grammatical structure. Standard benchmarks cannot distinguish these two generalization types without tests specifically designed to rule out surface heuristics.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Can language models actually analyze language structure?

OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.

Can language models learn meaning from text patterns alone?

Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
