INQUIRING LINE

Do LLMs struggle more with semantic accuracy than syntactic correctness across domains?

This explores whether LLMs are worse at getting meaning right (semantics) than at getting form right (syntax) — and the corpus complicates the premise: both fail, but in revealingly different ways.


This reads the question as asking whether meaning is a harder problem for LLMs than grammatical form. The corpus suggests a twist: syntax isn't actually safe either. Grammatical competence degrades predictably as sentences get structurally deeper. Models handle simple clauses but fail consistently on recursion and embedding, and even top-tier models misidentify embedded clauses and complex nominals (Does LLM grammatical performance decline with structural complexity?; Why do large language models fail at complex linguistic tasks?). So the cleaner finding isn't 'semantics hard, syntax easy.' It's that both syntax and semantics appear to be handled by surface pattern-matching, and surface heuristics break down the moment structure or meaning requires more than statistical mass.
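One way to check the depth claim against your own evaluation data is to bucket accuracy by embedding depth. The following is a minimal sketch, not the method used in the cited notes: it assumes each sentence comes with a bracketed parse string and a flag for whether the model's analysis was correct. The function names and the sample records are hypothetical placeholders, not real results.

```python
from collections import defaultdict


def embedding_depth(bracketed: str) -> int:
    """Return the maximum nesting depth of round brackets in a parse string."""
    depth = max_depth = 0
    for ch in bracketed:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return max_depth


def accuracy_by_depth(records):
    """Map each depth to the fraction of records the model got right.

    `records` is an iterable of (bracketed_parse, model_was_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for parse, ok in records:
        d = embedding_depth(parse)
        totals[d] += 1
        correct[d] += int(ok)
    return {d: correct[d] / totals[d] for d in sorted(totals)}


# Hypothetical placeholder records, for illustration only (not real results).
sample = [
    ("(S (NP dogs) (VP bark))", True),
    ("(S (NP dogs) (VP run))", True),
    ("(S (NP the dog (SBAR that (S (NP cats) (VP fear)))) (VP barks))", False),
]

print(accuracy_by_depth(sample))
```

If accuracy falls steadily as the depth key rises, that matches the pattern the notes describe; a flat curve would count against it.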

That shared mechanism is the throughline. Models systematically prefer high-frequency phrasings over semantically equivalent rare paraphrases across math, translation, and reasoning, which means they're tracking how often a form appeared in pretraining, not what it means (Do language models really understand meaning or just surface frequency?). The same statistical-mass mechanism that fakes grammatical competence also fakes comprehension. When you strip the frequency crutch, the semantic gaps show up everywhere: GPT-4 disambiguates deliberately ambiguous text only 32% of the time versus 90% for humans, failing on lexical, structural, and scope ambiguity alike (Can language models recognize when text is deliberately ambiguous?).

The most striking semantic failures are ones where the model clearly 'has' the right meaning but doesn't use it. In 'potemkin understanding,' a model explains a concept correctly, fails to apply it, then recognizes its own failure, a pattern suggesting explanation and execution run on disconnected pathways (Can LLMs understand concepts they cannot apply?). Similarly, models accept false presuppositions even when direct questioning proves they know the fact is wrong (Why do language models accept false assumptions they know are wrong?), and they'll agree with claims they know are false out of trained agreeableness rather than ignorance (Why do language models agree with false claims they know are wrong?). These aren't knowledge gaps; they're failures to bind knowledge to output.
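The distinction between a knowledge gap and a binding failure can be made concrete by pairing two probes per item: a direct question about the fact, and a prompt that embeds the false presupposition. The sketch below is an assumption-laden illustration, not the FLEX protocol: the labels, function names, and example items are all hypothetical.

```python
from collections import Counter


def classify(knows_fact: bool, rejected_false_premise: bool) -> str:
    """Label one item from two probes of the same underlying fact."""
    if rejected_false_premise:
        return "rejected"
    # The model accepted the false premise: separate ignorance from
    # the case where it demonstrably knew better.
    return "binding_failure" if knows_fact else "knowledge_gap"


def tally(items):
    """Count labels over (knows_fact, rejected_false_premise) pairs."""
    return Counter(classify(knows, rejected) for knows, rejected in items)


# Hypothetical placeholder outcomes, for illustration only.
items = [(True, False), (True, True), (False, False)]
print(tally(items))
```

Only the "binding_failure" bucket corresponds to the failures described above; "knowledge_gap" items are ordinary ignorance and call for different fixes.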

The 'across domains' part of the question gets a sharp answer. Semantic competence is uneven by domain: general-text training leaves models confidently wrong in specialized fields like clinical inference, where prompting tricks that fix general tasks don't dent the overconfidence (Why do language models fail confidently in specialized domains?). And there's a clean split between semantic and structured tasks: long-context models can match retrieval systems on meaning-based lookup but completely fail at relational queries needing joins across structured data (Can long-context LLMs replace retrieval-augmented generation systems?). The dimension that actually predicts failure isn't 'semantic vs. syntactic'; it's 'covered by surface statistics vs. requires compositional structure.'
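To make the lookup-versus-join split concrete, here is a small sketch with two toy tables. It is not taken from the LOFT benchmark; the table contents and function names are hypothetical. A lookup can be answered from a single row, whereas the relational query has to combine rows from both tables through a shared key.

```python
# Hypothetical toy tables, for illustration only.
employees = [
    {"name": "Ada", "dept_id": 1},
    {"name": "Lin", "dept_id": 2},
    {"name": "Sam", "dept_id": 1},
]
departments = [
    {"dept_id": 1, "city": "Oslo"},
    {"dept_id": 2, "city": "Lima"},
]


def single_table_lookup(rows, field, value):
    """Lookup: every answer sits in one row of one table."""
    return [row for row in rows if row[field] == value]


def employees_in_city(city):
    """Relational query: join employees to departments on dept_id."""
    dept_ids = {d["dept_id"] for d in departments if d["city"] == city}
    return sorted(e["name"] for e in employees if e["dept_id"] in dept_ids)


print(single_table_lookup(employees, "name", "Lin"))
print(employees_in_city("Oslo"))  # ['Ada', 'Sam']
```

No single row mentions both "Ada" and "Oslo", so the second query cannot be answered by finding one relevant passage; the answer only exists after the two tables are composed.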

The thing you didn't know you wanted to know: the syntax/semantics distinction may be the wrong axis entirely. Across these notes the real fault line is between what frequency can fake and what requires genuine composition — holding multiple interpretations at once, applying a stated rule, joining structured facts, tracking truth against social pressure. LLMs struggle wherever the answer can't be reached by surface association, and that cuts across both grammar and meaning rather than separating them.


Sources (9 notes)

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Why do language models accept false assumptions they know are wrong?

The FLEX benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT-4: 84% vs. Mistral: 2.44%), not from ignorance but from a preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Why do language models fail confidently in specialized domains?

LLMs trained on general text lack sufficient exposure to domain-specific examples, leading to low accuracy paired with high confidence in clinical NLI tasks. Prompting techniques that improved general performance fail to reduce overconfidence in specialized domains.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
