INQUIRING LINE

What distinguishes entity errors from relation errors in LLM output?

This explores whether LLMs fail differently when they get individual things wrong (entity errors — a wrong fact, name, or value) versus when they get the connections between things wrong (relation errors — how facts depend on, modify, or embed within each other), and what the corpus says about why one is harder than the other.


The corpus is striking on this split: it almost never frames the distinction by that name, yet nearly every failure study lands on the same fault line. Entities are the easy part. Relations are where the systematic breakage lives.

The clearest evidence comes from grammar. When researchers track LLM performance as sentences get structurally deeper, simple statements are handled cleanly while recursion, embedding, and complex nominals fail in predictable ways (see "Does LLM grammatical performance decline with structural complexity?" and "Why do large language models fail at complex linguistic tasks?"). The model can name the nouns and verbs (the entities) but misreads how a clause sits *inside* another clause (the relation). A sharper version shows up with presupposition triggers and non-factive verbs: the model treats words like "pretended" or "realized" as surface cues instead of computing their opposite effects on the truth of what they embed ("realized" commits the speaker to the embedded clause, "pretended" undercuts it), so it can get every entity in the sentence right and still draw the wrong conclusion (see "Why do embedding contexts confuse LLM entailment predictions?"). That's a relation error in its purest form: all the parts are correct, the wiring between them is not.
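To make the "wiring" concrete, here is a minimal sketch of the relational step the model is said to skip: deciding what an embedding verb implies about the clause it embeds. The lexicon and the names `Effect`, `VERB_EFFECT`, and `complement_status` are hypothetical and hand-written for illustration. They are not taken from the studies cited above, and real verbs have more readings than a lookup table can capture.

```python
from enum import Enum


class Effect(Enum):
    ENTAILED = "complement is taken to be true"
    NEGATED = "complement is taken to be false"
    UNKNOWN = "complement's truth is left open"


# Hypothetical toy lexicon (an assumption for illustration, not a real resource).
VERB_EFFECT = {
    "realized": Effect.ENTAILED,  # factive: presupposes the embedded clause
    "knew": Effect.ENTAILED,      # factive
    "pretended": Effect.NEGATED,  # on its usual reading, the embedded clause is false
    "believed": Effect.UNKNOWN,   # non-factive: no commitment either way
    "hoped": Effect.UNKNOWN,      # non-factive
}


def complement_status(verb: str) -> Effect:
    """Return what the embedding verb implies about its complement clause."""
    # Verbs missing from the toy lexicon default to "left open".
    return VERB_EFFECT.get(verb, Effect.UNKNOWN)


def entails_complement(verb: str) -> bool:
    """True only when the verb commits the speaker to the embedded clause."""
    return complement_status(verb) is Effect.ENTAILED


if __name__ == "__main__":
    for verb in ("realized", "pretended", "believed"):
        print(f"She {verb} the door was locked -> {complement_status(verb).value}")
```

The entities in "She realized the door was locked" and "She pretended the door was locked" are identical. Only the verb's effect on the embedded clause differs, and that is the part a surface-cue reading gets wrong.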

The same shape recurs above the sentence level. 'Potemkin understanding' describes a model that can define a concept accurately (entity) yet fails to apply it and even recognizes its own failure, a pattern the authors read as explanation and execution running on disconnected pathways (see "Can LLMs understand concepts they cannot apply?"). And the 'modern frame problem' work shows that models don't lack the relevant facts; they fail to bring the *right background conditions forward as constraints*. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85% (see "Do language models fail at identifying unstated preconditions?"). In both cases the entities are sitting right there in the model. What's missing is the relational step that binds a fact to where it matters.
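As a rough illustration of what "forcing explicit enumeration of preconditions" could look like in practice, the sketch below wraps a question in a scaffold that asks for preconditions before the answer. The function name `build_precondition_prompt` and the wording of the steps are assumptions made for this example. The cited note does not give the prompt that produced the 30% to 85% improvement, so this should not be read as a reproduction of it.

```python
def build_precondition_prompt(question: str) -> str:
    """Wrap a question so preconditions must be listed before the answer.

    Hypothetical scaffold: the step wording is illustrative only.
    """
    steps = [
        "List every unstated precondition that must hold for the answer to be valid.",
        "For each precondition, say whether the scenario satisfies it.",
        "Only then give the final answer, citing the preconditions you relied on.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"Question: {question}\n\nBefore answering:\n{numbered}\n"


if __name__ == "__main__":
    print(build_precondition_prompt("Can I repaint the fence this afternoon?"))
```

The design choice is simply to make the relational step a required output, so that background conditions have to be surfaced before they can be ignored.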

Why does this distinction earn its keep? Because the two error types behave differently over time. Entity errors tend to be local and visible: a wrong date, a swapped name. Relation errors compound. When frontier models relay documents through long delegated workflows, they silently corrupt ~25% of content, with errors that build on each other without plateauing across 50 round-trips (see "Do frontier LLMs silently corrupt documents in long workflows?"). A single misread dependency early on can warp everything downstream, which is what you'd expect when the failure is in the structure connecting facts rather than in the facts themselves.
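A quick back-of-the-envelope calculation shows why this kind of degradation is easy to miss. The cited note reports only the aggregate (~25% over 50 round-trips), not the per-round curve, so the sketch below assumes a constant, independently compounding loss rate per round-trip. That is a simplification, since the note says the errors build on each other, and the helper names `per_round_loss` and `cumulative_loss` are hypothetical.

```python
def per_round_loss(total_loss: float, rounds: int) -> float:
    """Constant per-round loss rate that compounds to `total_loss` after `rounds`.

    Assumes each round-trip corrupts the same fraction of the surviving content.
    """
    if not 0.0 <= total_loss < 1.0:
        raise ValueError("total_loss must be in [0, 1)")
    if rounds <= 0:
        raise ValueError("rounds must be positive")
    retained_per_round = (1.0 - total_loss) ** (1.0 / rounds)
    return 1.0 - retained_per_round


def cumulative_loss(rate: float, rounds: int) -> float:
    """Fraction of content corrupted after `rounds` at a constant per-round `rate`."""
    return 1.0 - (1.0 - rate) ** rounds


if __name__ == "__main__":
    rate = per_round_loss(0.25, 50)
    print(f"per-round loss: {rate:.4%}")
    for n in (1, 10, 25, 50):
        print(f"after {n:>2} round-trips: {cumulative_loss(rate, n):.1%}")
```

Under that assumption each round-trip corrupts a little under 0.6% of the content, which is small enough to pass a spot check on any single hop while still adding up to a quarter of the document by the end.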

The takeaway the corpus didn't advertise but keeps demonstrating: LLMs are statistical pattern-matchers that learned the *items* of language and knowledge far better than the *structure* binding them. So entity errors are largely a coverage-and-recall problem (did the model know the fact?), while relation errors are an architectural one (can it compute how facts constrain each other?) — and the fixes diverge accordingly: more or better data helps the first, but the second tends to need scaffolding that forces the relations to be made explicit.


Sources (6 notes)

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Why do embedding contexts confuse LLM entailment predictions?

LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Do language models fail at identifying unstated preconditions?

LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
