Do distributed relational tasks consistently underperform local classification across NLP domains?

This reads as: do tasks that require tracking relationships across a structure (embedded clauses, long-range dependencies, multi-step logic) reliably do worse than tasks where the model just classifies a local pattern — and is that gap consistent across language domains?

In short, the question is whether 'relating things across a span' is systematically harder for language models than 'recognizing a local pattern.' The corpus says yes, fairly consistently, but it reframes *why* in a way that's more interesting than the question assumes. The deciding factor isn't relational versus local structure; it's surface distance and familiarity.

The clearest evidence is structural: top models like Llama3-70b reliably misidentify embedded clauses, verb phrases, and complex nominals, and the error rate climbs *predictably* as syntactic depth increases (Why do large language models fail at complex linguistic tasks?). Pull the relevant tokens apart and performance degrades even when nothing else changes: on FLenQA, reasoning accuracy drops from 92% to 68% with just 3,000 tokens of padding, far below the context window limit, and chain-of-thought doesn't rescue it (Does reasoning ability actually degrade with longer inputs?). So the moment a task forces the model to hold a relationship across distance or depth, it weakens.
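The padding manipulation described above can be sketched in a few lines. This is a minimal illustration and not the FLenQA protocol: the helper names (`pad_between`, `accuracy_by_padding`), the filler sentence, and the whitespace-based token count are all assumptions introduced here, and a real study would count tokens with the model's own tokenizer.

```python
from collections import defaultdict

# Hypothetical irrelevant text used to push the relevant facts apart.
FILLER = "The weather report mentioned nothing of interest today."


def pad_between(facts, question, padding_tokens, filler=FILLER):
    """Join `facts` with about `padding_tokens` filler words spread across the gaps.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    gaps = max(len(facts) - 1, 1)
    per_gap = padding_tokens // gaps
    filler_words = filler.split()
    # Repeat the filler enough times, then truncate to the per-gap budget.
    reps = per_gap // len(filler_words) + 1
    gap_text = " ".join((filler_words * reps)[:per_gap])
    separator = f" {gap_text} " if gap_text else " "
    return separator.join(facts) + " " + question


def accuracy_by_padding(results):
    """Aggregate (padding_tokens, is_correct) pairs into accuracy per padding level."""
    totals = defaultdict(lambda: [0, 0])  # padding -> [correct, total]
    for padding, correct in results:
        totals[padding][0] += int(correct)
        totals[padding][1] += 1
    return {p: c / n for p, (c, n) in sorted(totals.items())}
```

The same facts and question would be scored at several padding levels, so that any drop in accuracy can be attributed to distance alone rather than to a change in the task.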

But the deeper finding is that this isn't really about 'relational structure' as a category. When semantic content is decoupled from a logic task, performance collapses even with the correct rules sitting in context: models lean on token associations and parametric commonsense, not formal manipulation (Do large language models reason symbolically or semantically?). And they ignore in-context information entirely when training priors are strong enough to override it (Why do language models ignore information in their context?). So 'local classification' wins not because it's local, but because it rides familiar surface statistics; relational tasks lose because they demand formal manipulation that the model only approximates through association.

Here's the turn you might not expect: one note argues the breakdown isn't driven by complexity *at all*, but by instance-level novelty. Any reasoning chain succeeds if the model trained on similar instances, regardless of length, because models fit instances, not algorithms (Do language models fail at reasoning due to complexity or novelty?). Under that lens, your 'distributed relational task' underperforms only when it lands in unfamiliar territory; a well-represented relational task can do fine, and an unfamiliar 'local' one can fail. The era-sensitivity work makes this concrete: models do worse on historical legal cases than modern ones purely because recent cases are over-represented in training (Why do language models struggle with historical legal cases?).
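One way to test the novelty account against the complexity account is to cross-tabulate accuracy on both factors at once. The sketch below is illustrative only. It assumes each evaluation record carries a hypothetical `familiar` flag (whether similar instances plausibly appeared in training) and a `steps` count. Neither field, the `long_threshold` cutoff, nor the helper name `accuracy_grid` comes from the cited notes.

```python
from collections import defaultdict


def accuracy_grid(records, long_threshold=5):
    """Return accuracy for each (familiar, long_chain) cell.

    records: iterable of dicts with keys 'familiar' (bool),
             'steps' (int) and 'correct' (bool) -- all hypothetical fields.
    long_threshold: chains with at least this many steps count as long
                    (an arbitrary cutoff chosen for illustration).
    """
    cells = defaultdict(lambda: [0, 0])  # (familiar, long_chain) -> [correct, total]
    for record in records:
        key = (bool(record["familiar"]), record["steps"] >= long_threshold)
        cells[key][0] += int(record["correct"])
        cells[key][1] += 1
    return {key: correct / total for key, (correct, total) in cells.items()}
```

If the novelty account holds, accuracy should differ mainly between the familiar and unfamiliar cells and stay roughly flat across chain length; if complexity were the driver, the pattern would run the other way. Deciding what counts as 'familiar' is the hard part, and this sketch does not address it.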

The genuinely strange wrinkle: at the representational level, these models are *all* relational. Research framing LLMs through Saussure's *langue* shows they learn meaning entirely by compressing relational structure from text, with no external referents at all (Can language models learn meaning without engaging the world?). So a model whose entire competence is relational still stumbles on explicit relational *tasks*, which suggests the answer to your question is 'usually yes, but the cause is novelty and surface-distance, not relationality itself.'

Sources (7 notes)

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3,000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Why do language models struggle with historical legal cases?

A Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. The root cause is training-corpus over-representation of recent cases, which creates shallower representations of older precedent.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
