LLM Reasoning and Architecture · Language Understanding and Pragmatics · Design & LLM Interaction

Why do language models fail at temporal reasoning in complex tasks?

Language models answer simple temporal questions correctly yet produce logically impossible timelines when working through complex legal documents. This note explores which task features trigger these reasoning failures and whether the underlying competence is genuinely lost or merely masked by surface-level patterns.

Note · 2026-02-21 · sourced from Domain Specialization
How do you build domain expertise into general AI models? What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

The Supreme Court overruling benchmark documents a dissociation that clarifies the nature of LLM competence: models maintain basic temporal awareness in simple task formats but produce "temporally impossible relationships" when the same reasoning is required in complex open-ended contexts. A model can correctly answer "which case came first?" in a structured question but generate an overruling relationship that asserts a later case was overruled by an earlier one when navigating a long legal document.
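
As a concrete illustration, here is a minimal sketch of what "temporally impossible" means operationally, assuming each extracted relationship carries decision years; the class and field names are hypothetical, not part of the benchmark:

```python
from dataclasses import dataclass

@dataclass
class OverrulingClaim:
    overruling_case: str   # the case the model says did the overruling
    overruled_case: str    # the case the model says was overruled
    overruling_year: int   # decision year of the overruling case
    overruled_year: int    # decision year of the overruled case

def is_temporally_possible(claim: OverrulingClaim) -> bool:
    # A decision cannot overrule a case that had not yet been decided.
    return claim.overruling_year >= claim.overruled_year

# The failure mode described above: asserting that Brown v. Board of
# Education (1954) was overruled by Plessy v. Ferguson (1896).
bad_claim = OverrulingClaim("Plessy v. Ferguson", "Brown v. Board of Education",
                            overruling_year=1896, overruled_year=1954)
print(is_temporally_possible(bad_claim))  # False -> temporally impossible
```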

This is not random error. The pattern reveals something about the architecture of LLM competence: reasoning capabilities are not general-purpose faculties that apply uniformly across task complexity levels. They are competencies that are reliable under certain surface conditions and degrade under others. The surface conditions that trigger failure include long context, multiple competing mentions of similar cases, open-ended generation rather than constrained selection, and the absence of explicit temporal markers.
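
A rough sketch of how those two surface conditions could be probed side by side; the prompt templates below are illustrative assumptions, not the benchmark's actual format:

```python
def constrained_prompt(case_a: str, year_a: int, case_b: str, year_b: int) -> str:
    # Simple format: short context, explicit temporal markers, forced-choice answer.
    return (
        f"{case_a} was decided in {year_a}. {case_b} was decided in {year_b}. "
        "Which case came first? Answer with the case name only."
    )

def open_ended_prompt(document_text: str) -> str:
    # Complex format: long context, many competing case mentions, free generation,
    # and no explicit dates supplied alongside the question.
    return (
        "Read the following opinion and list every overruling relationship it "
        "establishes, one per line, in the form 'X overruled Y'.\n\n" + document_text
    )
```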

The mechanistic story is plausible given what we know from Does LLM grammatical performance decline with structural complexity?: LLM competence tracks the training data distribution, and complex open-ended tasks require integrating information across longer contexts in ways that may be poorly represented in training. The model falls back on frequency-weighted heuristics, whichever pattern appears most often in training cases of this type, rather than maintaining explicit reasoning through the complexity.

This has a direct connection to Why do language models struggle with historical legal cases?: both failure modes compound in the overruling task. Historical cases are more frequent targets of temporal confusion because they appear less reliably in training, and the complex open-ended format exacerbates the failure. The errors documented (extracting only one overruled case when multiple exist, confusing mentioned cases with overruled cases, hallucinating non-existent case citations) are consistent with surface-pattern matching rather than structured reasoning.
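
Those three documented error types map onto a simple bucketing of predicted versus gold overruled cases. A sketch, assuming we also know every case merely mentioned in the document; the function and field names are hypothetical:

```python
def categorize_errors(predicted: set[str],
                      gold_overruled: set[str],
                      mentioned_cases: set[str]) -> dict[str, set[str]]:
    """Bucket a model's predicted overruled cases against the gold annotation."""
    return {
        "correct": predicted & gold_overruled,
        # Extracting only one overruled case when multiple exist shows up here.
        "missed": gold_overruled - predicted,
        # Confusing a case that is merely cited with a case that was overruled.
        "confused_mention": (predicted - gold_overruled) & mentioned_cases,
        # Citations that appear nowhere in the document: hallucinated.
        "hallucinated": (predicted - gold_overruled) - mentioned_cases,
    }
```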

The implication for deployment: task complexity — not just domain coverage — must be treated as a reliability variable. Simple diagnostic tests in controlled settings will overestimate performance in complex production conditions. This connects to When does explicit reasoning actually help model performance?: legal overruling identification in complex documents is precisely a "continuous nuanced judgment" task — multiple competing precedents, contextual interpretation, no clear derivation path — where forcing explicit reasoning chains degrades rather than helps.
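
One way to operationalize "complexity as a reliability variable" is to avoid reporting a single accuracy number and instead stratify by task format. A sketch, with the result fields assumed rather than taken from any particular evaluation harness:

```python
from collections import defaultdict

def accuracy_by_format(results: list[dict]) -> dict[str, float]:
    # Each result is assumed to carry 'format' (e.g. 'structured_qa' or
    # 'open_ended_document') and 'correct' (bool).
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["format"]] += 1
        hits[r["format"]] += int(r["correct"])
    return {fmt: hits[fmt] / totals[fmt] for fmt in totals}
```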


Source: Domain Specialization

