Do LLMs show stronger reasoning about causality than about temporal ordering?

This explores whether LLMs are genuinely better at reasoning about cause-and-effect than at reasoning about what happened in what order — and why that gap exists.

The corpus answers yes, but the reason is mundane, not magical. The cleanest finding is that ChatGPT excels at causal relations while stumbling on temporal order because causal connectives ("because," "therefore," "causes") are explicit and frequent in training text, whereas temporal order is usually left implicit and must be reconstructed from context (see "Why do LLMs handle causal reasoning better than temporal reasoning?"). So the asymmetry is not a sign that models possess a deeper grasp of causation. The training data hands them causal cues on a plate and makes them work for temporal ones.

The temporal weakness shows up vividly outside pure reasoning tasks too. When LLMs act as zero-shot rankers over a user's interaction history, they ignore sequence order by default, treating a list of past actions as an unordered bag rather than a timeline, and only recover that order-sensitivity when prompts explicitly foreground recency or supply in-context examples (see "Why do language models ignore temporal order in ranking?"). That's the same blind spot from a different angle: order is latent in the model but not activated unless something in the prompt points at it.
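The prompt-side fix described above can be sketched roughly as follows. This is a minimal illustration, not the prompt used in the cited work: `build_ranking_prompt`, its parameters, the item names, and the wording of the recency instruction are all hypothetical.

```python
from typing import List


def build_ranking_prompt(history: List[str], candidates: List[str],
                         emphasize_recency: bool = False) -> str:
    """Build a zero-shot ranking prompt from an interaction history.

    `history` is assumed to be ordered oldest-first. Every name and all
    prompt wording here is hypothetical, chosen only for illustration.
    """
    lines = ["The user interacted with these items:"]
    for position, item in enumerate(history, start=1):
        lines.append(f"{position}. {item}")
    if emphasize_recency and history:
        # Point explicitly at the ordering so the list is not read as an
        # unordered bag of past actions.
        lines.append(
            f"The list is in chronological order; the most recent item is "
            f"'{history[-1]}'. Weight recent items more heavily."
        )
    lines.append("Rank these candidates by likely interest:")
    lines.extend(f"- {candidate}" for candidate in candidates)
    return "\n".join(lines)


history = ["hiking boots", "tent", "espresso machine"]
candidates = ["coffee grinder", "sleeping bag"]
plain_prompt = build_ranking_prompt(history, candidates)
recency_prompt = build_ranking_prompt(history, candidates, emphasize_recency=True)
```

The two prompts differ only in the one added instruction, which is the point: nothing about the model changes, only whether the prompt draws attention to order.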

But here's the twist that complicates a clean "causal reasoning is strong" story: the causal competence is shakier than it looks. LLMs reproduce human causal *errors* exactly, including weak explaining-away and Markov violations in collider structures, which suggests they're matching the statistical patterns of how people talk about cause, not running a categorical causal engine (see "Do large language models make the same causal reasoning mistakes as humans?"). The same theme recurs more broadly: when researchers strip the familiar semantics out of a reasoning task, performance collapses even when the correct rules are sitting right there in context, because models lean on token associations and parametric commonsense rather than formal manipulation (see "Do large language models reason symbolically or semantically?"). Related work on entailment shows models predicting based on whether a hypothesis looks familiar rather than whether the premise actually supports it (see "Do LLMs predict entailment based on what they memorized?"). So the causal strength and the temporal weakness turn out to be governed by the same underlying mechanism: surface statistics, not structured inference.
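For readers unfamiliar with the two error types, here is a small worked example of what a normative reasoner would do in a collider (two independent causes, one shared effect). The priors, the causal strength, and the noisy-OR form are assumptions chosen for illustration, not values from the cited studies.

```python
from itertools import product

# Illustrative parameters (assumptions, not taken from any study).
P_C1 = 0.3       # prior probability that cause 1 is present
P_C2 = 0.3       # prior probability that cause 2 is present
STRENGTH = 0.8   # chance that a present cause produces the effect


def p_effect(c1: int, c2: int) -> float:
    """Noisy-OR: the effect occurs unless every present cause fails."""
    return 1.0 - (1.0 - STRENGTH * c1) * (1.0 - STRENGTH * c2)


def joint(c1: int, c2: int, e: int) -> float:
    """Joint probability of one full assignment in the collider C1 -> E <- C2."""
    prior = (P_C1 if c1 else 1 - P_C1) * (P_C2 if c2 else 1 - P_C2)
    pe = p_effect(c1, c2)
    return prior * (pe if e else 1 - pe)


def prob_c1(**evidence: int) -> float:
    """P(C1 = 1 | evidence), computed by brute-force enumeration."""
    numerator = denominator = 0.0
    for c1, c2, e in product((0, 1), repeat=3):
        world = {"c1": c1, "c2": c2, "e": e}
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(c1, c2, e)
        denominator += p
        if c1 == 1:
            numerator += p
    return numerator / denominator


prior_c1 = prob_c1()                        # no evidence
given_other_cause = prob_c1(c2=1)           # Markov condition: should equal the prior
given_effect = prob_c1(e=1)                 # effect observed
given_effect_and_c2 = prob_c1(e=1, c2=1)    # effect observed, other cause confirmed
```

With these numbers, observing the effect raises belief in cause 1 to about 0.60, and then learning that cause 2 is present drops it back to about 0.34. That drop is explaining away; "weak explaining-away" means the drop is smaller than the model prescribes. On the usual reading, a Markov violation in this structure means treating the two causes as informative about each other before the effect is observed, whereas here `given_other_cause` equals `prior_c1`.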

That shared diagnosis is exactly why a strand of the corpus argues for *not* asking the LLM to do causal reasoning directly at all. Architectures like Causal Reflection split the work apart: a formal dynamic causal model does the reasoning, and the LLM is demoted to translating between structured inference and natural language, precisely to sidestep the spurious-correlation failures that the bias findings expose (see "Can separating causal models from language models improve reasoning?"). Structural causal models similarly let LLMs propose and test hypotheses in simulation, reliably recovering the *direction* of effects even when they can't nail the magnitudes (see "Can structural causal models automate social science with language models?").
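The division of labor can be illustrated with a toy sketch. This is not the Causal Reflection implementation or any published API: `ChainModel`, `effect_direction`, and `render` are hypothetical names, the linear chain is an assumed stand-in for a formal causal model, and `render` is a plain template standing in for the LLM's language-rendering role.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainModel:
    """Toy structural causal model for the chain x -> m -> y (hypothetical)."""
    a: float  # structural coefficient: effect of x on m
    b: float  # structural coefficient: effect of m on y

    def outcome_under_do(self, x: float) -> float:
        """Value of y when x is set by intervention, do(x)."""
        m = self.a * x
        return self.b * m


def effect_direction(model: ChainModel, low: float = 0.0, high: float = 1.0) -> str:
    """The formal model does the reasoning: compare y under two interventions."""
    delta = model.outcome_under_do(high) - model.outcome_under_do(low)
    if delta > 0:
        return "increases"
    if delta < 0:
        return "decreases"
    return "does not change"


def render(direction: str) -> str:
    """Stand-in for the LLM: turn the structured result into a sentence."""
    return f"Raising x {direction} y."


model = ChainModel(a=2.0, b=-0.5)
statement = render(effect_direction(model))
```

Only the sign of the intervention contrast reaches the rendering step here, which mirrors the finding that direction is the part these setups recover reliably.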

One further point is easy to miss: causality itself isn't the ceiling. Even a perfect causal reasoner would miss much of how human reasoning works, since associative links, analogical mappings, and emotion-driven belief shifts all live outside the causal frame (see "Can causal models alone capture how humans actually reason?"). So the real story isn't "causal beats temporal." It's that LLMs are strongest wherever the training text makes the relationship explicit, and both causal and temporal performance are downstream of that single fact about what language puts on the surface versus what it leaves for the reader to infer.

Sources (8 notes)

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Why do language models ignore temporal order in ranking?

LLMs can extract preferences from interaction histories but disregard temporal order by default. Recency-focused prompts and in-context examples activate latent order-sensitivity, improving ranking without retraining.

Do large language models make the same causal reasoning mistakes as humans?

LLMs show weak explaining away and Markov violations in collider networks, matching human error patterns exactly. This suggests shared mechanisms rooted in training data statistics rather than a categorical inferiority in reasoning.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.

Can separating causal models from language models improve reasoning?

Causal Reflection separates causal reasoning into a formal dynamic model with a Reflect mechanism for revision, relegating the LLM to structured inference and language rendering. This architecture sidesteps asking LLMs to perform causal reasoning directly, addressing both spurious-correlation failures and RL's explanation gap.

Can structural causal models automate social science with language models?

LLMs guided by structural causal models can propose and test causal hypotheses across negotiation, bail, interview, and auction scenarios. Simulations reveal effect directions reliably but not magnitudes, making them useful for directional social science.

Can causal models alone capture how humans actually reason?

Causal belief networks excel at modeling causal reasoning but cannot represent associative links, analogical mappings, or emotion-driven belief shifts. The GenMinds framework itself acknowledges this as a tractable starting point rather than a complete theory.
