Why do LLMs handle causal reasoning better than temporal reasoning?
Exploring whether language models perform asymmetrically on different discourse relations and what training data patterns might explain the gap between causal and temporal reasoning abilities.
From the same discourse relations study: ChatGPT shows strong performance on causal relations — outperforming fine-tuned RoBERTa on two out of three benchmarks — while struggling with temporal order between events.
The explanation the researchers offer is that the difficulty with temporal order "could be attributed to inadequate human feedback on this feature during the model's training process." A more fundamental explanation is that causal language is pervasive and explicitly marked in text. Explanations, arguments, news articles, scientific writing: all of these use causal connectives ("because," "therefore," "leads to," "causes") extensively and consistently.
Temporal order, by contrast, is often implicit. We say "she went to the store and bought milk" without specifying whether the events are sequential, simultaneous, or ordered in some other way. The ordering must be inferred from context, world knowledge, and linguistic cues that are less reliable than causal connectives.
The result is a capability asymmetry that tracks how information is marked in training data: what is frequently and explicitly marked in text, LLMs learn to handle well; what is frequently left implicit, they struggle with.
This is a generalizable prediction: wherever human language uses explicit, consistent surface markers, LLMs will perform better than where the same information is conveyed implicitly. Causal > temporal is one instance of this pattern. The same logic should apply to other discourse relations, pragmatic inferences, and any semantic content that is typically left implicit in language.
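As a rough illustration of that prediction, even a trivial connective-matching heuristic separates the two cases: it recovers an explicitly marked causal relation but has nothing to latch onto when temporal order is implicit. The connective lists and example sentences below are illustrative assumptions, not materials from the study.

```python
# Toy illustration of the surface-marker asymmetry: explicit causal
# connectives can be picked up by trivial pattern matching, while
# implicit temporal order carries no comparable surface cue.
CAUSAL_CONNECTIVES = {"because", "therefore", "leads to", "causes", "as a result"}
TEMPORAL_CONNECTIVES = {"before", "after", "then", "while", "until"}

def surface_relation(sentence: str) -> str:
    """Guess the discourse relation from explicit connectives alone."""
    lowered = sentence.lower()
    if any(c in lowered for c in CAUSAL_CONNECTIVES):
        return "causal"
    if any(c in lowered for c in TEMPORAL_CONNECTIVES):
        return "temporal"
    return "unknown"  # no surface marker: the relation must be inferred

examples = [
    "The road was icy, therefore the bus was late.",  # explicit causal marker
    "She went to the store and bought milk.",         # temporal order left implicit
]

for s in examples:
    print(f"{surface_relation(s):8s} <- {s}")
```

The heuristic labels the first sentence "causal" and returns "unknown" for the second, which is the asymmetry in miniature: a model trained on surface statistics gets abundant, consistent signal for the causal case and almost none for the implicit temporal one.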
Shared biases, not just relative performance: The picture becomes more complex when comparing LLM causal reasoning not just against benchmarks but against human performance on the same tasks. "Do LLMs Reason Causally Like Us?" finds that on collider network reasoning (C1 → E ← C2), LLMs exhibit the same biases as humans: Markov violations (treating independent causes as positively correlated) and weak explaining away (observing one cause reduces belief in the other cause by less than is normatively warranted). LLMs are not categorically worse at causal reasoning; they err in the same direction, likely because training data was produced by humans with these same biases. See Do large language models make the same causal reasoning mistakes as humans?.
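For reference, the normative answers that "Markov violation" and "weak explaining away" are measured against can be computed exactly in a small collider network. The sketch below uses a noisy-OR parameterization with arbitrary illustrative numbers, not values from the cited paper.

```python
from itertools import product

# Normative reasoning in a collider network C1 -> E <- C2, using a
# noisy-OR likelihood. Priors and causal strengths are illustrative.
P_C1, P_C2 = 0.3, 0.3          # prior probability that each cause is present
W1, W2, LEAK = 0.8, 0.8, 0.05  # causal strengths and background leak

def p_effect(c1: int, c2: int) -> float:
    """Noisy-OR probability that the effect E occurs given the causes."""
    return 1 - (1 - LEAK) * (1 - W1) ** c1 * (1 - W2) ** c2

def joint(c1: int, c2: int, e: int) -> float:
    """Joint probability of one full assignment of (C1, C2, E)."""
    p_e = p_effect(c1, c2)
    return ((P_C1 if c1 else 1 - P_C1)
            * (P_C2 if c2 else 1 - P_C2)
            * (p_e if e else 1 - p_e))

def cond(query: dict, given: dict) -> float:
    """P(query | given) by brute-force enumeration over the three variables."""
    num = den = 0.0
    for c1, c2, e in product((0, 1), repeat=3):
        world = {"c1": c1, "c2": c2, "e": e}
        p = joint(c1, c2, e)
        if all(world[k] == v for k, v in given.items()):
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den

# Markov property: with E unobserved, the causes are independent,
# so these two values are equal (no correlation between C1 and C2).
print(cond({"c2": 1}, {}), cond({"c2": 1}, {"c1": 1}))

# Explaining away: given E=1, learning C1=1 lowers the probability of C2=1,
# so the second value is smaller than the first.
print(cond({"c2": 1}, {"e": 1}), cond({"c2": 1}, {"e": 1, "c1": 1}))
```

The findings summarized above are deviations from exactly these quantities: humans and LLMs report a spurious correlation where the first pair of numbers says there is none, and shrink their belief in the second cause by less than the gap between the second pair of numbers warrants.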
Source: Discourses
Related concepts in this collection
- Why does ChatGPT fail at implicit discourse relations? ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships? (The same training-data surface-distribution pattern, at the level of individual discourse relations.)
- Can models pass tests while missing the actual grammar? Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures. (A structural parallel: surface regularity drives performance.)
Original note title: causal reasoning is stronger than temporal reasoning in LLMs because causal patterns dominate training data