Do LLMs Truly Understand When a Precedent Is Overruled?
Large language models (LLMs) with extended context windows show promise for complex legal reasoning tasks, yet their ability to understand long legal documents remains insufficiently evaluated. Developing long-context benchmarks that capture realistic, high-stakes tasks is a persistent challenge: most existing evaluations rely on simplified synthetic tasks that fail to represent the complexity of real-world document understanding. Overruling relationships are foundational to common-law doctrine and common in judicial opinions; they provide a focused and important testbed for long-document legal understanding that closely resembles what legal professionals actually do. We present an assessment of state-of-the-art LLMs on identifying overruling relationships in U.S. Supreme Court cases using a dataset of 236 case pairs. Our evaluation reveals three critical limitations: (1) era sensitivity – models perform worse on historical cases than on modern ones, revealing a fundamental temporal bias in their training data; (2) shallow reasoning – models rely on surface-level logical heuristics rather than deep legal comprehension; and (3) context-dependent reasoning failures – models produce temporally impossible relationships in complex open-ended tasks despite maintaining basic temporal awareness in simple contexts. Our work contributes a benchmark that addresses the critical gap in realistic long-context evaluation, providing an environment that mirrors the complexity and stakes of actual legal reasoning tasks.
In the open-ended identification task, the models struggled to correctly identify the overruled case. The best-performing model, Gemini-Pro, achieved an accuracy of 73.31% (173 out of 236), closely followed by GPT-5 at 71.19% (168 out of 236); the other models performed worse. This suggests that even state-of-the-art models have difficulty understanding overruling relationships in a long legal context. The errors follow three patterns. First, when an overruling case overturns multiple precedents, models often extract only one of them, yielding a partially correct answer. Second, models commit confusion errors: they identify a case that is merely mentioned in the text rather than the precedent actually overruled. Third, models hallucinate, generating case names or citations that do not exist in the provided context.
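The three error patterns above can be operationalized as a simple scoring rule. The sketch below is illustrative, not our actual evaluation code; the function name and the exact-string-match assumption are ours, and a real scorer would need citation normalization.

```python
def classify_answer(predicted: str, gold_set: set, mentioned: set) -> str:
    """Classify a model's answer against the set of truly overruled
    precedents (gold_set) and the set of all cases cited anywhere in
    the opinion (mentioned). Illustrative helper, not the paper's code."""
    if predicted in gold_set:
        # Correct if it names the sole overruled case; only partially
        # correct if the opinion overturned multiple precedents.
        return "correct" if len(gold_set) == 1 else "partial"
    if predicted in mentioned:
        # Confusion error: a real case from the text, but not the one overruled.
        return "confusion"
    # Hallucination: a case absent from the provided context.
    return "hallucination"

# Toy example: the opinion overrules two precedents; the model names
# a different case that appears in the text.
gold = {"Plessy v. Ferguson", "Cumming v. Richmond County Bd. of Ed."}
mentioned = gold | {"Sweatt v. Painter"}
print(classify_answer("Sweatt v. Painter", gold, mentioned))  # confusion
```

Exact string matching is the simplifying assumption here; case names vary across citation formats, so any practical scorer would match on normalized citations instead.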
While GPT-5’s low abstention rate might suggest greater confidence in its reasoning, it could also reflect a tendency toward overconfident incorrectness. This trade-off highlights the need for more sophisticated uncertainty quantification mechanisms that can distinguish genuine epistemic uncertainty from mere comprehension gaps. The other models showed a significant drop in performance, with Gemini-Flash answering “unknown” in the majority of cases.
The high number of “unknown” responses in Task 2 is revealing, as shown in Figure 2. It suggests that most models are not simply guessing, but are actively assessing their own uncertainty. When the models are unable to reason and understand the case relationship embedded in the text, they are hesitant to make a definitive judgment. This is a desirable trait in a legal AI system, as it is preferable for a model to admit its own ignorance rather than to provide a confident but incorrect answer. However, the high rate of abstention in this task suggests that most models’ threshold for certainty is set too high, preventing them from making correct judgments even when the evidence is strong.
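The abstention trade-off discussed above is the standard selective-prediction setting: overall accuracy penalizes every “unknown,” while selective accuracy measures correctness only on attempted items. A minimal sketch, assuming answers and gold labels are parallel lists of strings (the function name is ours):

```python
def selective_metrics(answers: list, golds: list) -> dict:
    """Abstention rate, overall accuracy, and accuracy on attempted
    items for a model allowed to answer "unknown". Illustrative only."""
    attempted = [(a, g) for a, g in zip(answers, golds) if a != "unknown"]
    n_correct = sum(a == g for a, g in attempted)
    return {
        "abstention": 1 - len(attempted) / len(answers),
        "overall_accuracy": n_correct / len(answers),
        "selective_accuracy": n_correct / len(attempted) if attempted else 0.0,
    }

# Toy run: 4 items, one abstention, two correct attempts.
m = selective_metrics(["A", "unknown", "B", "C"], ["A", "B", "B", "D"])
print(m)  # abstention 0.25, overall 0.5, selective ~0.667
```

A model with a well-calibrated certainty threshold raises selective accuracy without letting abstention dominate; the pattern we observe suggests most models set that threshold too conservatively.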
We observed what we term “context-dependent temporal reasoning failures,” as illustrated in Table 4. These failures show that models can create temporally impossible relationships, such as suggesting that a 1914 case overruled a 1925 case, even when the case names include the years.
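This class of failure is mechanically checkable: an overruling case cannot predate the case it overrules. A minimal sketch, assuming case names carry a parenthesized year as in “Smith v. Jones (1914)” (the format and function names are illustrative assumptions):

```python
import re

def year_of(case_name: str):
    """Extract a four-digit year like '(1914)' from a case name;
    returns None if no such year is present. Illustrative helper."""
    m = re.search(r"\((1[6-9]\d{2}|20\d{2})\)", case_name)
    return int(m.group(1)) if m else None

def temporally_possible(overruling: str, overruled: str):
    """True iff the overruling case does not predate the overruled one;
    None when either year is missing and the check cannot be made."""
    y1, y2 = year_of(overruling), year_of(overruled)
    if y1 is None or y2 is None:
        return None
    return y1 >= y2

# The impossible relationship from the text: a 1914 case "overruling" a 1925 case.
print(temporally_possible("Smith v. Jones (1914)", "Doe v. Roe (1925)"))  # False
```

Such a validity filter could flag temporally impossible model outputs automatically, which is notable precisely because the models fail this check even when both years are visible in the case names.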