PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

Paper · arXiv 2508.09848 · Published August 13, 2025
Deep Research · Evaluations

We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character’s prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks: because the prequels are not part of the original story, assessing their plausibility typically requires searching for and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG, and in-domain training with state-of-the-art LLMs, as well as commercial DeepResearch services, all lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to a gap of over 30% in reasoning accuracy relative to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.
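At its core the task is a binary consistency judgment over a (narrative, character, prequel) triple. The sketch below shows one way such an instance and an in-context baseline could be wired up; the field names, prompt wording, and `llm` callable are illustrative assumptions, not the released format.

```python
from dataclasses import dataclass

@dataclass
class PreludeInstance:
    # Hypothetical schema for illustration; the released dataset may differ.
    book_text: str   # full canonical narrative (the long context)
    character: str   # character the prequel is written for
    prequel: str     # candidate prequel story to judge
    label: str       # gold label: "consistent" or "contradictory"

PROMPT = (
    "You are given the full text of a novel and a candidate prequel story "
    "for one of its characters. Decide whether the prequel is consistent "
    "with the canonical narrative.\n\n"
    "Novel:\n{book}\n\nCharacter: {character}\n\nPrequel:\n{prequel}\n\n"
    "Answer 'consistent' or 'contradictory':"
)

def judge(instance: PreludeInstance, llm) -> str:
    """In-context baseline: put the entire book in the prompt.

    `llm` is any callable mapping a prompt string to a completion string;
    real runs must fit within the model's context window.
    """
    prompt = PROMPT.format(
        book=instance.book_text,
        character=instance.character,
        prequel=instance.prequel,
    )
    answer = llm(prompt).lower()
    # Naive answer parsing; production code should constrain the output format.
    return "contradictory" if "contradict" in answer else "consistent"
```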

Second, our task encourages global reasoning. This is because (1) judging that a consistent prequel aligns with the canonical story typically requires aggregating evidence across the character’s entire arc; and (2) contradictory prequels often conflict with several scattered events, owing to the narrative structure of the original work. Empirically, our annotation analysis reveals that 88% of the examples in PRELUDE require non-local evidence to resolve.
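Because 88% of examples need non-local evidence, a retrieval baseline must gather and aggregate several non-adjacent excerpts rather than a single best passage. Below is a minimal sketch of such multi-chunk retrieval, assuming a generic `embed` function that returns unit-norm vectors; the paper's actual RAG configuration is not specified here.

```python
import numpy as np

def chunk(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split the narrative into overlapping character-level windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def retrieve_evidence(book_text: str, prequel: str, embed, k: int = 8) -> list[str]:
    """Return the k chunks most similar to the prequel.

    `embed` stands in for any off-the-shelf embedding model: it maps a list
    of strings to an (n, d) array of unit-norm vectors, so the dot product
    below is cosine similarity.
    """
    chunks = chunk(book_text)
    doc_vecs = embed(chunks)          # (n, d)
    query_vec = embed([prequel])[0]   # (d,)
    scores = doc_vecs @ query_vec
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in sorted(top)]  # restore narrative order
```

Restoring narrative order for the selected chunks keeps the evidence chronological for the downstream judgment prompt, mirroring how a human reader would cross-check scattered events.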

Finally, our task encourages deep reasoning, because the canonical story reflects non-immediate consequences of the prequels. To solve our task, LLMs must unfold the implications of a prequel and align them with the story, which often requires multi-step inference. For instance, the second example in Figure 1 requires reasoning that Faria was arrested while Napoleon was still emperor, and then inferring a contradiction from the fact that the Bourbon Restoration removed Napoleon from power. This kind of non-immediate causality resists shallow reasoning shortcuts that decompose the problem into subquestions.

3.2 WHY PREQUELS?

Our prequel entailment task is naturally both a long-context understanding task and a form of everyday research task. To solve it, a model must judge whether a character’s prequel remains consistent with that character’s behaviors and experiences throughout the narrative, performing counterfactual reasoning when necessary. These requirements make our task well suited to benchmarking long-context reasoning, owing to the following desirable properties:

• Natural long-context reasoning: The task requires holistic understanding of a narrative arc, including tracking a character’s psychological continuity, goals, and situational influences across temporally distant events.

• Cognitive research practices in daily life: While formal research is often confined to scientific domains, its core cognitive components, such as gathering evidence, forming hypotheses, and drawing conclusions, are deeply embedded in daily reasoning. Our task scenario mirrors this real-life cognition, as humans frequently make similar judgments while watching films, reading novels, or engaging in social interactions.

• Light dependency on background knowledge: The task requires little external or specialized knowledge. A reader with a full understanding of the story, even a middle-school student, can often make accurate judgments. As a result, the task emphasizes fluid intelligence rather than crystallized knowledge acquired through prior learning.