Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

Paper · arXiv 2508.04699 · Published August 6, 2025
Reasoning Critiques · Reasoning Methods · CoT · ToT · Knowledge Graphs · Flaws

The emergence of reasoning models and their integration into practical AI chatbots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require a complex, multi-step thought process. Yet a complete understanding of why these models hallucinate more than general-purpose language models is missing. In this investigative study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved ("hops"), completeness in capturing relevant information ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns often hidden by accuracy-centric evaluations. This investigative approach provides deeper insights into the cognitive limitations of current models and offers actionable guidance toward enhancing reasoning fidelity, transparency, and robustness in future language modeling efforts.

The traditional evaluation metrics employed in these tasks, such as final answer accuracy or the F1 score, fail to distinguish between genuine multi-step inference, simple memorization (as exposed by counterfactual benchmarks such as CofCA; Wu et al., 2025), and over-reliance on dataset artifacts. Moreover, emerging studies (Sakarvadia, 2024; Agarwal et al., 2024) show that errors may stem from missing knowledge recall, misinterpretation of question intent, or retrieval failures in retrieval-augmented settings.
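To make the limitation concrete, here is a minimal sketch of the two answer-level metrics, exact match and token-level F1, in the standard SQuAD style; the function names are ours. Because both operate on the final answer string alone, a memorized answer and a genuinely multi-hop one are indistinguishable.

```python
# Answer-level metrics: neither looks at the reasoning trace, so they cannot
# tell genuine multi-hop inference apart from memorization or artifacts.
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A memorized answer and a reasoned one both score 1.0 here.
print(token_f1("the Eiffel Tower", "The Eiffel Tower"))  # 1.0
```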

With these limitations in mind, we move beyond answer correctness and undertake an investigative exploration of reasoning failures in multi-hop QA to answer a central question: How and why do reasoning models break down when stitching together information across multiple sources? To address this, we introduce a diagnostic framework that decomposes reasoning behavior along three core dimensions: (1) Hops: a hop is a discrete step or transition in the reasoning process where the model moves from one piece of information (e.g., a fact, source, or knowledge-base entry) to another in order to bridge connections and form a complete answer; (2) Coverage: whether all necessary reasoning steps are covered; and (3) Overthinking: whether the model meanders into unnecessary or off-track reasoning. These dimensions support both qualitative annotation and targeted quantitative evaluation of reasoning fidelity.
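As a concrete illustration of the framework, the sketch below models an annotation record for a single reasoning trace along the three dimensions; the Hop and TraceAnnotation records and their field names are our illustrative assumptions, not a released schema.

```python
# One annotated reasoning trace, decomposed along the three dimensions.
from dataclasses import dataclass, field

@dataclass
class Hop:
    source_doc: str   # document or knowledge-base entry this step draws on
    claim: str        # the bridging fact asserted at this step
    is_correct: bool  # whether the step matches the gold reasoning path

@dataclass
class TraceAnnotation:
    question: str
    hops: list[Hop] = field(default_factory=list)  # dimension 1: the hop chain
    coverage: float = 0.0                          # dimension 2: share of gold documents used
    overthinking: bool = False                     # dimension 3: unnecessary or off-track reasoning
```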

Stage 1: Coarse Conceptual Labels

Our initial taxonomy used four loosely defined labels: Effective, Underthinking, Overthinking, and Faulty. These arose from manual trace inspection but lacked clear definitions. Annotators struggled to distinguish between concise reasoning and underthinking, or between verbose, incorrect reasoning and overthinking. The lack of a formal notion of reasoning hops made error tracing difficult. Faulty served as a catch-all for various errors, reducing analytical usefulness.
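For reference, the Stage 1 label set can be written out as a simple enum; the value strings are ours, and the comments summarize the ambiguities described above.

```python
# The four coarse Stage 1 labels, with the reported ambiguity pairs noted.
from enum import Enum

class Stage1Label(Enum):
    EFFECTIVE = "effective"          # concise and correct reasoning
    UNDERTHINKING = "underthinking"  # hard to separate from merely concise traces
    OVERTHINKING = "overthinking"    # hard to separate from verbose, incorrect traces
    FAULTY = "faulty"                # catch-all, hence limited analytical value
```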

Stage 2: Structured Hop-Based Categorization

In the second stage, we introduced a 10-category taxonomy based on N_model, N_gold, hop correctness, and answer accuracy to support structured error analysis. As manual evaluation scaled, new ambiguities emerged. Category 8 (early hallucinations) often overlapped with Category 6 (underspecified chains) and with question misinterpretation. Annotators also struggled to distinguish shortcut reasoning from flawed logic. These overlaps revealed that even structurally driven categories needed stronger semantic clarity.
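The four structural signals the taxonomy keys on can be captured in a small record; the StructuralSignature name and the example reading below are our assumptions.

```python
# The structural signals Stage 2 categories are derived from.
from typing import NamedTuple

class StructuralSignature(NamedTuple):
    n_model: int          # hops the model actually executed
    n_gold: int           # hops the gold reasoning chain requires
    hops_correct: bool    # do the executed hops match the gold path?
    answer_correct: bool  # is the final answer right?

# Example: more hops than gold, all correct, right answer; structurally this
# points at trailing irrelevance rather than flawed logic.
sig = StructuralSignature(n_model=4, n_gold=3, hops_correct=True, answer_correct=True)
```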

Overthinking: This marker captures indicators of cognitive inefficiency in the model's reasoning. It is applied when: 1) the model includes non-essential information from gold documents, such as background details, tangential facts, or calculations that do not aid in progressing the reasoning chain; and 2) the model demonstrates repetitive or circular behavior, such as repeatedly checking the same entity or relation more than twice.

Coverage: This marker addresses the completeness of source-document utilization, specifically evaluating whether the model successfully retrieves all necessary source documents. Low coverage indicates gaps in retrieval or attention, leading to incomplete reasoning chains or unsupported conclusions.
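The mechanical parts of these two markers lend themselves to automated checks. Below is a minimal sketch, assuming each hop records the documents it cites and the entities or relations it examines; the first overthinking criterion (non-essential detail drawn from gold documents) still requires semantic judgment, so only the repetition test is automated here. Helper names are ours.

```python
# Automated proxies for the Coverage marker and the repetition criterion
# of the Overthinking marker.
from collections import Counter

def coverage(used_docs: set[str], gold_docs: set[str]) -> float:
    """Coverage: fraction of gold source documents the trace actually uses."""
    return len(used_docs & gold_docs) / len(gold_docs) if gold_docs else 1.0

def is_circular(checked_items: list[str]) -> bool:
    """Overthinking, criterion 2: the same entity or relation is checked
    more than twice over the course of the trace."""
    return any(count > 2 for count in Counter(checked_items).values())
```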

Reasoning Categories and Definitions

N_model = N_gold
- Fully Correct Hops: The model executes the exact number of required gold reasoning hops, and each hop is logically sound, complete, and correct.
- Partially Correct Hops: The model executes the correct number of reasoning steps, but one or more hops involve incorrect documents, entities, or relations; the model's reasoning is partially misaligned with the gold reasoning path.

N_model < N_gold
- The model executes fewer hops than required, yet all executed reasoning steps are correct and directly correspond to a subset of the required hops. This indicates incomplete but partially correct reasoning.
- The model executes fewer reasoning steps than required, omitting essential hops and introducing incorrect hops within the shortened chain. The reasoning is both incomplete and partially incorrect.

N_model > N_gold
- Trailing Irrelevance: The model initially executes all required reasoning steps but then continues with additional irrelevant hops. These extra steps occur after completing the required reasoning and reflect the model's extraneous elaboration.
- Early Irrelevance: The model introduces irrelevant reasoning steps before or interspersed among the required hops. These interruptions disrupt logical reasoning progression, resulting in confusion, distraction, or circular reasoning; the required reasoning steps may be partially addressed or incorrect.
- Question Misinterpretation: The model misunderstands the original question during its early reasoning steps, often focusing on incorrect entities or setting up the wrong task, leading to fundamentally flawed reasoning.
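Read procedurally, the groups above fall out of comparing hop counts, with hop correctness and the position of any surplus hops refining the category. The sketch below encodes that reading; the function name and the extra_hops_trailing flag are ours, and the two unnamed N_model < N_gold rows are returned as descriptive strings rather than invented labels.

```python
# Map the structural signals onto the category groups from the table.
def categorize(n_model: int, n_gold: int, hops_correct: bool,
               extra_hops_trailing: bool = True) -> str:
    if n_model == n_gold:
        return "Fully Correct Hops" if hops_correct else "Partially Correct Hops"
    if n_model < n_gold:
        return ("incomplete, but all executed hops correct" if hops_correct
                else "incomplete and partially incorrect")
    # n_model > n_gold: where the surplus hops occur decides the category
    return "Trailing Irrelevance" if extra_hops_trailing else "Early Irrelevance"

# Question Misinterpretation is flagged separately from hop counts: it applies
# when the early hops target the wrong entities or set up the wrong task.
```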