Reasoning Can Hurt the Inductive Abilities of Large Language Models
Large Language Models (LLMs) have shown remarkable progress across domains, yet their ability to perform inductive reasoning—inferring latent rules from sparse examples—remains limited. It is often assumed that chain-of-thought (CoT) prompting, as used in Large Reasoning Models (LRMs), enhances such reasoning. We investigate this assumption by constructing four controlled, diagnostic game-based tasks—chess, Texas Hold’em, dice games, and blackjack—each governed by hidden, human-defined rules. We find that CoT reasoning can degrade inductive performance, with LRMs often underperforming their non-reasoning counterparts.
To explain this, we present a theoretical framework that reveals how reasoning steps can amplify error through three failure modes: incorrect sub-task decomposition, incorrect sub-task solving, and incorrect final answer summarization. Based on our theoretical and empirical analysis, we introduce structured interventions that adapt CoT generation according to the identified failure types. These interventions improve inductive accuracy without retraining. Our findings suggest that effective CoT reasoning depends not only on taking more steps but also on ensuring that those steps are well structured.
In this work, we investigate the inductive performance of LLMs and LRMs (Fig. 1). We introduce a set of controlled diagnostic game-based tasks to isolate inductive reasoning behavior in LLMs. In each task, models are presented with a short transcript of gameplay—without access to the underlying rules—and must infer the latent constraints governing legal moves and outcomes. Surprisingly, we find that LRMs often underperform non-reasoning LLMs in these settings, suggesting that CoT reasoning may introduce noise rather than clarity. We also develop a theoretical framework that explains this degradation.
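To make the task format concrete, the following is a minimal, hypothetical sketch of how one such diagnostic instance could be constructed. The chosen game, the hidden rule, and the function names (roll_is_legal, build_transcript) are illustrative assumptions, not the exact protocol of our benchmark.

    import random

    # Hypothetical hidden rule for a dice game: a roll is accepted only if the
    # two dice do NOT sum to 7 (the latent constraint the model must induce).
    def roll_is_legal(d1: int, d2: int) -> bool:
        return (d1 + d2) != 7

    def build_transcript(n_rounds: int = 12, seed: int = 0) -> str:
        """Generate a gameplay transcript that shows outcomes but never states the rule."""
        rng = random.Random(seed)
        lines = []
        for i in range(1, n_rounds + 1):
            d1, d2 = rng.randint(1, 6), rng.randint(1, 6)
            verdict = "accepted" if roll_is_legal(d1, d2) else "rejected"
            lines.append(f"Round {i}: player rolls ({d1}, {d2}) -> move {verdict}")
        return "\n".join(lines)

    # The model sees only the transcript plus an open-ended question; it must
    # infer the latent constraint (here, "a sum of 7 is forbidden") from examples.
    prompt = build_transcript() + "\n\nWhat hidden rule decides whether a roll is accepted?"
    print(prompt)

A predicted rule can then be scored against the ground-truth constraint, for example by checking whether it classifies a held-out set of rolls identically to roll_is_legal.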
Our work addresses the following research questions:
RQ1: How well do LLMs perform on inductive reasoning tasks, and has this improved with recent models? (Section 3)
RQ2: Why does reasoning sometimes fail—or even hurt—inductive performance? (Section 4)
RQ3: How can we guide reasoning to enhance inductive accuracy without retraining the model? (Section 5)
3.3 Inductive Abilities Analysis
Fig. 2 reports rule-wise inductive accuracy for eight models across the four games. Although reasoning-enabled models are designed for multi-step inference, they consistently perform worse than their non-reasoning counterparts on special rules—a pattern observed across all four domains.
On normal rules, most models exceed 90% accuracy, indicating strong pattern recognition when the rule is surface-aligned or structurally obvious. In contrast, performance on special rules drops significantly. For example, in chess, non-reasoning models like GPT-4o and DeepSeek-V3 reach 55–65% on SR1, while their reasoning counterparts fall below 25%. Similar gaps appear in Texas Hold’em, Dice, and Blackjack (Fig. 2).
These results show that reasoning models struggle more with exception-based or hidden rules. This suggests that their multi-step traces may not help—and can introduce incorrect assumptions or misleading intermediate steps. We examine this hypothesis in detail in RQ2 by analyzing the reasoning outputs directly.
4.3 Empirical Analysis of Reasoning Errors
We empirically validate the theoretical taxonomy introduced in Section 4.1, which identifies three primary sources of reasoning failure: (1) Incorrect Sub-task Decomposition (Breakdown Error), arising from a misaligned decomposition of the problem into sub-tasks; (2) Incorrect Sub-task Solving (Solving Error), caused by noise in sub-task resolution; and (3) Incorrect Final Answer Summarization (Summary Error), resulting from premature or excessive reasoning steps. These failure modes are grounded in the dynamics of the belief update equation (4), where errors propagate via suboptimal alignment (α_k), additive noise (ε_k), and a misjudged stopping time N.
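For readers without Section 4.1 at hand, one plausible form consistent with these symbols is sketched below; the precise equation (4) is defined in Section 4.1, and the linear-update form, the correction term Δ_k, and the readout map shown here are assumptions made only for illustration.

    \[
      b_{k+1} \;=\; b_k + \alpha_k\,(\Delta_k + \varepsilon_k), \qquad k = 0,\dots,N-1,
      \qquad \hat{y} \;=\; \mathrm{readout}(b_N)
    \]
    % b_k: belief state after k reasoning steps; \Delta_k: ideal correction from the k-th sub-task;
    % \alpha_k: alignment of the generated sub-task with \Delta_k; \varepsilon_k: solving noise; N: stopping time.

Under such a form, poor decomposition manifests as low alignment α_k (Breakdown Error), noisy sub-task answers as large ε_k (Solving Error), and a mischosen N as reading out the belief too early or too late (Summary Error).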
Among these, Solving Error dominates across all models and tasks, accounting for over 80% of failure cases. While theoretically modeled as additive noise, solving failures often exhibit structured patterns in practice. Based on consistent error patterns observed in over 100 failed reasoning traces across multiple tasks and models, we classify these errors into three observable subtypes: (1) Math Overuse, where models inappropriately apply arithmetic operations to symbolic inputs (e.g., card suits or chess pieces); (2) Overgeneralization, where rules are inferred from few examples without proper validation; and (3) Hallucinated Rules, where fabricated constraints are introduced without support from the input observations. Representative examples of each subtype are provided in Appendix H.
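To illustrate how such subtypes could be flagged at scale, below is a minimal, hypothetical keyword-based tagger; the regex patterns and the function name tag_solving_error are assumptions for illustration, not the annotation procedure used in our analysis.

    import re

    # Hypothetical heuristic tagger for the three solving-error subtypes.
    # These regexes only flag candidate spans; real traces still need human review.
    SUBTYPE_PATTERNS = {
        "math_overuse": re.compile(r"(sum|add|multiply|divide|\+|\*|=)\s", re.I),
        "overgeneralization": re.compile(r"\b(always|never|every|all)\b", re.I),
        "hallucinated_rule": re.compile(r"\b(rule states|it is known|by definition)\b", re.I),
    }

    def tag_solving_error(trace: str) -> list[str]:
        """Return the subtypes whose patterns appear in a failed reasoning trace."""
        return [name for name, pat in SUBTYPE_PATTERNS.items() if pat.search(trace)]

    print(tag_solving_error("The suits sum to 21, so every face card must always be illegal."))
    # -> ['math_overuse', 'overgeneralization']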
Breakdown Errors are less frequent but still consequential, especially in structurally complex games such as Texas Hold’em. These correspond to misaligned sub-task decomposition, where the model fixates on irrelevant features or ignores the core inductive structure. Summary Errors are the least frequent and occur when models produce overly long or overly short reasoning chains, diverging from the optimal depth N⋆ identified in Theorem 4.1.
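As a purely illustrative toy model of this depth effect (not the setting of Theorem 4.1), the snippet below simulates a belief that gains diminishing-returns signal with each step while also accumulating noise, so that the expected error is minimized at an intermediate depth; all constants are arbitrary assumptions.

    import numpy as np

    # Toy illustration: each reasoning step adds partial signal plus noise, so the
    # expected final error first falls and then rises, yielding an intermediate optimum.
    def expected_error(n_steps: int, signal_per_step: float = 0.3,
                       noise_std: float = 0.15, trials: int = 20000,
                       seed: int = 0) -> float:
        rng = np.random.default_rng(seed)
        target = 1.0
        # Belief after n steps: diminishing-returns signal plus accumulated per-step noise.
        signal = target * (1.0 - (1.0 - signal_per_step) ** n_steps)
        noise = rng.normal(0.0, noise_std, size=(trials, n_steps)).sum(axis=1)
        return float(np.mean(np.abs(signal + noise - target)))

    errors = {n: expected_error(n) for n in range(1, 13)}
    best_n = min(errors, key=errors.get)
    print(f"toy optimal depth N* = {best_n}")  # typically a small intermediate value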