A Comprehensive Evaluation of Inductive Reasoning Capabilities and Problem Solving in Large Language Models

Paper · Source
Reasoning by Reflection · Reasoning Critiques

Inductive reasoning is fundamental to both human and artificial intelligence. This research evaluates the inductive reasoning abilities of current Large Language Models (LLMs). We argue that considering only the induction of rules is too narrow and unrealistic, since inductive reasoning is usually intertwined with other abilities, such as applying rules, validating results and rules, and integrating updated information. We probed LLMs with a set of designed symbolic tasks and found that even state-of-the-art (SotA) LLMs fail significantly, showing that LLMs cannot perform these intuitively simple tasks. Furthermore, we found that perfect accuracy on a small-size problem does not guarantee the same accuracy on a larger-size version of the same problem, raising the question of how we can assess the LLMs' actual problem-solving capabilities. We also argue that Chain-of-Thought prompts help LLMs by decomposing the problem-solving process, yet what the LLMs learn from them remains limited.

Evaluation at a fundamental level, e.g., the symbolic level, is needed to accurately understand the reasoning abilities of LLMs.

As inductive reasoning is based on finite observations, which may contain only partial information, we cannot always expect the induced rules or results to be fully correct. Therefore, in the real world, beneath the surface of rule induction, the ability to validate induced rules and results and to merge new rules with previous ones is equally important, and such adaptability to changing circumstances matters for building AI models suitable for real-world use. To evaluate these abilities, we designed three symbolic tasks: 1) grouping polygons (Grouping Polygons), 2) ordering named colors (Color Ordering), and 3) shifting characters in English text (Character Mapping).
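To make the task format concrete, below is a minimal sketch of how a Character Mapping instance could be constructed. The fixed-offset (Caesar-style) shift rule and the function names are illustrative assumptions, not the paper's exact construction.

```python
def shift_char(c: str, offset: int) -> str:
    """Shift a single letter through the alphabet; leave non-letters unchanged."""
    if c.islower():
        return chr((ord(c) - ord('a') + offset) % 26 + ord('a'))
    if c.isupper():
        return chr((ord(c) - ord('A') + offset) % 26 + ord('A'))
    return c

def make_character_mapping_instance(text: str, offset: int = 3) -> dict:
    """Build one Character Mapping example: the model must induce the hidden
    shift rule from (input, output) pairs and apply it to new text."""
    return {"input": text, "output": "".join(shift_char(c, offset) for c in text)}

# Few-shot observations the model would see, plus a held-out query.
observations = [make_character_mapping_instance(s) for s in ["hello world", "induction"]]
query = make_character_mapping_instance("large language models")
print(observations)
print(query)
```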

We then define 3×5 experiments (five settings for each of the three tasks), called Rules Application, Rules Induction, Results Validation, Rules Validation, and Rules Incorporation, to evaluate the ability to apply rules, induce rules, validate induced results and rules, and merge new rules with previous ones, as depicted in Figure 1. We observe the LLMs failing on these tasks.
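The snippet below is a rough illustration of how the five settings could differ when framed as prompts for a single task (here Character Mapping). The template wording is our own assumption for illustration; the paper's actual prompts in Figure 1 may differ.

```python
# Hypothetical prompt templates for the five experimental settings,
# instantiated for the Character Mapping task.
PROMPTS = {
    "rules_application": (
        "Rule: shift every letter forward by 3 positions.\n"
        "Apply the rule to: {query}"
    ),
    "rules_induction": (
        "Observations: {examples}\n"
        "State the rule that maps each input to its output."
    ),
    "results_validation": (
        "Observations: {examples}\n"
        "Is the pair ({candidate_in} -> {candidate_out}) consistent with them? Answer yes or no."
    ),
    "rules_validation": (
        "Observations: {examples}\n"
        "Does the rule '{candidate_rule}' explain all of them? Answer yes or no."
    ),
    "rules_incorporation": (
        "Previously induced rule: {old_rule}\n"
        "New observations: {new_examples}\n"
        "Update the rule so that it covers both the old and the new observations."
    ),
}

prompt = PROMPTS["rules_induction"].format(examples="hello -> khoor; abc -> def")
print(prompt)
```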

  1. For the evaluated LLMs, performance varies greatly between different experiments. This unstable performance on symbolic inductive reasoning tasks contrasts with their stable and robust performance on NLP tasks. Beyond the instability, task accuracy is low even for SotA LLMs, illustrating the weakness of LLMs on symbolic reasoning tasks.

  2. In addition to low accuracy in Rules Induction and Rules Application, LLMs also perform poorly in Results/Rules Validation and Rules Incorporation. This suggests that, beyond the accuracy of LLMs, attention should also be paid to their ability to validate and check the results they generate.

  3. LLMs can learn from few-shot examples and generalize beyond them, but they still fail to learn scalable solutions from the examples, even when the problem-solving procedure is decomposed through Chain-of-Thought (CoT) prompting.

  4. While the LLMs may solve small-sized problems perfectly, accuracy drops drastically as the problem size increases. This raises the question, "How can we show that the LLM really holds a solution for a specific type of problem?" A sketch of such a size-scaling probe follows this list.
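As a sketch of how findings 3 and 4 could be probed, the snippet below sweeps the problem size for the Character Mapping task with a CoT-style prompt and records exact-match accuracy per size. The `query_llm` function is a placeholder for whatever model API is under evaluation, and defining problem size as input length is our assumption, not necessarily the paper's setup.

```python
import random
import string

def random_text(length: int) -> str:
    """Random lowercase string of the requested length (our notion of problem size)."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

def caesar(text: str, offset: int = 3) -> str:
    """Ground-truth answer: shift every lowercase letter forward by `offset`."""
    return "".join(chr((ord(c) - ord('a') + offset) % 26 + ord('a')) for c in text)

def query_llm(prompt: str) -> str:
    """Placeholder: replace with a real call to the model under evaluation."""
    raise NotImplementedError

def accuracy_by_size(sizes=(5, 10, 20, 40), trials=20) -> dict:
    """Exact-match accuracy as a function of input length."""
    results = {}
    for n in sizes:
        correct = 0
        for _ in range(trials):
            text = random_text(n)
            prompt = (
                "Shift every letter forward by 3 positions. "
                "Think step by step, then give only the final answer.\n"
                f"Input: {text}\nOutput:"
            )
            if query_llm(prompt).strip() == caesar(text):
                correct += 1
        results[n] = correct / trials
    return results
```

A flat accuracy curve across sizes would be evidence that the model holds a general solution; the paper's observation is that accuracy instead falls off sharply as the size grows.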