Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

Paper · arXiv 2404.01869 · Published April 2, 2024
Reasoning Architectures · Philosophy Subjectivity · Argumentation

Large language models (LLMs) have recently shown impressive performance on tasks involving reasoning, leading to a lively debate on whether these models possess reasoning capabilities similar to those of humans. Despite these successes, however, the depth of LLMs’ reasoning abilities remains uncertain. This uncertainty partly stems from the predominant focus on task performance, measured through shallow accuracy metrics, rather than a thorough investigation of the models’ reasoning behavior. This paper seeks to address this gap by providing a comprehensive review of studies that go beyond task accuracy, offering deeper insights into the models’ reasoning processes. Furthermore, we survey prevalent methodologies for evaluating the reasoning behavior of LLMs, emphasizing current trends and efforts towards more nuanced reasoning analyses. Our review suggests that LLMs tend to rely on surface-level patterns and correlations in their training data, rather than on sophisticated reasoning abilities.
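To make the contrast concrete, consider the difference between an accuracy-only harness, which scores just the final answer, and a behavior-level evaluation, which also inspects the intermediate steps of a model’s response. The sketch below is purely illustrative: the item format, the query_model stub, and the substring-based step check are hypothetical stand-ins of ours, not a protocol defined in the survey.

```python
# Illustrative contrast between a shallow accuracy metric and a
# step-level check of reasoning behavior. All names here are
# hypothetical; nothing below is an API from the surveyed work.

def query_model(prompt: str) -> str:
    """Stand-in for an LLM call; returns the model's full response text."""
    raise NotImplementedError  # replace with a real model call

def accuracy_only(item: dict) -> bool:
    # Shallow metric: only the final answer is compared to the gold label.
    response = query_model(item["question"])
    return item["gold_answer"] in response

def behavior_level(item: dict) -> dict:
    # Beyond accuracy: also check whether each expected intermediate
    # step appears in the model's explanation.
    response = query_model(item["question"] + "\nExplain your steps.")
    found = [step in response for step in item["reference_steps"]]
    return {
        "answer_correct": item["gold_answer"] in response,
        "steps_matched": sum(found) / len(found),
    }

item = {
    "question": "A shop sells pens at 3 for $2. How much do 12 pens cost?",
    "gold_answer": "$8",
    # Intermediate results a faithful solution should pass through.
    "reference_steps": ["12 / 3 = 4", "4 * 2 = 8"],
}
```

A model can pass accuracy_only while failing behavior_level, for instance by pattern-matching the answer without producing the intermediate computation; this is precisely the failure mode that accuracy metrics alone cannot surface.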

Reasoning tasks are designed to elicit a system’s ability to draw conclusions relevant to the problem at hand. In this survey, we distinguish between core and integrated reasoning tasks. Core reasoning tasks assess fundamental reasoning skills in isolation: they typically target a single type of reasoning, such as logical, mathematical, or causal reasoning. Examples include syllogisms, basic arithmetic problems, and structured causal-graph predictions. Conversely, integrated reasoning tasks require the concurrent use of several reasoning types, thereby assessing a combination of fundamental reasoning skills; commonsense and scientific reasoning tasks are typical examples. Such problems often reflect the complex cognitive challenges humans encounter in everyday life and professional settings.
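The core/integrated distinction can be expressed as a simple tagging scheme: each benchmark item carries the reasoning types it exercises, and a core task carries exactly one tag. The schema and example items below are a hypothetical illustration of this taxonomy, not a data format from the survey.

```python
from dataclasses import dataclass

# Hypothetical schema for the core vs. integrated distinction; the
# field names and example items are ours, not a format from the survey.

@dataclass
class ReasoningItem:
    question: str
    answer: str
    reasoning_types: tuple[str, ...]  # skills the item exercises

    @property
    def is_core(self) -> bool:
        # A core task isolates exactly one reasoning type.
        return len(self.reasoning_types) == 1

syllogism = ReasoningItem(  # core: logical reasoning only
    question=("All mammals are animals. All whales are mammals. "
              "Are all whales animals?"),
    answer="Yes",
    reasoning_types=("logical",),
)

commonsense = ReasoningItem(  # integrated: several skills at once
    question=("Dana left her ice cream in a hot car for an hour. "
              "What will she find when she returns?"),
    answer="Melted ice cream",
    reasoning_types=("causal", "commonsense", "temporal"),
)

assert syllogism.is_core and not commonsense.is_core
```

Tagging items this way makes it easy to filter a benchmark down to a single skill when core abilities are the target of analysis.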