Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models

Paper · arXiv 2406.02061 · Published June 4, 2024

We demonstrate here a dramatic breakdown of the function and reasoning capabilities of state-of-the-art models trained at the largest available scales, which claim strong function, using a simple, short, conventional common-sense problem formulated in concise natural language and easily solvable by humans. The breakdown is dramatic because the models also express strong overconfidence in their wrong solutions, while often providing nonsensical "reasoning"-like explanations, akin to confabulations, to justify and back up the validity of their clearly failed responses and make them sound plausible. Various standard interventions aimed at obtaining the right solution, such as different types of enhanced prompting or urging the models to reconsider their wrong solutions via multi-step re-evaluation, fail. We take these initial observations to the scientific and technological community to stimulate an urgent re-assessment of the claimed capabilities of the current generation of LLMs…

To shed light on this situation, we introduce here a simple, short, conventional problem that is formulated in concise natural language and can be solved easily by humans. The original problem formulation, of which we will present various versions in our investigation, is as follows: "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?".
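The arithmetic behind the problem is trivial: any brother of Alice shares all of Alice's sisters and additionally has Alice herself as a sister. A minimal sketch of this reasoning (the function name is ours, not from the paper):

```python
def aiw_answer(n_brothers: int, m_sisters: int) -> int:
    """Correct answer to the AIW problem.

    Each of Alice's brothers has all of Alice's M sisters,
    plus Alice herself, as sisters. The number of brothers N
    is irrelevant to the answer.
    """
    return m_sisters + 1

# For instance, if Alice has 3 brothers and 6 sisters,
# each brother has 6 + 1 = 7 sisters.
print(aiw_answer(3, 6))  # -> 7
```

Note that N is a distractor: models that compute with it (e.g. answering M or N) reveal a failure to model the family relations rather than an arithmetic slip.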

We analyse the response statistics and observe a strong collapse of reasoning and an inability to answer the simple question formulated above across most of the tested models, despite their claimed strong reasoning capabilities. Notable exceptions are Claude 3 Opus and GPT-4, which occasionally manage to provide correct responses backed by correct reasoning, as evident in the structured step-by-step explanations these models deliver together with the solution. However, even Claude 3 Opus and GPT-4 still fail frequently to solve this simple problem across trials.
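The response statistics referred to above reduce to a per-model correct-response rate over repeated trials. A hedged sketch of such a computation (the function and the sample trial data are illustrative, not the paper's actual evaluation code or numbers):

```python
def correct_response_rate(responses: list[int], expected: int) -> float:
    """Fraction of trials in which a model returned the expected answer."""
    if not responses:
        return 0.0
    return sum(r == expected for r in responses) / len(responses)

# Hypothetical answers from one model over five trials of
# "Alice has 3 brothers and 6 sisters" (correct answer: 7):
trials = [7, 6, 7, 3, 7]
print(correct_response_rate(trials, expected=7))  # -> 0.6
```

Averaging this rate over many prompt variations gives the kind of per-model collapse statistics the paper reports.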