On the Limits of Innate Planning in Large Language Models
Large language models (LLMs) achieve impressive results on many benchmarks, yet their capacity for planning and stateful reasoning remains unclear. We study these abilities directly, without code execution or other tools, using the 8-puzzle: a classic task that requires state tracking and goal-directed planning while allowing precise, step-by-step evaluation. Four models are tested under common prompting conditions (Zero-Shot, Chain-of-Thought, Algorithm-of-Thought) and with tiered corrective feedback. Feedback improves success rates for some model–prompt combinations, but many successful runs are long, computationally expensive, and indirect. We then examine the models with an external move validator that, at each step, presents only the legal moves. Despite this level of assistance, none of the models solves any puzzle in this setting. Qualitative analysis reveals two dominant deficits across all models: (1) brittle internal state representations, leading to frequent invalid moves, and (2) weak heuristic planning, with models entering loops or selecting actions that do not reduce the distance to the goal state. These findings indicate that, in the absence of external tools such as code interpreters, current LLMs have substantial limitations in planning, and that further progress may require mechanisms for maintaining explicit state and performing structured search.
In recent years, Large Language Models (LLMs) have demonstrated state-of-the-art performance on an expanding range of tasks. Successive models exhibit increasingly sophisticated capabilities [1, 2], yet their evaluation still centers largely on mathematics and code generation benchmarks [3–5]. This narrow focus leaves the dynamic, multi-step processes of reasoning and planning under-examined. When these abilities are studied directly, progress is often far less convincing than in other domains [6–8]. These limitations hinder the deployment of LLMs in complex, real-world applications that demand robust planning and state tracking, such as autonomous agents [9].
Although LLMs can often solve such problems by writing or calling basic search code, our goal here is to isolate their intrinsic planning and state-tracking abilities without such tools. We therefore ask a concrete question: how well can LLMs plan and reason over an evolving state when they cannot rely on code execution or other external tools? To answer it, we use the standard 8-puzzle, a 3×3 sliding-tile game in which eight numbered tiles and one blank square must be rearranged into a goal configuration by repeatedly sliding an adjacent tile into the blank, and ask models to solve random configurations of the game. The task requires models to maintain an accurate representation of the board, obey strict move-validity constraints, and choose moves that eventually reach the goal state. We measure not only whether a puzzle is solved but also how and why runs terminate when they fail, in order to understand the limits of model performance on this type of task.
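To make the setup concrete, the following is a minimal Python sketch of a state representation and the move rules the models are asked to respect. It is illustrative only, not our evaluation harness; the encoding and names such as `valid_moves` and `apply_move` are our own.

```python
# Minimal 8-puzzle state model: the board is a tuple of nine entries,
# read row by row, with 0 marking the blank square.
GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)

# A move is expressed as the direction in which the blank travels.
OFFSETS = {"up": -3, "down": 3, "left": -1, "right": 1}

def valid_moves(board):
    """Directions the blank can move without leaving the 3x3 grid."""
    row, col = divmod(board.index(0), 3)
    moves = []
    if row > 0:
        moves.append("up")
    if row < 2:
        moves.append("down")
    if col > 0:
        moves.append("left")
    if col < 2:
        moves.append("right")
    return moves

def apply_move(board, move):
    """Swap the blank with the adjacent tile in the given direction."""
    assert move in valid_moves(board), f"invalid move: {move}"
    blank = board.index(0)
    target = blank + OFFSETS[move]
    cells = list(board)
    cells[blank], cells[target] = cells[target], cells[blank]
    return tuple(cells)
```

Even this small interface highlights what a model must do entirely in its context window: track where the blank is, enumerate the legal directions, and update the board after every move. For example, `valid_moves((1, 2, 3, 4, 5, 6, 0, 7, 8))` returns `['up', 'right']`, and applying `"right"` yields `(1, 2, 3, 4, 5, 6, 7, 0, 8)`.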
Across models and conditions, two dominant deficits emerge: fallible representations of the board state, leading to invalid moves, and weak heuristic planning, leading to loops or moves that do not advance the puzzle toward the goal state. This work moves beyond aggregate performance metrics to offer a granular, qualitative analysis of why LLMs struggle with such tasks and how different interventions modulate these failures.
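The notion of "advancing toward the goal state" can be made precise with the Manhattan-distance heuristic, the standard admissible heuristic for sliding-tile puzzles: the sum, over all non-blank tiles, of each tile's grid distance from its goal cell. A minimal sketch, again illustrative rather than the exact metric used in our analysis:

```python
def manhattan_distance(board, goal=(1, 2, 3, 4, 5, 6, 7, 8, 0)):
    """Sum of row and column offsets of every non-blank tile from its goal cell."""
    total = 0
    for idx, tile in enumerate(board):
        if tile == 0:
            continue  # the blank is conventionally excluded from the heuristic
        row, col = divmod(idx, 3)
        goal_row, goal_col = divmod(goal.index(tile), 3)
        total += abs(row - goal_row) + abs(col - goal_col)
    return total
```

A move that makes progress should, over a run, drive this quantity toward zero; the loops and non-advancing moves we observe correspond to sequences under which it stagnates or grows.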