Can Large Language Models Reason and Plan?
The seeming versatility of large language models (LLMs) has, however, led many researchers to wonder whether they can also do well on planning and reasoning tasks typically associated with System 2 competency.
Nothing in the training and use of LLMs would seem to remotely suggest that they can do any type of principled reasoning (which, as we know, often involves computationally hard inference/search). What LLMs are good at is a form of universal approximate retrieval. Unlike databases that index and retrieve data exactly, LLMs, as n-gram models, probabilistically reconstruct completions for the prompt word by word, a process we shall refer to as approximate retrieval. This means that LLMs cannot even guarantee memorizing complete answers, which is the flip side of their celebrated ability to construct “novel” prompt completions on the fly. The boon (“creativity”) and bane (“hallucination”) of LLMs is that n-gram models will naturally mix and match, and have almost as much trouble strictly memorizing as we do. This is indeed the very basis of their appeal.
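To make the contrast with exact retrieval concrete, here is a toy sketch in which a bigram model stands in as a deliberately tiny proxy for an LLM; real models condition on much longer contexts, but the word-by-word probabilistic reconstruction is the same in spirit. The corpus, the database key, and the helper names below are illustrative assumptions.

```python
# Exact retrieval vs. approximate retrieval, in miniature.
import random
from collections import defaultdict

corpus = ("the robot picks the block and the robot "
          "stacks the block on the table").split()

# Exact retrieval: a database returns exactly what was stored.
database = {"robot action": "picks the block"}
print(database["robot action"])  # always the same answer

# Approximate retrieval: estimate P(next word | previous word) from data...
bigram = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram[prev].append(nxt)

# ...then reconstruct a completion word by word by sampling.
def complete(word, length=6):
    out = [word]
    for _ in range(length):
        if word not in bigram:
            break
        word = random.choice(bigram[word])  # mix-and-match, not recall
        out.append(word)
    return " ".join(out)

print(complete("the"))  # e.g. "the robot stacks the block on the" - novel, possibly wrong
```

Note that nothing forces the sampled completion to reproduce any sentence from the corpus verbatim; that is precisely the mix-and-match behavior described above.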
It is challenging to decide whether a system (or a human, for that matter) is memorizing or solving a problem from scratch, especially as systems (or humans) get trained on larger and larger “question banks.” This is a challenge most instructors and interviewers are acutely aware of. Think of that infamous “Why are manhole covers round?” interview question. While it may well have given the interviewer insight into a candidate’s analytical reasoning skills the very first time it was asked, all it does now, with high probability, is confirm whether the candidate trained on the interview question banks!
Perhaps LLMs can’t do planning autonomously straight out of the box, but can they do it with a little nudge? There are broadly two popular techniques for such nudging. The first, called “fine-tuning,” is rather straightforward: take a general LLM and fine-tune it on planning problems (i.e., instances and their solutions), with the hope that it will subsequently make better guesses (see the left-hand side of Figure 1). While our own limited experiments did not show any significant improvement through fine-tuning, it is possible that with even more fine-tuning data and effort, the quality of LLM guesses may well improve. But all that such fine-tuning does is convert the planning task into memory-based approximate retrieval (akin to the memorization/compilation from System 2 to System 1; see Figure 1). It does not prove that LLMs are able to plan.
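As a deliberately minimal sketch of what such fine-tuning amounts to, consider serializing planning instances and their solutions as prompt/completion pairs. The blocksworld instance, the make_example helper, and the file name below are illustrative assumptions, not a specific dataset or training API.

```python
# Fine-tuning data for "planning": just (instance, solution) text pairs.
import json

def make_example(instance: str, plan: list[str]) -> dict:
    """Serialize a planning instance and its solution as a prompt/completion pair."""
    return {
        "prompt": f"Problem:\n{instance}\nPlan:",
        "completion": "\n".join(plan),
    }

examples = [
    make_example(
        "Initial: on(A, table), on(B, A), clear(B). Goal: on(A, B).",
        ["unstack(B, A)", "putdown(B)", "pickup(A)", "stack(A, B)"],
    ),
    # ... thousands more instance/solution pairs ...
]

with open("plan_finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Whatever trainer consumes this file, the model only learns to *complete*
# such pairs; at test time it retrieves and remixes plan-shaped text rather
# than searching over subgoal and resource interactions.
```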
The second popular technique is to have the LLM critique and iteratively refine its own guesses, in the hope that such “self-verification” improves the quality of the final solution. Indeed, two recent studies from my lab, one on plan verification [10] and the other on constraint verification [9], seem to throw cold water on this optimism by showing that with self-verification, performance actually worsens. This is because LLMs hallucinate both false positives and false negatives while verifying the solutions they generate. One reason this is not recognized in the earlier literature is that the self-verification claims there are often made in the context of tacit knowledge tasks for which there is little possibility of a sound verifier (e.g., writing/improving essays), making it harder to evaluate whether the LLM’s critiquing actually helped. Paradoxically, the fact that it is infeasible to write sound verifiers for tacit knowledge tasks also makes it easier to mistake an LLM for being as reasonable a critic as any! In other cases, an external simulator winds up playing the role of a sound verifier.
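The following sketch contrasts the two loops under discussion. Both llm() and verify_plan() are hypothetical stand-ins (any text-completion API, and any sound checker such as a plan simulator, respectively); this illustrates the control flow, not a specific implementation.

```python
# Self-verification vs. sound external verification in a generate-revise loop.
def llm(prompt: str) -> str:
    """Hypothetical LLM call returning generated text (an assumption)."""
    raise NotImplementedError

def verify_plan(plan: str, domain) -> tuple[bool, str]:
    """Hypothetical sound verifier: simulates the plan against a domain
    model and returns (valid?, first error). Its verdicts are never guessed."""
    raise NotImplementedError

def self_verified(problem: str, rounds: int = 3) -> str:
    """The critic is the same fallible generator, so its verdicts carry
    false positives *and* false negatives - performance can worsen."""
    plan = llm(f"Solve: {problem}")
    for _ in range(rounds):
        verdict = llm(f"Answer 'valid' or describe the first flaw:\n{plan}")
        if verdict.strip().lower().startswith("valid"):
            return plan  # may be a hallucinated acceptance
        plan = llm(f"Revise the plan.\nPlan: {plan}\nCritique: {verdict}")
    return plan

def externally_verified(problem: str, domain, rounds: int = 3) -> str:
    """Same loop, but the accept/reject signal comes from a sound checker."""
    plan = llm(f"Solve: {problem}")
    for _ in range(rounds):
        ok, error = verify_plan(plan, domain)
        if ok:
            return plan  # correctness guaranteed by the verifier
        plan = llm(f"Revise the plan.\nPlan: {plan}\nVerifier error: {error}")
    raise RuntimeError("no verified plan within budget")
```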
The skeptical reader might now ask: but what about all those papers at high-profile AI conferences that claim to show the planning abilities of LLMs? To analyze those claims, we first need to understand that solving planning tasks requires (a) having the necessary planning domain knowledge: the actions and their preconditions and effects; the standard hierarchical recipes (e.g., task-reduction schemas in Hierarchical Task Network planning); past cases/plans, and so on; and (b) being able to assemble this knowledge into an executable plan that takes care of any subgoal/resource interactions. The first part can be called knowledge acquisition and the second reasoning/planning. Many of the papers claiming planning abilities for LLMs, on closer examination, wind up confusing general planning knowledge extracted from the LLMs with executable plans. When all we are looking for are abstract plans, such as “wedding plans,” with no intention of actually executing them, it is easy to mistake them for complete executable plans.
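To see why part (b) is the crux, here is a minimal, runnable executability check: progress the state through the plan and fail on the first unmet precondition. The two-action blocksworld encoding is an illustrative assumption, not a specific benchmark.

```python
# Checking that a candidate plan is *executable*, not merely plan-shaped text.
ACTIONS = {
    # name: (preconditions, add effects, delete effects), all sets of facts
    "pickup(A)":  ({"clear(A)", "on(A,table)", "handempty"},
                   {"holding(A)"},
                   {"clear(A)", "on(A,table)", "handempty"}),
    "stack(A,B)": ({"holding(A)", "clear(B)"},
                   {"on(A,B)", "clear(A)", "handempty"},
                   {"holding(A)", "clear(B)"}),
}

def executable(plan, state):
    """Simulate the plan; report the first action whose preconditions fail."""
    state = set(state)
    for step in plan:
        pre, add, dele = ACTIONS[step]
        if not pre <= state:
            return False, f"{step}: missing {pre - state}"
        state = (state - dele) | add
    return True, state

init = {"on(A,table)", "clear(A)", "clear(B)", "handempty"}

# An abstract "plan" that sounds right but skips an interaction check:
print(executable(["stack(A,B)"], init))
# -> (False, "stack(A,B): missing {'holding(A)'}")  not executable
print(executable(["pickup(A)", "stack(A,B)"], init))
# -> (True, {...})  executable
```

Extracting the ACTIONS table from an LLM is the knowledge-acquisition part; producing a sequence that passes this check, while respecting all interactions, is the reasoning/planning part.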