Reasoning with Large Language Models, a Survey
In addition to these associative “System 1” tasks, recent advances in chain-of-thought prompt learning have demonstrated strong “System 2” reasoning abilities, addressing a central question in the field of artificial general intelligence: can LLMs reason?
The field started with the question of whether LLMs can solve grade-school math word problems. This paper reviews the rapidly expanding field of prompt-based reasoning with LLMs. Our taxonomy identifies different ways to generate, evaluate, and control multi-step reasoning. We provide in-depth coverage of core approaches and open problems, and we propose a research agenda for the near future. Finally, we highlight the relation between reasoning and prompt-based learning, and we discuss the relation between reasoning, sequential decision processes, and reinforcement learning. We find that self-improvement, self-reflection, and some metacognitive abilities of the reasoning process are possible through the judicious use of prompts. True self-improvement and self-reasoning, going from reasoning with LLMs to reasoning by LLMs, remain future work.
System 1 tasks, such as associative language tasks, are easily solved by LLMs with prompt-based learning, as the many schoolchildren around the world who use ChatGPT daily can attest. (Although, too often, the problems are not solved correctly, just fluently, when the model’s associative powers lead to hallucination.)
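The “System 2” chain-of-thought prompting mentioned above can be sketched in a few lines. The `llm` function below is a hypothetical stand-in for any text-completion model (it returns a canned reasoning chain so the sketch runs); only the prompt construction and answer extraction reflect the technique.

```python
import re

def llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; returns a canned
    # step-by-step completion so the sketch is runnable offline.
    return ("Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. "
            "5 + 6 = 11. The answer is 11.")

def chain_of_thought(question: str) -> str:
    # Zero-shot CoT: append a trigger phrase that elicits step-by-step reasoning.
    prompt = f"Q: {question}\nA: Let's think step by step."
    completion = llm(prompt)
    # Extract the final answer from the end of the reasoning chain.
    match = re.search(r"The answer is (\S+?)\.?$", completion)
    return match.group(1) if match else completion

print(chain_of_thought(
    "Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?"))
```

The point of the sketch is that the reasoning behavior lives entirely in the prompt text, not in the model’s weights or the surrounding code.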
We note that a simple generate-evaluate-control taxonomy describes the structure of the current LLM reasoning literature well. Furthermore, the accuracy of reasoning chains can be improved with ensemble methods or self-verification. Hallucination can be reduced by grounding the model with external models, such as for robotic affordances, and with information retrieval from search engines and Wikipedia. Going a step further, using external control algorithms (such as search or reinforcement learning) as scaffolding, dynamic prompts can use the LLM to perform complex and interactive reasoning patterns.
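One such ensemble method, majority voting over sampled reasoning chains (self-consistency), can be sketched as follows. The `sample_chain` function is a hypothetical stand-in for sampling one chain from the model at nonzero temperature; its canned answers emulate an occasional wrong chain.

```python
from collections import Counter

def sample_chain(question: str, i: int) -> str:
    # Hypothetical stand-in for one stochastic model call; the canned
    # final answers emulate four correct chains and one faulty one.
    canned = ["11", "11", "12", "11", "11"]
    return canned[i % len(canned)]

def self_consistency(question: str, n: int = 5) -> str:
    # Sample n reasoning chains and majority-vote over their final answers.
    answers = [sample_chain(question, i) for i in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

The design choice here is that the ensemble operates only on the final answers, so a single derailed chain is outvoted without any need to judge the intermediate steps.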
Note that the reasoning control is now two layers away from the core LLM: an external control algorithm, on top of in-context learning, dynamically generates prompts for the LLM. This is reasoning with LLMs, not by them.
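This two-layer arrangement can be sketched as a minimal greedy scaffold: an outer loop, not the LLM, rebuilds the prompt at every step, asks for candidate next reasoning steps, scores them, and keeps the best one. Both `propose` and `score` are hypothetical stand-ins for model or verifier calls.

```python
def propose(prompt: str) -> list[str]:
    # Hypothetical stand-in: ask the LLM for candidate next reasoning steps
    # given the dynamically built prompt.
    return [f"candidate step (a) for: {prompt[-20:]}",
            f"candidate step (b) for: {prompt[-20:]}"]

def score(step: str) -> float:
    # Hypothetical stand-in: ask the LLM or a verifier to rate a step.
    return float(len(step))  # placeholder heuristic

def greedy_scaffold(question: str, max_steps: int = 3) -> list[str]:
    # External control: this loop, not the LLM, decides what happens next.
    chain: list[str] = []
    for _ in range(max_steps):
        # Dynamically rebuild the prompt from the steps kept so far.
        prompt = question + "\n" + "\n".join(chain)
        candidates = propose(prompt)
        chain.append(max(candidates, key=score))  # greedy selection
    return chain
```

Replacing the greedy `max` with a beam or tree search turns the same skeleton into the more elaborate search-based scaffolds surveyed in this paper.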
It is interesting to note the confluence of the two schools of classical artificial intelligence (AI), symbolic and connectionist. Search and reinforcement learning are rooted in the symbolic AI tradition, while LLMs are rooted in the connectionist tradition. The literature in this survey combines the two traditions: high-performance reasoning is created with a (symbolic) searcher/learner on top of a (connectionist) LLM.
Most of the reasoning capabilities exhibited by LLMs are due to the great representational power of the transformer architecture, and to how in-context learning is able to harness it. Prompt engineering and prompt control play a crucial role in the kinds of reasoning that we have seen in the papers. Models can be instructed to write their own reasoning prompts; however, such Auto-GPT or Auto-CoT prompts need evaluation, verification, and grounding in the real world to prevent degeneration into a hallucinatory world of their own. Models can also be instructed to interact with the world, becoming the tool of external scaffolding that evaluates, controls, and improves the prompts. Some of what we experience as reasoning by the LLM is controlled by the prompt or the scaffolding algorithm. It is an open question whether prompt learning is able to get the LLM to create a prompt that exhibits non-trivial reasoning by itself.
From the symbolic planning field there is also a critical view on the reasoning and planning abilities of LLMs [Valmeekam et al., 2023], with examples of planning failures. They argue that LLMs can instead be used to improve heuristic elements of traditional planners, such as those based on PDDL [Kambhampati et al., 2024], to strengthen traditional symbolic planning approaches.
Some of the names of the approaches surveyed in this paper are suggestive of self-awareness and self-reflective capabilities. True self-reflection, or metacognition, is still largely beyond the capabilities of current LLMs. LLMs can be prompted to reason, to take small steps, and to self-evaluate, and their search process can be controlled by an external algorithm. The self-reflective kind of “intelligence” is written into the prompt by the prompt engineer or the interactive algorithm. We are unaware of any LLM that has been made to reflect on, or even control, its own reasoning processes: deciding how many reasoning steps to take, or limiting its reasoning once the answer has become good enough. True self-reflection remains future work, although some steps have been taken, as we discuss next.
Control and prompt learning—Search control beyond greedy search is implemented as an external algorithm. Is it possible to incorporate all stages of the reasoning pipeline into an interactive prompt? Can we make a prompt that performs dynamic, search-like step control without external scaffolding?
Symbolic and Connectionist Computation—How can we further improve LLM reasoning: how can LLMs benefit from symbolic reasoning prompts and how can LLMs help ground symbolic reasoning in language?
Metacognition—Much of the research into reasoning guides the model in how it should solve a problem. Is it helpful to introduce named concepts for different kinds of reasoning? Can the model find these concepts by itself? Making the LLM “think” step by step is a first step towards influencing the model’s own “thought” processes. The first works on LLM metacognition have appeared, and research on artificial general intelligence will pursue this further.