Chain of Thoughtlessness? An Analysis of CoT in Planning
While our problems are very simple, we find meaningful performance improvements from chain of thought prompts only when those prompts are exceedingly specific to their problem class, and these improvements quickly deteriorate as the size n of the query-specified stack grows past the size of the stacks shown in the examples. We also create scalable variants of three domains commonly studied in previous CoT papers and demonstrate the existence of similar failure modes. Our results hint that, contrary to previous claims in the literature, CoT’s performance improvements do not stem from the model learning general algorithmic procedures via demonstrations, but instead depend on carefully engineering highly problem-specific prompts. This spotlights drawbacks of chain of thought, especially the sharp tradeoff between possible performance gains and the amount of human labor necessary to generate examples with correct reasoning traces.
While initial anecdotal results were unexpectedly impressive [8], follow-up systematic studies showed that, outside of limited, non-generalizable classes of problems, these models generally perform poorly on basic, multi-hop reasoning tasks [17] ranging from arithmetic [35] and logic puzzles [14] to constraint satisfaction [42, 2] and classical planning [47].
The subfield of prompt engineering [36] has grown rapidly, promising improvements in performance without retraining. A core tenet of this subfield is that LLMs are capable of powerful in-context learning [12, 56], that is, capable of intelligently using additional context provided in a prompt to correctly respond to queries that would otherwise be answered incorrectly. Generally, this requires operationalizing algorithmic/procedural advice, and, in principle, learning such procedures includes being able to apply them effectively beyond syntactically similar instances.
The foundational method for inducing in-context learning is the chain of thought approach, which has been claimed to "unlock the reasoning abilities of LLMs" [50]. To create a chain of thought (CoT) prompt, a user annotates similar problems with intermediate reasoning steps and prepends them to the standard prompt. These annotations are meant as demonstrations, intended to teach a procedure applicable to both the examples and the new query. When prompted like this, the LLM is expected to output a similar series of reasoning steps prior to the new answer. Numerous studies have claimed that this procedure significantly enhances LLM performance in complex reasoning tasks [49, 54, 39, 56, 52, 43]. However, in general it is unclear how "similar" the examples need to be to the query, how broadly any given chain of thought prompt will apply, and, most importantly, how much human effort is necessary to craft prompts specific to each problem subclass. Follow-up work has claimed that merely adding magic phrases ("let’s think step by step") to every prompt is sufficient for some improvement [26]. While in some domains this technique has proven to be even more brittle than manual CoT, in others it has achieved the same performance increases, hinting that improvements observed with CoT may not indicate as much about LLMs’ general in-context learning abilities as previously thought.
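To make the mechanics concrete, the following minimal Python sketch shows how a few-shot CoT prompt is assembled from hand-annotated examples. The example problem, reasoning trace, and helper names below are illustrative placeholders, not our actual prompts.

```python
# Minimal sketch of few-shot chain-of-thought prompt assembly.
# The annotated example below is an illustrative placeholder,
# not one of the prompts used in our experiments.

COT_EXAMPLES = [
    {
        "problem": ("The blue block is on the table and the red block is "
                    "on the table. Stack the blue block on the red block."),
        "thought": ("The blue block is clear and on the table, and the red "
                    "block is clear, so I can pick up the blue block and "
                    "stack it on the red block."),
        "answer": "pick up the blue block; stack the blue block on the red block",
    },
    # ... more hand-annotated examples of the same problem class ...
]

def build_cot_prompt(query: str) -> str:
    """Prepend annotated examples (with intermediate reasoning) to the query."""
    parts = []
    for ex in COT_EXAMPLES:
        parts.append(f"Problem: {ex['problem']}\n"
                     f"Reasoning: {ex['thought']}\n"
                     f"Answer: {ex['answer']}\n")
    parts.append(f"Problem: {query}\nReasoning:")
    return "\n".join(parts)
```

The model is then expected to continue from "Reasoning:" with its own intermediate steps before producing an answer.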
We are interested in the tradeoff between possible performance gains from chain of thought prompt engineering and the amount of human labor necessary to generate examples with useful reasoning traces. Ideally, a properly constructed prompt should teach the LLM how to robustly generalize a basic algorithmic procedure in order to increase performance on a large class of problems, thereby converting a modest amount of human teaching effort into a significant capability boost. Unfortunately, this only seems to be possible to a very limited extent [14].
Chain of thought approaches only improve performance when the hand-annotated examples are sufficiently similar to the current query.
Modifying text prompts to elicit intermediate problem-solving steps from LLMs originally took the form of scratchpads [33]. [50] proposed a similar prompt style in natural language, dubbing this approach chain of thought (CoT), and claiming that, with some human hand-annotation of examples, this not only boosts performance without retraining, but "allows reasoning abilities to emerge naturally". They argued that by merely interspersing intermediate reasoning steps in natural language into examples, they were inducing the LLM to "learn via a few examples", motivating this idea with anthropomorphizations ("Consider one’s own thought process when solving a complicated reasoning task such as a multi-step math word problem"). [26] argued that some of the performance of CoT could be retained without providing any examples, by instead just appending the magic phrase "let’s think step by step" to the end of the prompt. This has been called zero-shot CoT.
However, CoT has long been known to be imperfect and incomplete. Previous work has investigated improving the consistency of CoT through self-consistency [49], multi-agent debate [13], least-to-most prompting [55], deductive verification [28], and other approaches. Unfortunately, many of these involve prompting the LLM multiple times for a single problem, which can balloon the cost of inference. Other work has examined the possibility of reducing or removing the need for human annotation of examples by using LLMs to generate their own examples automatically [54, 9]. To avoid well-known issues with the brittleness of LLM self-verification and self-teaching [42, 22, 20, 19, 24], we restrict this paper’s scope to manually written chains of thought.
While early accounts claimed LLMs, despite not being trained for it, were capable of reasoning and planning [8], later work showcased serious brittleness across these domains [47]. [50] claims that "standard prompting only provides a lower bound on the capabilities of large language models", with proper prompting allowing reasoning to "emerge naturally." Recent work seems to maintain this optimism [7]. In this paper, we examine the effectiveness of CoT in the context of classical planning problems, which have well-defined and algorithmically checkable ground truths, can be generated with arbitrary size and difficulty, and are unlikely to be in the training data. If CoT induces more than just pattern matching, and can in fact teach LLMs to perform generalizable, compositional reasoning, then we should expect that to be reflected in robust and maintainable improvements on a simple commonsense benchmark set like Blocksworld, and we should expect these results to hold for scaled variants of the very benchmarks tested in [50] and later CoT work.
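A key property of this testbed is that candidate plans can be verified mechanically. The sketch below is an illustrative re-implementation of a Blocksworld plan checker; the state encoding and action tuple format are our own assumptions, and production validators (e.g., VAL) serve the same role in practice.

```python
# Illustrative Blocksworld plan checker: any proposed plan can be
# verified mechanically against the domain's action semantics.
# The state encoding and action format are our own assumptions.

def clear(state, block):
    """A block is clear if no other block rests on it."""
    return all(support != block for support in state["on"].values())

def apply(state, action):
    """Apply one action; return the new state, or None if preconditions fail."""
    on, hold = dict(state["on"]), state["holding"]
    kind, *args = action
    if kind == "pickup":            # pick a clear block up off the table
        (b,) = args
        if hold or on.get(b) != "table" or not clear(state, b):
            return None
        del on[b]; hold = b
    elif kind == "putdown":         # put the held block down on the table
        (b,) = args
        if hold != b:
            return None
        on[b] = "table"; hold = None
    elif kind == "stack":           # stack the held block onto a clear block
        b, target = args
        if hold != b or target not in on or not clear(state, target):
            return None
        on[b] = target; hold = None
    elif kind == "unstack":         # pick a clear block up off another block
        b, target = args
        if hold or on.get(b) != target or not clear(state, b):
            return None
        del on[b]; hold = b
    else:
        return None
    return {"on": on, "holding": hold}

def validates(state, plan, goal):
    """Check that `plan` executes from `state` and achieves every goal atom."""
    for action in plan:
        state = apply(state, action)
        if state is None:
            return False
    return all(state["on"].get(b) == support for b, support in goal.items())

# Example: stack b on a, starting with both blocks on the table.
init = {"on": {"a": "table", "b": "table"}, "holding": None}
assert validates(init, [("pickup", "b"), ("stack", "b", "a")], {"b": "a"})
```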
Drawing on metaphors of human learning, recent literature has claimed that LLMs are capable of in-context learning. The basic idea is that–by first presenting the model with examples of similar problems–it is possible to cause an LLM to acquire relevant new skills within the current context window.
Chain of thought [50] approaches take this further, presenting human-crafted "thoughts" which the LLM is intended to imitate in its response. Practitioners argue that, intuitively, these augmented examples teach the LLM how to solve problems in the given set.
The difficulty of a Blocksworld instance scales with the number of blocks involved, allowing us to clearly assess the out-of-domain generalization achievable with and without chain of thought. As shown in Figure 3, chain of thought does not generalize beyond a handful of blocks. Note that sound planning systems (such as Fast Downward) achieve 100% accuracy on all problems tested.
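This scalability is straightforward to realize programmatically. As a hedged illustration (a sketch of the setup, not our actual generation code), the following builds n-block table-to-stack instances along with the bottom-up reference plan that any sound planner finds trivially, which is what yields the 100% baseline:

```python
import random

# Illustrative generator for n-block table-to-stack instances; instance
# difficulty for the LLM grows with n. This is a sketch of the setup,
# not our actual generation code.

def table_to_stack_instance(n: int, seed: int = 0):
    """All n blocks start on the table; the goal is one query-specified stack."""
    rng = random.Random(seed)
    order = [f"b{i}" for i in range(n)]   # top-to-bottom order of the goal stack
    rng.shuffle(order)
    init = {"on": {b: "table" for b in order}, "holding": None}
    goal = {top: below for top, below in zip(order, order[1:])}
    return init, goal

def reference_plan(goal):
    """Ground-truth solution: build the goal stack from the bottom up."""
    if not goal:
        return []
    top = next(b for b in goal if b not in goal.values())
    order = [top]
    while order[-1] in goal:
        order.append(goal[order[-1]])
    plan = []
    for block, below in reversed(list(zip(order, order[1:]))):
        plan += [("pickup", block), ("stack", block, below)]
    return plan
```

Paired with the checker sketched earlier, this gives an end-to-end harness: generate an instance at any n, query the model, and mechanically validate the returned plan.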
With the stacking CoT prompt, performance improves to 59.3%. Is this a result of the model learning in-context how to reason correctly over this type of problem? If so, we might expect it to perform the same when presented with a more general CoT prompt that demonstrates the same procedure, but is applicable to a greater variety of problems.
To check this, we evaluate performance on table-to-stack problems with prompts of varying granularity: standard I/O prompting, general n-shot examples (drawn from arbitrary Blocksworld problems), goal-specific n-shot examples (drawn from table-to-stack problems), and three levels of CoT specificity. Table 3 shows the results: only the most specific and least broadly applicable prompt retains anywhere near this performance improvement. Figure A.1.1 in the appendix further illustrates that none of the prompts provide robust stack-height generalization. We also tested self-consistency [49] on these prompts, but found that performance dropped. Details can be found in Appendix A.2.
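For reference, self-consistency [49] samples several reasoning chains at nonzero temperature and majority-votes over the extracted final answers. A minimal sketch follows, where sample_llm and extract_answer are hypothetical stand-ins for the model interface and answer parser, not our actual experiment harness.

```python
from collections import Counter
from typing import Callable

# Sketch of self-consistency [49]: sample k chains of thought and take a
# majority vote over the final answers. `sample_llm` and `extract_answer`
# are hypothetical stand-ins, not our actual experiment harness.

def self_consistency(prompt: str,
                     sample_llm: Callable[[str], str],
                     extract_answer: Callable[[str], str],
                     k: int = 10) -> str:
    answers = [extract_answer(sample_llm(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```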
If chain of thought is meant to replicate human thinking or learning, it should generalize beyond the most direct pattern matches and allow for more robust reasoning across similar problems. However, our results show only a modest improvement in performance on some domains, achieved only with sufficiently specific prompting strategies, and even that improvement quickly deteriorates when the queried problems become slightly larger than those shown in the examples.
7 Conclusion
In this paper, we conducted a systematic evaluation of the effectiveness of chain of thought in large language models on a specific classical planning problem. Our case study indicates that, contrary to previous claims in the literature, providing examples of procedural reasoning does not induce the general ability to apply that procedure to novel instances in current state-of-the-art large language models. In fact, the performance improvements seen when prompting LLMs in this manner quickly vanish when queries differ in generality from the examples, despite the fact that the same algorithmic procedure applies to the larger or more general instance.
Overall, our results hint that basic pattern matching, rather than in-context learning of general algorithmic procedures, may better explain the improvements seen from chain of thought.