Can Large Language Models Really Improve by Self-critiquing Their Own Plans?

Paper · arXiv 2310.08118 · Published October 12, 2023

There have been widespread claims about Large Language Models (LLMs) being able to successfully verify or self-critique their candidate solutions in reasoning problems in an iterative mode. Intrigued by those claims, in this paper we set out to investigate the verification/self-critiquing abilities of large language models in the context of planning. We evaluate a planning system that employs LLMs for both plan generation and verification. We assess the verifier LLM’s performance against ground-truth verification, the impact of self-critiquing on plan generation, and the influence of varying feedback levels on system performance. Using GPT-4, a state-of-the-art LLM, for both generation and verification, our findings reveal that self-critiquing appears to diminish plan generation performance, especially when compared to systems with external, sound verifiers, and that the LLM verifier produces a notable number of false positives, compromising the system’s reliability. Additionally, the nature of feedback, whether binary or detailed, showed minimal impact on plan generation. Collectively, our results cast doubt on the effectiveness of LLMs in a self-critiquing, iterative framework for planning tasks.

Early triumphant anecdotes about LLMs’ reasoning capabilities began to wane with the arrival of more systematic studies.

However, these studies still rely solely on the verification/self-critiquing abilities of the LLMs, an aspect our work critically examines in the context of planning. Our results provide compelling reasons to question the use of LLMs for self-critiquing in planning.

It’s worth noting that there are no constraints on the type or format of feedback the verifier LLM produces. The system ceases generation either when the verifier LLM approves the candidate plan as valid or when the number of prompting iterations exceeds a set threshold (for our experiments, this threshold is set at 15 iterations). This method is similar to the backprompting technique described in [12]. However, the main distinction lies in the type of verifier employed. In our system, both the verifier and generator are LLMs, whereas the referenced approach utilizes an external sound verifier, VAL [4]. For all our experiments, GPT-4 serves as the default LLM.
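To make the iterative setup concrete, the following is a minimal sketch of the generate-and-verify loop described above, where a single LLM (GPT-4 in the paper) plays both the generator and the verifier roles and generation stops on approval or after 15 backprompting rounds. The helper names (call_gpt4, generate_plan, verify_plan), the prompt wording, and the naive parsing of the verifier’s verdict are illustrative assumptions, not the paper’s actual prompts or code.

```python
# Sketch of the LLM+LLM backprompting loop, assuming a hypothetical
# call_gpt4() wrapper around the GPT-4 API. Prompts are placeholders.

MAX_ITERATIONS = 15  # iteration threshold used in the experiments


def call_gpt4(prompt: str) -> str:
    """Placeholder for a GPT-4 API call; returns the model's text output."""
    raise NotImplementedError


def generate_plan(problem: str, feedback: str | None = None) -> str:
    """Ask the generator LLM for a candidate plan, optionally with verifier feedback."""
    prompt = f"Problem:\n{problem}\n"
    if feedback:
        prompt += f"Your previous plan was judged invalid. Verifier feedback:\n{feedback}\n"
    prompt += "Provide a plan."
    return call_gpt4(prompt)


def verify_plan(problem: str, plan: str) -> tuple[bool, str]:
    """Ask the verifier LLM whether the plan is valid; feedback format is unconstrained."""
    critique = call_gpt4(
        f"Problem:\n{problem}\nCandidate plan:\n{plan}\nIs this plan valid? Explain."
    )
    is_valid = critique.strip().lower().startswith("valid")  # naive parsing, for illustration only
    return is_valid, critique


def self_critique_loop(problem: str) -> str | None:
    """Iterate generation and self-critique until approval or the iteration cap."""
    feedback = None
    for _ in range(MAX_ITERATIONS):
        plan = generate_plan(problem, feedback)
        ok, feedback = verify_plan(problem, plan)
        if ok:
            return plan  # verifier LLM approved the candidate plan
    return None  # threshold exceeded without an approved plan
```

In the backprompting setup of [12], verify_plan would instead query the external sound verifier VAL [4]; the only structural change in this sketch would be swapping out that one function.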

We showed that the verifier LLM generates a significant number of false positives, which can be detrimental to the overall system’s reliability, particularly in domains where the correctness of plans is paramount. Interestingly, the nature of feedback, whether binary or detailed, did not have a pronounced impact on plan generation performance, suggesting that the core issue lies in the LLM’s binary verification capabilities rather than in the granularity of the feedback.
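As an illustration of how such verifier verdicts can be scored against ground-truth verification (e.g., verdicts from an external sound verifier such as VAL), the small helper below computes a false positive rate: the fraction of actually invalid plans that the LLM verifier nonetheless accepted. The list-of-booleans data format is a hypothetical convenience, not the paper’s evaluation code.

```python
def false_positive_rate(llm_verdicts: list[bool], ground_truth: list[bool]) -> float:
    """Fraction of invalid plans (ground truth False) that the LLM verifier marked valid."""
    on_invalid = [llm for llm, gt in zip(llm_verdicts, ground_truth) if not gt]
    return sum(on_invalid) / len(on_invalid) if on_invalid else 0.0
```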

In the future, we plan to conduct more extensive experiments, varying the number of instances, the number of domains, and the prompting methods used (such as chain-of-thought).