Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data

Paper · arXiv 2302.12822 · Published February 24, 2023
Tags: Reasoning Methods · CoT · ToT · Prompts · Prompting

“The recent success in large language models (LLMs) has shown that properly prompted LLMs demonstrate emergent capabilities on complex understanding and question-answering tasks (Wei et al., 2022a). In particular, with the recently proposed chain-of-thought (CoT) prompting (Wei et al., 2022b), LLMs are capable of solving reasoning tasks including arithmetic reasoning, commonsense reasoning, and symbolic reasoning. The basic idea of CoT prompting is adding a few rationale chains to the answer as exemplars to illustrate the intermediate reasoning steps. Following CoT, several recent studies improve it by leveraging self-consistency (Wang et al., 2023), explanation learning (Lampinen et al., 2022), complexity-based prompting (Fu et al., 2023), self-training (Huang et al., 2022), voting verifier (Li et al., 2022a), and bootstrapping (Zelikman et al., 2022).

However, most of them are constrained to a few fixed human-written exemplars, which require significant human efforts to create and adapt to new datasets. The annotation process is nontrivial because humans need to not only select the questions but also carefully design the reasoning steps for each question. In the process of searching for the perfect exemplars, we identify four critical factors that affect the performance of chain-of-thought prompting and require large human effort to deal with: (1) order sensitivity: the order combination of the exemplars; (2) complexity: the number of reasoning steps of the rationale chains; (3) diversity: the combination of different complex-level exemplars; (4) style sensitivity: the writing/linguistic style of the rationale chains. Detailed analysis of the four factors is covered in Section 2. All of these sensitivities make human-based prompt engineering costly and motivate us to find an automatic and task-agnostic way to adapt chain-of-thought exemplars to any downstream tasks.

In this paper, we solve this problem with a CoT augmentation and selection process that finds suitable exemplars automatically. The process has three steps: (1) Augment: The language model automatically generates multiple pseudo rationale chains for query questions. (2) Prune: We rely on the assumption that generating correct reasoning is a necessary condition for generating correct answers. This assumption is natural because the answer is produced only after several reasoning steps, so when a generated answer is correct, the rationale chain behind it is most likely correct as well. We therefore prune the pseudo-chains according to the consistency between generated and ground-truth answers to reduce noise. (3) Select: Given that all the data have now been annotated with rationale paths, we apply a variance-reduced policy gradient strategy (Williams, 1992; Dong et al., 2020; Zhou et al., 2021; Diao et al., 2022) to estimate gradients and optimize the selection process, finding the most helpful chain-of-thought exemplars for each task. Compared to prior manually written CoT, Automate-CoT finds optimal and diverse CoT exemplars automatically and adapts to any task without human effort. Compared with Auto-CoT (Zhang et al., 2023), which samples diverse questions by clustering and generates rationale chains, Automate-CoT considers and mitigates the aforementioned sensitivity issues while achieving a greater performance boost on each task.”
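The augment and prune steps can be sketched in a few lines. This is an illustrative stand-in, not the paper's implementation: `generate` would be a few-shot LLM call in practice, and `toy_generate` is a hypothetical random stub so the sketch is runnable.

```python
import random

def augment_and_prune(questions, answers, generate, k=5):
    """Augment: sample k pseudo rationale chains per question.
    Prune: keep only chains whose final answer matches the gold label."""
    pool = []
    for q, gold in zip(questions, answers):
        for _ in range(k):
            chain, pred = generate(q)  # (rationale text, predicted answer)
            if pred == gold:           # correct answer -> chain is likely sound
                pool.append((q, chain))
    return pool

# Hypothetical stand-in for the LLM call: sometimes right, sometimes wrong.
def toy_generate(q):
    pred = random.choice(["4", "5"])
    return f"Step 1: evaluate {q}. The answer is {pred}.", pred
```

Pruning keeps only chains whose answer matches the ground truth, so the surviving pool is biased toward correct intermediate reasoning, per the assumption above.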

“Recent studies have observed sensitivity issues in GPT-3’s few-shot learning caused by different selections of in-context examples, such as order instability (Zhao et al., 2021; Zhang et al., 2022; Liu et al., 2022; Lu et al., 2022). Building on their findings, we first investigate whether these sensitivities persist in chain-of-thought methods. We then explore additional factors that not only affect performance but also require human effort to address. We identify the following four factors:

Order Sensitivity: Different orders of few-shot exemplars can have a large impact on performance in traditional few-shot prompting (Lu et al., 2022). We therefore conduct experiments on GPT-3 to test whether such sensitivity also exists in chain-of-thought methods. Although Manual-CoT (Wei et al., 2022b) reports that human-written CoT is robust to order changes (<2%) with the LaMDA model, we observe that GPT-3’s performance fluctuates with different orders of chain-of-thought exemplars. On the GSM8K dataset, we randomly shuffled the order of the exemplars in Manual-CoT 10 times; the lowest accuracy was 59.8%, which is 3.3% below the average accuracy (63.1%) they report, suggesting that order sensitivity still exists.
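The shuffling probe above amounts to a short loop. A minimal sketch, assuming a black-box `evaluate` that scores one exemplar ordering; `toy_evaluate` is a hypothetical stub (integer accuracy percentages) so the code runs without a model:

```python
import random
import statistics

def order_spread(exemplars, evaluate, n_shuffles=10, seed=0):
    """Shuffle exemplar order n_shuffles times and report the accuracy
    spread (min, mean, max), mirroring the GSM8K probe described above."""
    rng = random.Random(seed)
    accs = []
    for _ in range(n_shuffles):
        order = exemplars[:]
        rng.shuffle(order)
        accs.append(evaluate(order))  # accuracy of the CoT prompt in this order
    return min(accs), statistics.mean(accs), max(accs)

# Hypothetical evaluator: accuracy depends (artificially) on the first exemplar.
def toy_evaluate(order):
    return 60 + order[0]  # integer percent, e.g. 60..67 for 8 exemplars
```

A large gap between the minimum and the mean, as observed for GSM8K, is the signature of order sensitivity.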

Complexity: We define complexity as the number of hops (reasoning steps) in an exemplar, where more steps indicate greater complexity. We observe that human-written CoT tends to be simple (≤3 hops), achieving good accuracy on simple math questions while suffering on complex ones, as shown in Figure 1. In addition, a previous study (Fu et al., 2023) suggested that using all complex exemplars can improve CoT performance. However, in our experiments (Figure 1), we found that Complex-CoT improves accuracy on complex questions but performs poorly on simple ones. We therefore conjecture that a mismatch between the hops of the provided exemplars and the hops required by the actual question causes the performance drop, suggesting that determining the appropriate complexity level of exemplars is crucial.
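The hop count used here can be approximated mechanically. A minimal sketch, assuming (as is common in CoT exemplars) that each reasoning step sits on its own line of the rationale:

```python
def count_hops(rationale: str) -> int:
    """Complexity proxy: the number of reasoning steps, approximated
    here as the non-empty lines of the rationale chain."""
    return sum(1 for line in rationale.splitlines() if line.strip())

# A simple 3-hop rationale in the style of a GSM8K exemplar.
simple_chain = (
    "Tom has 3 apples.\n"
    "He buys 2 more apples.\n"
    "So he has 3 + 2 = 5 apples."
)
```

Counting lines is only a heuristic; splitting on sentence boundaries would serve equally well, since all that matters is a consistent ordering of exemplars by complexity.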

Diversity: Given the above findings on complexity, a natural question is which combination of exemplars at different complexity levels is most effective. Testing the various combinations by hand is challenging and requires significant effort to determine the optimal one. In our experiments (Figure 1), we found that a mix of exemplars at different complexity levels outperforms CoT built from only complex exemplars, suggesting a complexity-diversity trade-off.
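One simple way to realize such a mix is to bucket candidate exemplars by hop count and draw from the buckets round-robin. This is a hypothetical construction for illustration, not the paper's selection method (which is learned, as described later):

```python
from collections import defaultdict

def mixed_complexity_selection(pool, hops_of, k):
    """Pick k exemplars spread across complexity levels (round-robin over
    hop-count buckets) instead of taking only the most complex ones."""
    buckets = defaultdict(list)
    for ex in pool:
        buckets[hops_of(ex)].append(ex)
    levels = sorted(buckets)          # ascending complexity levels
    chosen, i = [], 0
    while len(chosen) < k and any(buckets[l] for l in levels):
        level = levels[i % len(levels)]
        if buckets[level]:
            chosen.append(buckets[level].pop(0))
        i += 1
    return chosen
```

With a pool spanning 2-, 5-, and 9-hop exemplars and k=3, this yields one exemplar per level, i.e. a maximally diverse prompt under the budget.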

Style Sensitivity: Previous research in educational psychology found that different learning styles limit the cognitive benefit students derive from prompting (Papadopoulos et al., 2010). We further argue that students with specific learning styles benefit to varying degrees from different styles of prompting. Moreover, empirical evidence from Manual-CoT (Wei et al., 2022b) shows that different annotators can cause up to a 28.2% accuracy difference in a symbolic reasoning task, supporting our conjecture. Consequently, a poorly chosen style can lead to a large performance drop. However, humans cannot determine the performance of a particular style beforehand, so evaluating one requires trial and error on a validation set, which further increases the effort of writing chain-of-thought exemplars.

In light of this empirical evidence, we are motivated to design a framework that not only augments the rationale chains but also selects the helpful rationale chains adaptively. With this framework, we expect to bypass the order and style sensitivity issues and reach a better complexity-diversity trade-off without human effort, ultimately boosting performance.”
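The adaptive selection step can be sketched as a plain REINFORCE loop with a moving-average baseline, the classic variance-reduction device (Williams, 1992). This is an illustrative stand-in, not the exact variance-reduced estimator of the cited works; the `reward` callback (e.g. dev-set accuracy of the assembled prompt) is an assumption:

```python
import math
import random

def select_exemplars(n_candidates, k, reward, steps=300, lr=0.5, seed=0):
    """Learn a categorical distribution over candidate chains for each of
    the k prompt slots; subtract a moving-average baseline from the reward
    to reduce the variance of the policy-gradient estimate."""
    rng = random.Random(seed)
    logits = [[0.0] * n_candidates for _ in range(k)]
    baseline = 0.0
    for _ in range(steps):
        choices, probs = [], []
        for slot in logits:
            z = [math.exp(v) for v in slot]          # softmax over candidates
            p = [v / sum(z) for v in z]
            c = rng.choices(range(n_candidates), weights=p)[0]
            choices.append(c)
            probs.append(p)
        r = reward(choices)            # e.g. accuracy of this prompt on a dev set
        adv = r - baseline             # baseline-subtracted (variance-reduced) signal
        baseline = 0.9 * baseline + 0.1 * r
        for slot, c, p in zip(logits, choices, probs):
            for j in range(n_candidates):
                grad = (1.0 if j == c else 0.0) - p[j]   # d log pi / d logit_j
                slot[j] += lr * adv * grad               # gradient ascent
    return [max(range(n_candidates), key=slot.__getitem__) for slot in logits]
```

Because the reward is queried only on sampled prompts, the LLM is treated as a black box: no backpropagation through the model is needed, which is what makes the approach practical with API-only access.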