Boosted Prompt Ensembles for Large Language Models
We propose a prompt ensembling method for large language models, which uses a small dataset to construct a set of few-shot prompts that together comprise a “boosted prompt ensemble”. The few-shot examples for each prompt are chosen in a stepwise fashion to be “hard” examples on which the previous step’s ensemble is uncertain.
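To make the selection step concrete, the sketch below scores each training example by the agreement among answers sampled from the current ensemble and keeps the least certain ones. This is a minimal sketch, not the paper's actual implementation: the `generate(prompt, question)` helper, the `n_samples` and `k` parameters, and the dataset format are all illustrative assumptions.

```python
from collections import Counter

def answer_distribution(ensemble, question, generate, n_samples=5):
    """Sample answers from every prompt in the ensemble and tally them.
    `generate` is a hypothetical helper that queries the LLM with one
    few-shot prompt plus the question and returns a final answer string."""
    answers = []
    for prompt in ensemble:
        for _ in range(n_samples):
            answers.append(generate(prompt, question))
    return Counter(answers)

def hard_examples(ensemble, dataset, generate, k=4):
    """Return the k training examples the current ensemble is least sure
    about, measured by the vote share of the most common sampled answer."""
    def confidence(example):
        counts = answer_distribution(ensemble, example["question"], generate)
        return counts.most_common(1)[0][1] / sum(counts.values())
    return sorted(dataset, key=confidence)[:k]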
This “prompt engineering” can entail substantial manual effort for each individual task, and it is often unclear how specific prompt choices affect performance (Zhou et al., 2022). For example, one recent work recommends prompts with the “longest questions” and most “complex reasoning” (Fu et al., 2022), while another suggests “only considering shorter questions with shorter rationales” (Zhang et al., 2022).
Boosted prompting, inspired by classical boosting algorithms (Freund et al., 1999), is a stagewise ensemble method that iteratively adds prompts to an ensemble so as to improve performance on problems just outside the frontier of what the model can currently solve (Baranes & Oudeyer, 2013). See Figure 1 for a conceptual illustration.
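In code, the stagewise loop amounts to repeatedly building a new few-shot prompt from examples the current ensemble finds hard and appending it to the ensemble. The sketch below reuses `hard_examples` from the snippet above; `make_prompt`, a hypothetical formatter that would turn hard examples (with model-generated chains of thought) into a few-shot prompt, and the number of `rounds` are assumptions for illustration.

```python
def boosted_prompt_ensemble(seed_prompt, dataset, generate, make_prompt, rounds=5):
    """Stagewise construction of a boosted prompt ensemble (a sketch).
    Reuses hard_examples() from the earlier snippet; make_prompt is a
    hypothetical formatter that turns hard examples into a few-shot prompt."""
    ensemble = [seed_prompt]
    for _ in range(rounds):
        hard = hard_examples(ensemble, dataset, generate)
        ensemble.append(make_prompt(hard))
    return ensemble
```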
It has been observed that language model performance can be sensitive to the chosen prompt (Zhao et al., 2021), which has led to in-depth studies of prompting methodology (Liu et al., 2023; Wang et al., 2022a) and the development of several approaches to automatic prompt generation (Shin et al., 2020; Gao et al., 2020). While some of these approaches are gradient-based (Li & Liang, 2021; Qin & Eisner, 2021), requiring access to the model gradients, others are based on sampling (Zhou et al., 2022) or elicited via a prompt-based algorithm (Li et al., 2022a). For the purpose of collecting chain-of-thought annotations, a handful of past works have considered self-generating chains of thought (Zhang et al., 2022; Huang et al., 2022) and possibly validating them using the ground-truth answers (Zelikman et al., 2022). We draw inspiration from these works and use model-generated chains of thought when forming boosted prompts.
We use disagreement among ensemble members both as a proxy for example informativeness and, for the test-time version of our algorithm, as a measure of confidence in the correctness of the model’s prediction.
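At test time, the same agreement signal yields both a majority-vote answer and a confidence score, as in the sketch below, which reuses `answer_distribution` from the first snippet; again, the helper names and sampling parameters are illustrative assumptions rather than the paper's interface.

```python
def ensemble_predict(ensemble, question, generate, n_samples=5):
    """Majority-vote answer plus an agreement-based confidence score (a sketch).
    A low score signals high disagreement among ensemble members."""
    counts = answer_distribution(ensemble, question, generate, n_samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / sum(counts.values())
```

Under this framing, the confidence score is simply the vote share of the winning answer, so disagreement (a fragmented vote) directly lowers the reported confidence.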