Do Prompt-Based Models Really Understand the Meaning of Their Prompts?

Paper · arXiv 2109.01247 · Published September 2, 2021
Tags: Prompts · Prompting · Reasoning · Critiques

“While recent years saw a gold rush of papers (summarized in §2) proposing automatic methods for optimizing prompts, Logan IV et al. (2021) compare a representative sample of these newly proposed methods and report that Schick and Schütze (2021b)’s manually written prompts still, on average, outperform the automatically searched prompts across a range of SuperGLUE tasks (Wang et al., 2019). Such findings suggest that expert-crafted prompts are among the best, if not the best, which reinforces the hypothesis that models benefit from meaningful instructions.

In this paper, we test this hypothesis by evaluating various language models on NLI in zero-shot and few-shot settings, using more than 30 manually written templates and 13 sets of LM target words for a total of over 390 prompts. We find that in most cases models learn just as fast when given irrelevant or misleading templates as they do when given instructive templates. Further, models ranging from 235 million to 175 billion parameters all exhibit this behavior, as do instruction-tuned models that are trained on hundreds of manually written prompts. While we confirm Sanh et al. (2021)’s finding that instruction tuning substantially improves the performance and robustness of prompts, we also find that instruction-tuned models can be, in some sense, too robust: they are less sensitive to the semantics of the prompts than their non-instruction-tuned equivalents. Finally, models are far more sensitive to the choice of LM target words than to the meaning of the instruction templates. In sum, despite prompt-based models’ dramatic improvement in zero-shot and few-shot learning, we find limited evidence that this improvement derives from models understanding task instructions in ways analogous to humans’ use of task instructions.”
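
To make the template/target-word setup concrete, here is a minimal sketch of cloze-style NLI prompting under an instructive, an irrelevant, and a misleading template. The template strings and target-word pairs below are illustrative assumptions, not the paper's exact prompts; the rendering function is hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical prompt templates (not the paper's exact wording):
# one instructive, one irrelevant, one misleading.
TEMPLATES: Dict[str, str] = {
    "instructive": '{premise} Are we justified in saying that "{hypothesis}"? [MASK].',
    "irrelevant":  '{premise} Is this a sports article? "{hypothesis}" [MASK].',
    "misleading":  '{premise} Is the sentiment positive? "{hypothesis}" [MASK].',
}

# Example sets of LM target words mapping the mask prediction to NLI labels
# (entailment vs. non-entailment). "cat"/"dog" stands in for an arbitrary,
# semantically unrelated word pair of the kind the paper ablates.
TARGET_WORDS: Dict[str, Tuple[str, str]] = {
    "yes-no":  ("yes", "no"),
    "cat-dog": ("cat", "dog"),
}

def render(template_name: str, premise: str, hypothesis: str) -> str:
    """Fill one NLI example into a cloze-style prompt template."""
    return TEMPLATES[template_name].format(premise=premise, hypothesis=hypothesis)

if __name__ == "__main__":
    p = "A dog is running through a field."
    h = "An animal is outdoors."
    for name in TEMPLATES:
        print(f"{name:12s} -> {render(name, p, h)}")
    # A prompt-based model would then compare the probabilities of the two
    # target words (e.g., "yes" vs. "no") at the [MASK] position and predict
    # the label whose word scores higher.
```

Crossing every template with every target-word set is what yields the several-hundred-prompt grid described above; the paper's finding is that performance varies far more along the target-word axis than along the template axis.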