From Prompt Engineering to Prompt Science With Human in the Loop

Paper · arXiv 2401.04122 · Published January 1, 2024
Prompts · Prompting · Social Theory · Society

Large Language Models (LLMs) have, in recent years, become sophisticated and capable enough to be applicable to many situations and tasks. These tasks are not limited to information extraction and synthesis [13], but also extend to analysis, creation, and reasoning [6]. Unsurprisingly, many researchers have found uses for them in various research tasks. LLMs are being used for identifying relevant papers [17], synthesizing literature reviews [3], writing proposals [9], and analyzing data [30]. They have been found effective and useful in investigative tasks such as drug discovery [24, 34]. While such capabilities and the wide applicability of LLMs have opened up new avenues for supporting research, there is a growing concern that a large portion of this success hinges on prompt engineering, which is often an ad-hoc method of revising the prompts fed into an LLM to achieve a desired response or analysis [23].

Using such a process for scientific research could be dangerous. At best, there is a possibility of creating a feedback loop with a self-fulfilling prophecy. At worst, one may be overfitting hypotheses to data, leading to untrustworthy claims and findings. In most cases, over-reliance on prompt engineering could lead to unexplainable, unverifiable, and less generalizable outcomes [18, 26], lacking a scientific approach to research.

Here, we consider a scientific approach to be one that is well-documented and transparent, verifiable, repeatable, and free of individual biases and subjectivity. We argue that most prompt engineering techniques do not lead to a scientific approach to research, as they violate at least one of these principles. While such ad-hocness can be bad in most situations, it is especially problematic in scientific research, which expects a certain level of rigor.

If we want to continue harnessing the power of LLMs, we need to do so in a responsible manner with sufficient scientific rigor. For the purposes of this work, we will consider research areas and applications where an LLM is used for labeling text data.

Much of what an LLM generates depends heavily on the inner workings of its architecture and the data used for training. Most users and researchers do not have the ability to directly change that architecture or the training process. However, they can apply techniques such as fine-tuning [8, 10] and prompt engineering [32] to shape the outputs as they desire.
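To make the labeling setting concrete, the following is a minimal sketch of how a prompt might be used to label text data with an LLM. It assumes the OpenAI Python SDK, an API key in the environment, and an illustrative label set and prompt template; none of these are specified in the paper.

```python
# Minimal sketch of prompt-based text labeling (illustrative, not the paper's exact setup).
# Assumes the OpenAI Python SDK and OPENAI_API_KEY set in the environment;
# the model name, label set, and prompt template are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

LABELS = ["informational", "transactional", "navigational"]  # hypothetical label set

PROMPT = "Assign exactly one label from {labels} to the following text:\n{text}"

def label_text(text: str) -> str:
    """Ask the LLM to assign one label from LABELS to a piece of text."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model identifier
        messages=[{"role": "user", "content": PROMPT.format(labels=LABELS, text=text)}],
        temperature=0,  # reduce run-to-run variability in labels
    )
    return response.choices[0].message.content.strip()

# Example call:
# print(label_text("How do I renew my passport?"))
```

Any change to the prompt template above is exactly the kind of revision that prompt engineering iterates on, which is why the revision process itself needs scrutiny.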

Figure 1 shows a conceptual flow of a typical prompt revision or engineering process in a research setting. Here, a researcher is interested in getting some input data analyzed or labeled for a downstream application. To get the desired outcomes, the set of instructions, or the prompt, provided to the LLM is iteratively revised. This approach has been used by many recent works such as [6, 14, 29, 37]. While it could accomplish the task, it leaves us with many potential issues. For instance, if the prompt is revised by a single researcher in an ad-hoc manner just to fine-tune the output, personal biases and subjectivity may be introduced. In addition, if this process is not well-documented, it may be hard to understand or mitigate these issues. This may also make it harder for someone to verify or replicate the process. These issues may hinder scientific progress, which relies on systematic knowledge being generated, questioned, and shared. There may also be issues of self-fulfilling prophecies or feedback loops, which are especially dangerous when we are dealing with highly opaque systems such as an LLM.

When using LLMs for scientific research, we need some level of reliability, objectivity, and transparency in the process. Specifically, there are three problems we need to address: (1) individual bias and subjectivity; (2) vague criteria for assessment, or bending the criteria to match the LLM's abilities; and (3) a narrow focus on a specific application instead of a general phenomenon. For this, we propose the following steps.

(1) Ensure that at least two qualified researchers are involved in the whole process.

(2) Before focusing on revising the prompt to produce desirable outcomes, clearly and objectively articulate what makes a desirable outcome. This will involve establishing a reliable and, if possible, community-validated metric for assessment.

(3) Discuss individual differences and biases with the goal of rooting them out, and create assessments and a set of instructions that can be understood and carried out by any researcher with a reasonable set of skills, so that they can replicate the experiments and achieve similar results. A sketch of this revision loop is given below.
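The following is a hedged sketch of how steps (1) through (3) could be combined into a single revision loop. The helper functions (run_llm, assess, agreement, revise) are hypothetical placeholders standing in for the LLM call, each researcher's assessment against the pre-agreed criteria, an inter-rater statistic, and the jointly documented prompt revision; the agreement threshold is an assumption.

```python
# Hedged sketch of the proposed human-in-the-loop prompt revision.
# run_llm, assess, agreement, and revise are hypothetical callables, not a real API.
def revise_prompt_with_two_researchers(prompt, sample_data, run_llm,
                                       assess, agreement, revise,
                                       threshold=0.7, max_rounds=10):
    """Iterate prompt revision until two researchers' independent assessments agree.

    run_llm(prompt, data)  -> LLM output for the sample
    assess(output, coder)  -> that coder's assessment against pre-defined criteria
    agreement(a, b)        -> inter-rater agreement statistic (e.g., Cohen's kappa)
    revise(prompt, a, b)   -> jointly revised prompt after documented discussion
    """
    for _ in range(max_rounds):
        output = run_llm(prompt, sample_data)
        # Step (1): two qualified researchers assess independently.
        scores_a = assess(output, coder="researcher_A")
        scores_b = assess(output, coder="researcher_B")
        # Step (2): agreement is judged against a metric fixed before revision began.
        if agreement(scores_a, scores_b) >= threshold:
            return prompt  # criteria met; freeze the prompt for the testing phase
        # Step (3): discuss and document disagreements, then revise the prompt.
        prompt = revise(prompt, scores_a, scores_b)
    raise RuntimeError("No acceptable agreement reached within max_rounds")
```

The key design choice is that the stopping condition depends on agreement between independent human assessments against a pre-defined metric, not on a single researcher's impression of the output.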

We identified that one of the problems with this endeavor is that it is quite ad-hoc, unreliable, and not generalizable. Even with sufficient documentation of how the final prompt was generated, one cannot ensure that the same process will yield output of the same quality in a different situation – be it with a different LLM or with the same LLM at a different time.

As shown, in a typical qualitative coding process two researchers, who are sufficiently familiar with the task, use the initial codebook and a small set of data to independently provide labels during the training portion. The process continues until there is good enough agreement, or inter-coder reliability (ICR). Once that point is reached, the codebook is considered ready for use. It can then be used by the same or other researchers to independently label non-overlapping sets of unseen data during the testing portion.
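As an illustration of the training-portion agreement check, the snippet below computes Cohen's kappa, one common ICR statistic (the paper does not specify which statistic is used), over hypothetical labels from two coders, with an assumed acceptance threshold.

```python
# Illustrative ICR check for the training portion, using Cohen's kappa.
# The labels and the 0.7 threshold are hypothetical examples.
from sklearn.metrics import cohen_kappa_score

coder_1 = ["informational", "transactional", "informational", "navigational", "informational"]
coder_2 = ["informational", "transactional", "navigational", "navigational", "informational"]

kappa = cohen_kappa_score(coder_1, coder_2)
print(f"Cohen's kappa: {kappa:.2f}")

# A pre-agreed threshold decides whether the codebook is ready for the testing
# portion or needs another round of training and discussion.
if kappa >= 0.7:
    print("Codebook ready: coders may now label unseen data independently.")
else:
    print("Agreement too low: revisit the codebook and repeat training.")
```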

LLMs could help in constructing and applying taxonomies, but they face the same challenges described in Section 1 – namely, a lack of reliability and validation, a danger of creating feedback loops, and issues with replicability and generalizability. To overcome these issues, the author, with the help of a set of collaborators, applied the method proposed here. The following are the specific steps executed based on the pipeline presented in Figure 4 and described in Section 3.2.

Phase 1: Set up the initial pipeline. A pipeline that included an initial prompt and a small sample of log data from Bing Chat that could create a user intent taxonomy (zero-shot approach) was set up for initial testing.

Phase 2: Identify appropriate criteria to evaluate the responses. We identified five criteria – comprehensiveness, consistency, clarity, accuracy, and conciseness – for evaluating the quality of a user intent taxonomy, using prior work [20] as well as human assessments, as shown in Figure 4.

Phase 3: Iteratively develop the prompt. We first ran the zero-shot approach to obtain an initial taxonomy. Two researchers familiar with the task of creating and using a user intent taxonomy assessed the generated taxonomy against those five criteria. Appropriate revisions were made to the original prompt, and a new sample of data was passed through the LLM (here, GPT-4) to create a new version of the taxonomy. The researchers repeated the process until good agreement (inter-coder reliability, or ICR [7]) was obtained.

Phase 4: Validate the whole pipeline. Once the taxonomy was finalized (see details in [28]), we proceeded with the testing phase of the pipeline. In this phase, the LLM was given new data and asked to use the human-validated taxonomy to label user intents. A set of two different researchers independently assigned their own labels using the same taxonomy. We then compared the labels among the human annotators as well as between the humans and the LLM. We found sufficiently high ICR for all of them.
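The Phase 4 comparison can be pictured as a pairwise agreement check between the two human annotators and the LLM, all using the finalized taxonomy. The sketch below shows that idea with hypothetical intent labels; the statistic (Cohen's kappa) and the notion of a per-pair threshold are assumptions, not details taken from the paper.

```python
# Sketch of the Phase 4 validation step: pairwise agreement between two human
# annotators and the LLM on new data, all using the finalized taxonomy.
# Labels are hypothetical; the statistic and threshold are assumptions.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

labels = {
    "human_1": ["learn", "compare", "learn", "troubleshoot"],
    "human_2": ["learn", "compare", "create", "troubleshoot"],
    "llm":     ["learn", "compare", "learn", "troubleshoot"],
}

# Compare every pair of annotators (human-human and human-LLM).
for a, b in combinations(labels, 2):
    kappa = cohen_kappa_score(labels[a], labels[b])
    print(f"{a} vs {b}: kappa = {kappa:.2f}")

# The pipeline is considered validated only if every pair clears the agreed threshold.
```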