Does iterative prompt engineering undermine scientific validity?
When researchers repeatedly adjust prompts to get desired outputs, does this practice introduce hidden bias and produce unreplicable results? The question matters because LLM-based research is proliferating without clear methodological safeguards.
"From Prompt Engineering to Prompt Science" (2024) argues that using LLMs for scientific research through iterative prompt revision is methodologically dangerous. The standard practice — a researcher iteratively tweaks prompts until the LLM produces desired outputs — violates core scientific principles.
Three specific problems:
Individual bias and subjectivity. When a single researcher revises prompts ad hoc, personal biases shape the prompt trajectory. The researcher's expectations about what constitutes a "good" output steer the revision process, potentially embedding those expectations into the prompt without explicit awareness or documentation.
Vague or shifting criteria. Without pre-specified evaluation criteria, the definition of a "desirable outcome" drifts during prompt revision. Worse, researchers may unconsciously bend criteria to match what the LLM can produce, rather than holding the LLM to task-appropriate standards. This is a form of overfitting hypotheses to data.
Self-fulfilling prophecy. The opacity of LLMs makes feedback loops especially dangerous. A prompt revised to produce output that "looks right" may be finding a local optimum in the model's generation space that happens to align with the researcher's expectations — not because the underlying task is being solved correctly, but because the prompt has been tuned to produce the expected surface pattern.
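To make the circularity concrete, here is a toy sketch (all names are hypothetical stand-ins, not any real API): a revision loop whose stopping rule is agreement with the researcher's expectation. The loop always terminates, but what terminates it is the prompt coming to encode the prior.

```python
# Toy illustration of the self-fulfilling loop. `opaque_model` is a
# hypothetical stand-in for an LLM that is steerable by surface cues.
EXPECTED = "positive"

def opaque_model(prompt: str) -> str:
    """Stand-in model: echoes whatever label the prompt hints at."""
    return "positive" if "lean positive" in prompt else "negative"

prompt = "Classify the review's sentiment."
revisions = 0
while opaque_model(prompt) != EXPECTED:
    prompt += " These reviews usually lean positive."  # ad-hoc tweak
    revisions += 1

# The loop halts, but the agreement is manufactured: the prompt now
# encodes the very prior it appears to confirm.
print(f"converged after {revisions} revision(s): {prompt!r}")
```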
The proposed alternative adapts qualitative coding methodology: (1) at least two qualified researchers, (2) pre-specified, community-validated evaluation criteria before any prompt revision, (3) explicit discussion of individual differences and biases, (4) iterative development until inter-coder reliability (ICR) is achieved, (5) independent validation on unseen data by different researchers.
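A minimal sketch of how step 4 might be operationalized, assuming Cohen's kappa as the ICR statistic and a 0.8 bar (neither is prescribed by the paper); the labels below are invented toy data mimicking an intent-taxonomy task:

```python
# Sketch of gating prompt revision on inter-coder reliability (step 4).
# cohen_kappa_score is real scikit-learn API; everything else is toy data.
from sklearn.metrics import cohen_kappa_score

KAPPA_BAR = 0.8  # assumed threshold; community norms vary by task

def reliable(labels_a, labels_b, bar=KAPPA_BAR):
    """True when two coders agree beyond chance at the chosen bar."""
    return cohen_kappa_score(labels_a, labels_b) >= bar

# Invented labels for ten items from two human coders and the LLM.
human_a = ["nav", "info", "info", "buy", "nav", "info", "buy", "nav", "info", "buy"]
human_b = ["nav", "info", "info", "buy", "nav", "info", "buy", "nav", "buy", "buy"]
llm_out = ["nav", "info", "buy", "buy", "nav", "info", "buy", "nav", "info", "buy"]

if reliable(human_a, human_b) and reliable(human_a, llm_out):
    print("ICR reached: freeze the prompt, validate on unseen data (step 5)")
else:
    print("ICR not reached: discuss disagreements, revise criteria or prompt")
```

The design point is the ordering: the criteria and the reliability bar are fixed before any revision, so the prompt is tuned toward agreement between independent coders rather than toward one researcher's expectations.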
The practical validation: applying this pipeline to user intent taxonomy construction with GPT-4 achieved high ICR between human annotators and between humans and LLM. However, the authors note a fundamental limitation: "even with enough documentation of how the final prompt was generated, one could not ensure that the same process will yield same quality output in a different situation — be it with a different LLM or the same LLM in a different time."
Foundation Priors formalization. The Foundation Priors paper (2024) provides a formal statistical framework for these dangers. It models prompt engineering as an iterative alignment process where users minimize divergence between synthetic output and their anticipated data distribution. The self-fulfilling prophecy is, formally, epistemic circularity: the user refines until the output matches their priors, then treats the match as evidence their priors are correct. The paper introduces a trust parameter λ that governs how much weight synthetic data should receive in inference — making explicit what ad-hoc prompt engineering leaves implicit (full trust, λ=1). Read alongside "Should we treat LLM outputs as real empirical data?", the Foundation Priors framework upgrades the methodological critique here into a formal epistemic one: the problem is not just an unreliable method but miscategorized output.
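One way to see what λ does, as a hedged sketch rather than the paper's exact model: treat each synthetic observation as contributing λ of a real observation's weight in a conjugate Bayesian update. The Beta-Bernoulli form below is an assumption for illustration only.

```python
# λ-weighted inference sketch (assumed functional form, not the paper's).
# Synthetic outcomes contribute λ pseudo-counts each; real ones count fully.
def posterior_mean(real, synthetic, lam, a0=1.0, b0=1.0):
    """Posterior mean of a Bernoulli rate under a Beta(a0, b0) prior."""
    a = a0 + sum(real) + lam * sum(synthetic)
    b = b0 + (len(real) - sum(real)) + lam * (len(synthetic) - sum(synthetic))
    return a / (a + b)

real = [1, 0, 1, 1, 0]    # five real observations: rate looks like ~0.6
synth = [1] * 20          # LLM output that confirms the researcher's prior

print(posterior_mean(real, synth, lam=1.0))  # ~0.89: ad-hoc full trust
print(posterior_mean(real, synth, lam=0.1))  # ~0.67: explicit, discounted trust
```

With λ=1 the confirming synthetic data swamps the real observations; forcing λ into the model makes the trust decision explicit and contestable.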
This connects directly to the custodial shift. As "How does LLM-mediated search change what expertise requires?" argues, the expert's new responsibility includes not just prompt skill but prompt rigor. Ad-hoc prompting is the custodial equivalent of running uncontrolled experiments — it may produce useful results, but it cannot produce trustworthy ones. The shift from engineering to science mirrors the broader shift from producing knowledge to validating AI-generated knowledge.
Source: Prompts / Prompting
Related concepts in this collection
- How does LLM-mediated search change what expertise requires?
When experts search through LLMs instead of traditional inquiry, do they need fundamentally different skills? This explores whether domain knowledge alone is enough when the search itself operates on statistical patterns rather than meaningful questions.
prompt science is the methodological dimension of the custodial shift
- Can AI generate hundreds of fake academic papers automatically?
Explores whether language models can industrialize academic fraud by retroactively constructing theoretical justifications for data-mined patterns, complete with fabricated citations and creative signal names.
HARKing at scale: the self-fulfilling prophecy risk is amplified when AI generates both hypotheses and results
- Why do preference models favor surface features over substance?
Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness—features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
calibration problems in automated evaluation parallel the criteria-drift problem in prompt engineering
- Should we treat LLM outputs as real empirical data?
Can synthetic text generated by language models serve as evidence in the same way observations from the world do? This matters because researchers increasingly rely on AI-generated content without accounting for its fundamentally different epistemic status.
formal statistical framework for the epistemic circularity this note describes methodologically
Original note title
ad-hoc prompt engineering violates scientific method — producing unreliable, biased, and unreplicable research outcomes that risk self-fulfilling prophecy