
Does iterative prompt engineering undermine scientific validity?

When researchers repeatedly adjust prompts to get desired outputs, does this practice introduce hidden bias and produce unreplicable results? The question matters because LLM-based research is proliferating without clear methodological safeguards.

Note · 2026-03-28 · sourced from Prompts Prompting
How do you build domain expertise into general AI models? What kind of thing is an LLM really?

"From Prompt Engineering to Prompt Science" (2024) argues that using LLMs for scientific research through iterative prompt revision is methodologically dangerous. The standard practice — a researcher iteratively tweaks prompts until the LLM produces desired outputs — violates core scientific principles.

Three specific problems:

Individual bias and subjectivity. When a single researcher revises prompts ad hoc, personal biases shape the prompt trajectory. The researcher's expectations about what constitutes a "good" output steer the revision process, potentially embedding those expectations into the prompt without explicit awareness or documentation.

Vague or shifting criteria. Without pre-specified evaluation criteria, the definition of a "desirable outcome" drifts during prompt revision. Worse, researchers may unconsciously bend criteria to match what the LLM can produce, rather than holding the LLM to task-appropriate standards. This is a form of overfitting hypotheses to data.

Self-fulfilling prophecy. The opacity of LLMs makes feedback loops especially dangerous. A prompt revised to produce output that "looks right" may be finding a local optimum in the model's generation space that happens to align with the researcher's expectations — not because the underlying task is being solved correctly, but because the prompt has been tuned to produce the expected surface pattern.

The proposed alternative adapts qualitative coding methodology: (1) at least two qualified researchers, (2) pre-specified, community-validated evaluation criteria before any prompt revision, (3) explicit discussion of individual differences and biases, (4) iterative development until inter-coder reliability (ICR) is achieved, (5) independent validation on unseen data by different researchers.
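Step (4)'s inter-coder reliability is typically quantified with an agreement statistic such as Cohen's kappa, which corrects raw agreement for chance. A minimal sketch (the example labels and the self-contained implementation are illustrative, not from the paper):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: fraction of items both coders labeled identically.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[lbl] * freq_b[lbl] for lbl in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative: two coders labeling the same ten LLM outputs.
a = ["pos", "pos", "neg", "pos", "neg", "pos", "pos", "neg", "pos", "neg"]
b = ["pos", "pos", "neg", "pos", "pos", "pos", "pos", "neg", "pos", "neg"]
kappa = cohens_kappa(a, b)  # 9/10 raw agreement, ~0.78 after chance correction
```

Prompt revision then iterates only until kappa clears a pre-agreed threshold, rather than until outputs "look right" to one researcher.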

The practical validation: applying this pipeline to user intent taxonomy construction with GPT-4 achieved high ICR between human annotators and between humans and LLM. However, the authors note a fundamental limitation: "even with enough documentation of how the final prompt was generated, one could not ensure that the same process will yield same quality output in a different situation — be it with a different LLM or the same LLM in a different time."

Foundation Priors formalization. The Foundation Priors paper (2024) provides a formal statistical framework for these dangers. It models prompt engineering as an iterative alignment process where users minimize divergence between synthetic output and their anticipated data distribution. The self-fulfilling prophecy is, formally, epistemic circularity: the user refines until the output matches their priors, then treats the match as evidence their priors are correct. The paper introduces a trust parameter λ that governs how much weight synthetic data should receive in inference — making explicit what ad-hoc prompt engineering leaves implicit (full trust, λ=1). Since Should we treat LLM outputs as real empirical data?, the Foundation Priors framework upgrades the methodological critique here into a formal epistemic one: the problem is not just unreliable method but miscategorized output.
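The role of λ can be sketched as a weighted estimator, where each synthetic observation counts as λ real observations. This is an illustrative reduction of the idea, not the paper's exact formulation:

```python
def trust_weighted_mean(real_data, synthetic_data, lam):
    """Combine real and synthetic observations, down-weighting each synthetic
    point by trust parameter lam in [0, 1]. lam=1 reproduces the implicit
    full-trust stance of ad-hoc prompt engineering; lam=0 discards synthetic
    data entirely.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    total_weight = len(real_data) + lam * len(synthetic_data)
    return (sum(real_data) + lam * sum(synthetic_data)) / total_weight

# Illustrative numbers: LLM outputs pull the estimate away from the real data.
est_full = trust_weighted_mean([1.0, 2.0, 3.0], [10.0, 10.0], 1.0)  # 5.2
est_none = trust_weighted_mean([1.0, 2.0, 3.0], [10.0, 10.0], 0.0)  # 2.0
```

Making λ an explicit parameter forces the researcher to state how much the synthetic output is trusted, instead of silently assuming λ=1.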

This connects directly to the custodial shift. Since How does LLM-mediated search change what expertise requires?, the expert's new responsibility includes not just prompt skill but prompt rigor. Ad-hoc prompting is the custodial equivalent of running uncontrolled experiments — it may produce useful results, but it cannot produce trustworthy ones. The shift from engineering to science mirrors the broader shift from producing knowledge to validating AI-generated knowledge.




ad-hoc prompt engineering violates scientific method — producing unreliable, biased, and unreplicable research outcomes that risk self-fulfilling prophecy