Using Large Language Models to Create AI Personas for Replication and Prediction of Media Effects: An Empirical Test of 133 Published Experimental Research Findings
Our LLM replications successfully reproduced 76% of the original main effects (84 out of 111), demonstrating strong potential for AI-assisted replication of studies in which people respond to media stimuli. When including interaction effects, the overall replication rate was 68% (90 out of 133). The use of LLMs to replicate and accelerate marketing research on media effects is discussed with respect to the replication crisis in social science, potential solutions to generalizability problems in sampling subjects and experimental conditions, and the ability to rapidly test consumer responses to various media stimuli.
Specifically, we systematically reviewed all articles in the Journal of Marketing published from January 2023 through May 2024. This resulted in an initial corpus of 69 papers containing 210 unique studies. The Journal of Marketing was chosen for this initial test of AI replication accuracy because it frequently publishes tests of media message effectiveness (consistent with our interest in commercial and theoretical work about media psychology), and because journal policies encourage detailed reporting of measures, sampling, and the actual visual and textual materials used in studies.
We then reviewed this sample of candidate studies to evaluate suitability for AI-assisted replication. We applied the following inclusion criteria: (1) the study had to be a true experiment incorporating manipulated study conditions (not simply a survey that scored participant attitudes or beliefs, or a study that compared correlations between individual difference variables and outcomes); (2) it had to be compatible with the features of our AI software (e.g., manipulating stimuli between conditions and presenting all questions at the end); (3) all original study materials (i.e., stimuli and measures) needed to be provided by the authors or otherwise publicly available; and (4) the study procedures or outcomes could not require physical actions or behavioral measures (e.g., eye-tracking, monitoring of subsequent purchasing behavior). In essence, for this initial test, we selected experiments that could typically be conducted through online recruitment platforms like Mechanical Turk or Prolific. This selection process resulted in a final sample of 45 studies from 14 distinct research articles.
Viewpoints AI is software designed to test AI responses to different versions of multimodal media. The software allows researchers to input various media stimuli (images, videos, or text), organize them into experimental conditions, specify participant characteristics, and define survey questions and scales. The system then generates responses from AI participants based on these parameters. For each study, a series of unique LLM instantiations—one for each virtual persona—is created on the fly (i.e., in real time as the study runs) to exactly match the sample distributions, characteristics, and context reported in the actual study. Each persona is then given the exact text, image, and/or video stimulus used in the original study to view, along with all other original study instructions. The creation of a unique AI instance for each virtual persona differentiates Viewpoints AI from other attempts to use AI to answer questions in social science research.
Our software then constructed a prompt that instructed the LLM to i) embody the assigned persona, ii) examine the presented stimuli (which could include text, images, videos, or any combination thereof), and iii) respond to the subsequent questions. The question wordings and response scales provided to the generated participants were directly transplanted from the original experiments, maintaining fidelity to the source material. This approach allowed for flexibility in accommodating various question types and scale formats, ranging from open-ended queries (e.g., “What is the highest price you would be willing to pay for this product?”) to Likert-style scales of varying points (e.g., 1 = very unlikely, 7 = very likely).
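The persona-and-prompt workflow described above can be sketched as follows. This is an illustrative reconstruction, not Viewpoints AI's actual implementation: the field names, prompt wording, and demographic distributions are all hypothetical.

```python
import random

def make_persona(demographics: dict) -> dict:
    """Sample one virtual participant from the original study's
    reported sample distributions (trait -> {level: proportion})."""
    return {trait: random.choices(list(dist), weights=list(dist.values()))[0]
            for trait, dist in demographics.items()}

def build_prompt(persona: dict, stimulus: str, questions: list) -> str:
    """Assemble the three-part instruction: (i) embody the persona,
    (ii) examine the stimulus, (iii) answer the original questions."""
    traits = ", ".join(f"{k}: {v}" for k, v in persona.items())
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    return (f"You are a study participant with these characteristics: {traits}.\n"
            f"Examine the following stimulus:\n{stimulus}\n"
            f"Then answer each question in character:\n{numbered}")

# Illustrative sample distributions (made-up values, not from any study)
demographics = {"age_group": {"18-29": 0.3, "30-44": 0.4, "45+": 0.3},
                "gender": {"female": 0.5, "male": 0.5}}
persona = make_persona(demographics)
prompt = build_prompt(persona, "[simple package design image]",
                      ["How attractive is this design? (1 = not at all, 7 = very)"])
```

Sampling each persona independently from the reported marginal distributions is the simplest approach; matching joint distributions (e.g., age crossed with gender) would require the original study's crosstabs.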
Figure 1 (next page) illustrates a case study replicating an experiment on packaging design effects from Study 1a in Ton et al., 2024. This study examined how complex versus simple packaging designs influenced consumer perceptions across four dependent variables (DVs): willingness to pay, few-ingredients inferences, perceived product purity, and design attractiveness. We replicated the original experimental conditions using Viewpoints AI, generating 362 AI personas to match the original sample size and characteristics. The personas were presented with the same stimuli (complex or simple package designs) and responded to identical measures as in the human study.
Not all original study results replicated similarly, however, with implications for capabilities and limitations of LLMs for consumer behavior research. First, the high reliability of LLMs in replicating strongly significant findings suggests their value for confirming robust effects. Second, the observed decline in replication success as p values increase emphasizes the critical role of the original evidence's strength in the interpretation of LLM-based replications. Stronger original evidence is more likely to be successfully replicated by LLMs, which is an important consideration for researchers relying on these models. Third, the mixed performance of LLMs on findings with marginal or non-significant p values raises concerns about AI sensitivity to subtle effects. This variability suggests that LLMs may be prone to both false positives and false negatives, indicating a potential risk when using them to detect or confirm less pronounced effects. Last, the balanced replication outcomes for p values above 0.5 reveal that while LLMs may sometimes accurately identify the absence of an effect, they also risk introducing spurious findings.
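The pattern described above, replication success declining as original p values rise, can be summarized by binning findings on the original p value and computing the replication rate per bin. The sketch below uses made-up findings purely to illustrate the computation; it is not the paper's data.

```python
from collections import defaultdict

# Hypothetical findings: each has the original study's p value and
# whether the AI-persona replication reproduced the result.
findings = [
    {"orig_p": 0.001, "replicated": True},
    {"orig_p": 0.004, "replicated": True},
    {"orig_p": 0.03,  "replicated": True},
    {"orig_p": 0.04,  "replicated": False},
    {"orig_p": 0.08,  "replicated": False},
    {"orig_p": 0.60,  "replicated": True},
]

def p_bin(p: float) -> str:
    """Assign a finding to a bin by the strength of the original evidence."""
    if p < 0.01: return "p < .01"
    if p < 0.05: return ".01 <= p < .05"
    if p < 0.50: return ".05 <= p < .50"
    return "p >= .50"

counts = defaultdict(lambda: [0, 0])  # bin -> [n replicated, n total]
for f in findings:
    b = p_bin(f["orig_p"])
    counts[b][0] += f["replicated"]
    counts[b][1] += 1

rates = {b: r / n for b, (r, n) in counts.items()}
```

With real data, the per-bin totals matter as much as the rates, since sparse bins (e.g., very few marginal findings) make the replication rate unstable.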
A revolution in research efficiency could depend primarily on predictive rather than explanatory advancements. But regardless of whether AI can explain how and why humans reason about media messages, if AI can make an accurate prediction about results, using any path available in the LLM, the nature of research could be dramatically changed.
Much of applied research is interested in specific tests of alternatives for persuasive messages, often in the context of message design exercises that proceed quickly and without significant resources. For example, researchers might want to pretest versions of a health PSA designed to change behavior, TV advertisements designed to solidify brands, or social media posts seeking to promote clicks. In any of these applied settings, new studies using AI personas could be conducted in minutes, maybe even during the one-hour meeting where designers propose, test, and select a finalist message.
As media psychologists, we volunteer to link theories with characteristics of media: for example, the pacing of presentations, visual versus textual emphases, and interactive potential. But those features constantly change. Currently, much of media psychology theory is based on media of decades past, especially television. New technologies dramatically shift which theories should be developed, so that we are studying the most essential features of the stimuli that ground our interests.
Thus, while virtual participants may not permit substantive replication of interaction effects reported in the existing literature, such an approach appears no worse than what can be gleaned through replication with human subjects and may actually permit a faster, less expensive route towards assessing the relative validity of previously reported moderation findings.
About 1 in 4 of the statistically significant main effects we reviewed resulted in no significant differences when using AI personas.
Reconciling the differences is not (yet) straightforward. When human subjects and AI personas diverge, there are two competing arguments for which results represent the more accurate characterization of an effect. So far in the literature, ground truth is mostly the province of results based on human data. We know, however, that there are multiple critiques of human subjects studies, not necessarily raised in the context of AI alternatives, that limit conclusions from those studies. These critiques prominently include biases associated with gender, race, age, and cultural context. Since LLMs are trained on information that also contains those biases, it is possible (and even likely) that the biases are transferred into the AI models. These biases might be made less influential in the AI models; for example, using our replication tool, studies could be rerun with samples of different (and hard-to-acquire) demographic backgrounds. More inclusive AI samples might even flip the ground-truth assumption, making the diversity that AI makes possible the gold standard.
AI represents one of the first major technology developments to flip the usual progression of research from university labs to technology companies. For AI, it is the technology companies who now own access to the models, while outside labs try to understand how those models work.
We believe the efficiency rationale for the pursuit of AI personas is compelling.