How well do AI personas replicate real experimental findings?
Can language models simulating human personas accurately reproduce the results of published psychology and marketing experiments? Understanding this matters for validating whether AI can substitute for human subjects in research.
The Viewpoints AI study systematically replicated 45 experiments from 14 Journal of Marketing articles (2023-2024), creating unique AI persona instances matching original sample sizes and demographics. Each persona received the exact stimuli and measures from the original study.
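The replication pipeline described above can be sketched minimally. This is not the study's actual code: the persona schema, sample sizes, and the `llm_respond` stub (which a real run would replace with a prompted LLM call) are all illustrative assumptions.

```python
import random
from dataclasses import dataclass

random.seed(0)

@dataclass
class Persona:
    age: int
    gender: str

def llm_respond(persona: Persona, stimulus: str) -> int:
    # Stub standing in for an LLM call. A real pipeline would prompt the
    # model with the persona description plus the original study's exact
    # stimulus and parse a scale response; here we return a random
    # 1-7 Likert answer so the sketch is self-contained and runnable.
    return random.randint(1, 7)

def run_condition(personas: list, stimulus: str) -> list:
    return [llm_respond(p, stimulus) for p in personas]

# Build a persona pool matching the original sample's size and
# demographics (these numbers are illustrative, not from the study).
personas = [Persona(age=random.randint(18, 65),
                    gender=random.choice(["F", "M"]))
            for _ in range(120)]

treatment = run_condition(personas[:60], "ad variant A")
control = run_condition(personas[60:], "ad variant B")
diff = sum(treatment) / len(treatment) - sum(control) / len(control)
print(f"mean difference: {diff:.2f}")
```

A real analysis would then test this mean difference for significance (e.g. a two-sample t-test) and compare the result against the original study's reported effect.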
Results by evidence strength:
- Main effects overall: 76% replicated (84/111)
- Including interaction effects: 68% (90/133)
- Strong original evidence (low p-values): high replication rate
- Marginal effects (higher p-values): declining success; both false positives and false negatives
- Non-significant original effects (p > .05): mixed; the simulation sometimes correctly identifies the absence of an effect, sometimes introduces spurious findings
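The outcome taxonomy implied by the bullets above can be made explicit. The bucket cutoffs and category names below are illustrative conventions, not the study's own coding scheme.

```python
def evidence_bucket(p_orig: float) -> str:
    """Bucket an original finding by evidence strength (illustrative cutoffs)."""
    if p_orig < 0.01:
        return "strong"
    if p_orig < 0.05:
        return "marginal"
    return "non-significant"

def classify(p_orig: float, p_sim: float, alpha: float = 0.05) -> str:
    """Compare original vs. AI-persona significance for one effect."""
    orig_sig, sim_sig = p_orig < alpha, p_sim < alpha
    if orig_sig and sim_sig:
        return "replicated"
    if orig_sig and not sim_sig:
        return "false negative"     # human effect vanishes in simulation
    if not orig_sig and sim_sig:
        return "spurious finding"   # simulation invents an effect
    return "correct absence"

print(classify(0.001, 0.02))  # strong original evidence that replicates
```

Tallying `classify` over all 133 effects, grouped by `evidence_bucket`, reproduces the pattern reported: high replication in the "strong" bucket, declining success among "marginal" effects.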
The dependence on original p-values is the key finding: LLM persona simulations function as a noisy amplifier of existing evidence. Strong effects register clearly; weak effects sit in the noise floor. Persona simulation is therefore useful for confirming robust effects but unreliable for detecting subtle ones, precisely the effects that matter most for advancing theory.
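The noisy-amplifier intuition can be demonstrated with a toy Monte Carlo. Everything here is assumed for illustration: the noise level, the detection threshold, and the effect sizes are arbitrary, chosen only to show that detection probability rises steeply with true effect size while small effects stay near chance.

```python
import random

random.seed(1)

def detected(d: float, n: int = 100, noise: float = 0.2) -> bool:
    # Toy model: the simulated study observes the true standardized
    # effect d plus persona-simulation noise, and "detects" an effect
    # when the observation clears a z-like threshold of 2 / sqrt(n).
    observed = d + random.gauss(0, noise)
    return abs(observed) > 2 / n ** 0.5

def detection_rate(d: float, trials: int = 10_000) -> float:
    return sum(detected(d) for _ in range(trials)) / trials

for d in (0.0, 0.2, 0.5, 0.8):
    print(f"effect size {d}: detection rate {detection_rate(d):.2f}")
```

Large effects are detected almost every time, small effects only around half the time, and a null effect still triggers occasional spurious detections: the same qualitative pattern as the replication results by evidence strength.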
The efficiency argument is compelling regardless: studies that took weeks can be run in minutes, potentially during a single meeting. For applied contexts — pretesting health PSAs, ad variants, social media posts — 76% main effect replication with instant turnaround may be sufficient.
However, the 24% failure rate on main effects (roughly 1 in 4 significant findings producing no difference with AI personas) means ground truth determination is unresolved. Are the human results or the AI results more representative? Since human subjects studies carry their own biases (gender, race, age, cultural context), and LLMs are trained on data containing those same biases, neither can claim definitive accuracy.
Source: Personas Personality
Related concepts in this collection
- Can AI agents learn people better from interviews than surveys?
Can rich interview transcripts seed more accurate generative agents than demographic data or survey responses? This matters because it challenges how we build digital simulations of real people.
85% individual vs 76% experimental; different simulation tasks, different fidelity levels
- How do we generate realistic personas at population scale?
Current LLM-based persona generation relies on ad hoc methods that fail to capture real-world population distributions. The challenge is reconstructing the joint correlations between demographic, psychographic, and behavioral attributes from fragmented data.
population-level bias may explain the 24% failure rate
- Can AI systems learn social norms without embodied experience?
Large language models exceed individual human accuracy at predicting collective social appropriateness judgments. Does this reveal that embodied experience is unnecessary for cultural competence, or do systematic AI failures point to limits of statistical learning?
convergent evidence: social norm prediction at the 100th percentile and 76% experimental replication both demonstrate LLMs approximating human behavioral data from text alone. But the experimental replication shows the ceiling: strong effects replicate while marginal effects are noise, suggesting statistical learning captures cultural consensus better than individual variation
Original note title: LLM persona simulations replicate 76 percent of published experimental main effects but accuracy tracks original evidence strength — marginal effects are unreliable