Can AI agents learn people better from interviews than surveys?
Can rich interview transcripts seed more accurate generative agents than demographic data or survey responses? The answer matters because it challenges how we currently build digital simulations of real people.
The Generative Agent Simulations study (Park et al.) created agents for 1,052 real individuals from voice-to-voice interview transcripts averaging 6,491 words. When tested on the General Social Survey, these interview-based agents matched participants' original responses with 85% normalized accuracy: they predicted participants' answers 85% as consistently as the participants themselves did when re-answering the same questions two weeks later.
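As a concrete illustration of the normalization, here is a minimal sketch; the function names and the toy response vectors are invented for this note, not taken from the paper's data or code:

```python
# Toy sketch of normalized accuracy: the agent's agreement with a participant's
# original answers, divided by the participant's own agreement with themselves
# when re-answering the same items two weeks later. All values are hypothetical.

def raw_accuracy(responses_a: list[int], responses_b: list[int]) -> float:
    """Fraction of categorical survey items on which two response sets agree."""
    assert len(responses_a) == len(responses_b)
    return sum(a == b for a, b in zip(responses_a, responses_b)) / len(responses_a)

def normalized_accuracy(agent: list[int], wave1: list[int], wave2: list[int]) -> float:
    """Agent-vs-participant agreement scaled by the participant's own test-retest
    agreement, so 1.0 means 'as predictable as the person is to themselves'."""
    return raw_accuracy(agent, wave1) / raw_accuracy(wave2, wave1)

# Hypothetical GSS-style categorical answers for one participant.
wave1 = [1, 3, 2, 2, 4, 1, 3, 2]   # original survey answers
wave2 = [1, 3, 2, 1, 4, 1, 3, 3]   # same person, two weeks later (6/8 agreement)
agent = [1, 3, 2, 1, 4, 2, 3, 3]   # interview-based agent's predictions (5/8 vs wave1)

print(normalized_accuracy(agent, wave1, wave2))  # 5/6 ≈ 0.83 on this toy example
```

The normalization matters because humans themselves are not perfectly consistent; scoring against the two-week retest rather than a single snapshot keeps the ceiling realistic.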
The critical question is what drives this accuracy. Three ablation conditions isolate the mechanism:
- Summary agents — bullet-pointed factual dictionaries stripping linguistic features — still achieved 83% accuracy. This means content richness, not linguistic nuance, is the primary driver.
- Random lesion agents — built after removing 80% of the interview (96 of 120 minutes) — still achieved 79% accuracy, outperforming composite agents. Even a short interview contains enough richness.
- Maximal agents — adding surveys and experiments on top of interviews — showed no improvement (85%). Surveys don't add predictive power beyond what interviews already capture.
The architecture matters too: an "expert reflection" module prompts the model to generate reflections from four domain expert personas (psychologist, behavioral economist, political scientist, demographer), then routes questions to the most relevant expert. This structured multi-perspective synthesis extracts more from the same interview data than generic reflection.
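A rough sketch of what such a pipeline could look like in code; the prompts, the `llm()` helper, and the routing logic are illustrative assumptions, not the paper's actual implementation:

```python
# Illustrative expert-reflection pipeline: each expert persona writes reflections
# over the same interview transcript, and each survey question is routed to the
# persona judged most relevant. llm() is a stand-in for any chat-completion call.

EXPERTS = ["psychologist", "behavioral economist", "political scientist", "demographer"]

def llm(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., a chat-completion endpoint)."""
    raise NotImplementedError

def build_reflections(transcript: str) -> dict[str, str]:
    """Generate one set of reflections per expert persona from the interview."""
    return {
        expert: llm(
            f"You are a {expert}. Read the interview transcript below and write "
            f"observations about this participant that are relevant to your field.\n\n"
            f"Transcript:\n{transcript}"
        )
        for expert in EXPERTS
    }

def route_question(question: str) -> str:
    """Ask the model which expert persona is best placed to answer this question."""
    choice = llm(
        f"Which of these experts is best placed to predict the participant's answer "
        f"to the question below? Reply with exactly one of {EXPERTS}.\n\n"
        f"Question: {question}"
    ).strip().lower()
    return choice if choice in EXPERTS else EXPERTS[0]

def answer_as_participant(question: str, transcript: str, reflections: dict[str, str]) -> str:
    """Answer a survey question using the transcript plus the routed expert's reflections."""
    expert = route_question(question)
    return llm(
        f"Interview transcript:\n{transcript}\n\n"
        f"Reflections from a {expert}:\n{reflections[expert]}\n\n"
        f"Answer the following survey question exactly as this participant would:\n{question}"
    )
```

The point of the structure is that the same transcript gets read several times from different disciplinary angles before any prediction is made, which is what the study credits with extracting more signal than a single generic reflection pass.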
The implication challenges the dominant approach of seeding agents with demographic attributes or short persona descriptions. Those approaches achieve much lower fidelity because they provide taxonomic labels rather than the rich situational detail that interviews capture. As "Why do LLM persona prompts produce inconsistent outputs across runs?" suggests, the key difference may be that interviews provide enough specific content to anchor the model's output distribution, while short persona descriptions leave too much to the model's uncertain defaults.
However, as "How do we generate realistic personas at population scale?" cautions, even 85% fidelity at the individual level may not translate to valid population-level simulation without calibration.
A related but distinct evaluation methodology, the Turing Experiment (TE), takes the complementary approach of replicating well-established findings from prior human-subject research rather than predicting individual-level responses. TEs reveal a specific distortion: "hyper-accuracy," where some models (including ChatGPT and GPT-4) produce systematically more accurate crowd-wisdom estimates than representative human samples would. This connects to "Can AI systems learn social norms without embodied experience?": LLMs can systematically exceed human accuracy on collective tasks, which paradoxically makes them worse simulacra of representative human populations. High individual accuracy can mask poor population-level representativeness.
Source: Personas Personality
Related concepts in this collection
- Why do LLM persona prompts produce inconsistent outputs across runs?
  Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
  Relation: unstable under thin persona prompts; interviews may provide enough anchoring content to overcome this
- Can AI systems learn social norms without embodied experience?
  Large language models exceed individual human accuracy at predicting collective social appropriateness judgments. Does this reveal that embodied experience is unnecessary for cultural competence, or do systematic AI failures point to limits of statistical learning?
  Relation: both findings show LLMs can approximate human responses without lived experience, but through different mechanisms
- Why do LLMs fail when simulating agents with private information?
  Explores whether single-model control of all social participants masks fundamental limitations in how LLMs handle information asymmetry and genuine uncertainty about others' knowledge.
  Relation: simulation fidelity measured under omniscient conditions may overstate real-world applicability
- What makes linguistic agency impossible for language models?
  From an enactive perspective, does linguistic agency require embodied participation and real stakes that LLMs fundamentally lack? This matters because it challenges whether LLMs can truly engage in language or only generate text.
  Relation: the 85% fidelity from text-only interview transcripts empirically challenges the strong embodiment requirement for social simulation; though the enactive view would note that the interview itself was an embodied interaction whose residue the text merely captures
- Can AI learn social norms better than humans?
  Explores whether large language models can predict cultural appropriateness more accurately than individual humans, and what this reveals about how social knowledge is transmitted and learned.
  Relation: complementary evidence for the same meta-argument: social norm prediction at the 100th percentile plus interview-based response replication at 85% form a capability triad showing text-based learning approximates embodied social knowledge across multiple task types
Original note title: interview-based generative agents replicate human responses 85 percent as accurately as humans replicate themselves — content richness not linguistic style is the primary driver