Psychology and Social Cognition

Can AI agents learn people better from interviews than surveys?

Can rich interview transcripts seed more accurate generative agents than demographic data or survey responses? The question matters because the answer challenges how we build digital simulations of real people.

Note · 2026-02-22 · sourced from Personas Personality

The Generative Agent Simulations study (Park et al.) created agents for 1,052 real individuals using voice-to-voice interview transcripts averaging 6,491 words. When tested on the General Social Survey, these interview-based agents matched participants' own responses with 85% normalized accuracy — nearly as well as participants replicate their own answers two weeks later.
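The "normalized accuracy" framing divides the agent's raw agreement with a participant by that participant's own two-week test-retest agreement, so 100% means "as accurate as the person is with themselves." A minimal sketch of that normalization (the function names and toy data are illustrative assumptions, not the study's code):

```python
def agreement(a, b):
    """Fraction of questions answered identically by two response sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def normalized_accuracy(agent, wave1, wave2):
    """Agent-participant agreement scaled by the participant's
    own two-week test-retest self-consistency."""
    return agreement(agent, wave1) / agreement(wave1, wave2)

# Toy GSS-style categorical responses for one participant
wave1 = ["agree", "disagree", "agree", "neutral", "agree"]
wave2 = ["agree", "disagree", "neutral", "neutral", "agree"]  # retest two weeks later
agent = ["agree", "disagree", "agree", "neutral", "disagree"]

print(round(normalized_accuracy(agent, wave1, wave2), 2))
```

Because the denominator is the participant's imperfect self-replication, normalized accuracy can reach or exceed 1.0 even when raw agreement is well below 100%.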

The critical finding concerns what drives this accuracy. Three ablation conditions isolate the mechanism:

  1. Summary agents — bullet-pointed factual dictionaries stripping linguistic features — still achieved 83% accuracy. This means content richness, not linguistic nuance, is the primary driver.
  2. Random lesion agents — removing 80% of the interview (96 of 120 minutes) — still outperformed composite agents at 79%. Even a short interview contains enough richness.
  3. Maximal agents — adding surveys and experiments on top of interviews — showed no improvement (85%). Surveys don't add predictive power beyond what interviews already capture.
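The random-lesion condition above can be illustrated with a short sketch: drop a random 80% of the transcript while preserving the order of what remains. The chunking granularity and function name are assumptions for illustration, not the study's implementation:

```python
import random

def random_lesion(transcript_chunks, keep_fraction=0.2, seed=0):
    """Keep a random order-preserving subset of interview chunks,
    mimicking the 'random lesion' ablation (80% removed)."""
    rng = random.Random(seed)
    n_keep = max(1, round(len(transcript_chunks) * keep_fraction))
    kept_idx = sorted(rng.sample(range(len(transcript_chunks)), n_keep))
    return [transcript_chunks[i] for i in kept_idx]

# 120 one-minute interview segments; 24 survive the lesion
chunks = [f"minute-{m}" for m in range(120)]
lesioned = random_lesion(chunks)
print(len(lesioned))
```

That even this heavily lesioned input yields 79% accuracy suggests the predictive signal is spread redundantly through the interview rather than concentrated in any one passage.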

The architecture matters too: an "expert reflection" module prompts the model to generate reflections from four domain expert personas (psychologist, behavioral economist, political scientist, demographer), then routes questions to the most relevant expert. This structured multi-perspective synthesis extracts more from the same interview data than generic reflection.
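The routing idea can be sketched in a few lines. Everything here is an illustrative assumption — the keyword heuristic, prompt text, and expert keyword lists stand in for whatever the study actually used:

```python
# Hypothetical sketch of expert-reflection routing: each question is
# sent to the domain-expert persona whose keywords it best matches.
EXPERTS = {
    "psychologist": ["personality", "emotion", "wellbeing"],
    "behavioral economist": ["spending", "risk", "incentive"],
    "political scientist": ["vote", "policy", "party"],
    "demographer": ["age", "income", "household"],
}

def route_question(question: str) -> str:
    """Pick the expert persona whose keywords best match the question."""
    scores = {name: sum(kw in question.lower() for kw in kws)
              for name, kws in EXPERTS.items()}
    return max(scores, key=scores.get)

def build_prompt(question, interview, reflections):
    """Assemble a prompt pairing the routed expert's reflections
    with the raw interview transcript."""
    expert = route_question(question)
    return (f"You are a {expert} who interviewed this person.\n"
            f"Interview: {interview}\n"
            f"Your reflections: {reflections[expert]}\n"
            f"Answer as the participant would: {question}")

print(route_question("Which party did you vote for?"))
```

The design point is that reflections are generated per expert ahead of time, so at answer time the model sees a focused, pre-digested view of the interview rather than re-deriving structure from the raw transcript on every question.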

The implication challenges the dominant approach of seeding agents with demographic attributes or short persona descriptions. Those approaches achieve much lower fidelity because they provide taxonomic labels rather than the rich situational detail that interviews capture. As "Why do LLM persona prompts produce inconsistent outputs across runs?" suggests, the key difference may be that interviews provide enough specific content to anchor the model's output distribution, while short persona descriptions leave too much to the model's uncertain defaults.

However, as "How do we generate realistic personas at population scale?" cautions, even 85% fidelity at the individual level may not translate to valid population-level simulation without calibration.

A related but distinct evaluation methodology — the Turing Experiment (TE) — takes the complementary approach of replicating well-established findings from prior human-subjects research rather than predicting individual-level responses. TEs reveal a specific distortion: "hyper-accuracy," where some models (including ChatGPT and GPT-4) produce systematically more accurate crowd-wisdom estimates than a representative human sample would. This connects to "Can AI systems learn social norms without embodied experience?": LLMs can systematically exceed human accuracy on collective tasks, which paradoxically makes them worse simulacra of representative human populations. High individual accuracy can mask poor population-level representativeness.



Interview-based generative agents replicate human responses 85 percent as accurately as humans replicate themselves — content richness, not linguistic style, is the primary driver.