Why do short interviews outperform demographic labels for persona simulation?

This explores why feeding an LLM a person's actual interview transcript produces a more faithful simulation than tagging it with demographic categories (age, gender, party) — and what that gap reveals about how personas actually work.

The corpus points to a single underlying reason: it is the *content* that carries the person, not the label. In the largest direct test, agents built from two-hour voice interviews with 1,052 people replicated those people's own survey and experiment responses about 85% as well as the people themselves did when retested, and the decisive factor was factual specifics, not linguistic style. Even reducing the interview to summary bullet points kept 83% fidelity (Can AI agents learn people better from interviews than surveys?). The interview works because it hands the model concrete, individuating facts to condition on, rather than a category whose contents it has to guess.
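To make the "about 85% as well as the people themselves" figure concrete, here is a minimal sketch of one way such a normalized score could be computed: the agent's agreement with a person's first-wave answers, divided by that person's own agreement between two waves. The function names, the ratio-style normalization, and the toy answers are illustrative assumptions, not the study's actual code or data.

```python
def agreement(a, b):
    """Fraction of items on which two answer lists match."""
    if len(a) != len(b) or not a:
        raise ValueError("answer lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def normalized_accuracy(agent, wave1, wave2):
    """Agent accuracy relative to the person's own test-retest consistency.

    Assumption: normalization is a simple ratio of two agreement rates.
    """
    self_consistency = agreement(wave2, wave1)
    if self_consistency == 0:
        raise ValueError("person never repeated an answer; ratio undefined")
    return agreement(agent, wave1) / self_consistency


# Hypothetical answers to ten multiple-choice items.
wave1 = ["a", "b", "a", "c", "b", "a", "c", "b", "a", "b"]  # person, first session
wave2 = ["a", "b", "a", "c", "b", "a", "c", "b", "b", "a"]  # same person, retest
agent = ["a", "b", "a", "c", "b", "a", "b", "c", "b", "a"]  # simulated persona

score = normalized_accuracy(agent, wave1, wave2)
print(f"normalized accuracy: {score:.2f}")
```

Dividing by the retest rate matters because people do not perfectly reproduce their own answers, so raw agreement with a single session understates how close the agent is to the achievable ceiling.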

The failure of demographic labels is documented just as sharply from the other side. Conditioning LLMs on participant profiles across 208,021 people produced *no meaningful gain* in predicting any specific individual's choices (Does conditioning LLMs on personal profiles improve prediction?). This matters because a demographic label is a marginal: it tells you which population a person belongs to, not where they sit inside it. Population-scale persona work shows that the true joint distribution of real people's attributes cannot be recovered from marginal demographic data, which is exactly why label-based simulation produces systematic biases in tasks like election forecasting (How do we generate realistic personas at population scale?).
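The point about marginals can be illustrated with a toy example: two populations with identical age and party marginals but very different joint structure. All attribute names and probabilities below are made up for illustration and are not drawn from any of the cited studies.

```python
from collections import defaultdict


def marginals(joint):
    """Collapse a joint distribution over (age, party) into its two marginals."""
    age, party = defaultdict(float), defaultdict(float)
    for (a, p), prob in joint.items():
        age[a] += prob
        party[p] += prob
    return dict(age), dict(party)


def conditional_party(joint, age_value, party_value):
    """P(party = party_value | age = age_value) under the given joint."""
    age_total = sum(prob for (a, _), prob in joint.items() if a == age_value)
    return joint[(age_value, party_value)] / age_total


# Hypothetical population 1: age and party are independent.
independent = {
    ("young", "A"): 0.25, ("young", "B"): 0.25,
    ("old", "A"): 0.25, ("old", "B"): 0.25,
}
# Hypothetical population 2: age fully determines party.
correlated = {
    ("young", "A"): 0.5, ("young", "B"): 0.0,
    ("old", "A"): 0.0, ("old", "B"): 0.5,
}

print(marginals(independent) == marginals(correlated))  # same marginals
print(conditional_party(independent, "young", "A"))     # young people split evenly
print(conditional_party(correlated, "young", "A"))      # young people all choose A
```

Anything that only sees the marginals cannot tell these two populations apart, yet they imply opposite predictions for a given young respondent. This is the sense in which a label names a group without locating the individual inside it.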

There's a deeper mechanism underneath. When a persona prompt is thin, the model fills the gap with its own uncertainty: running the *same* persona prompt repeatedly produces output variance that matches or exceeds the variance between *different* personas, which means model noise, not stable social knowledge, is driving the answer (Why do LLM persona prompts produce inconsistent outputs across runs?). A demographic label is precisely such a thin prompt. An interview transcript is dense enough to pin the model down, leaving less room for that uncertainty to take over.
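One simple way to check for this failure mode is to compare run-to-run variance within each persona against the variance between persona means. The sketch below assumes each run yields a single numeric rating on a 1 to 5 scale; the persona names and ratings are hypothetical, and the cited work may measure variance differently.

```python
from statistics import mean, pvariance


def variance_split(runs_by_persona):
    """Return (mean within-persona variance, variance of persona means)."""
    within = mean(pvariance(scores) for scores in runs_by_persona.values())
    between = pvariance([mean(scores) for scores in runs_by_persona.values()])
    return within, between


# Hypothetical ratings from five repeated runs of each label-only persona prompt.
runs = {
    "persona_a": [1, 5, 3, 4, 2],
    "persona_b": [2, 4, 3, 5, 2],
    "persona_c": [4, 2, 5, 1, 2],
}

within, between = variance_split(runs)
print(f"within-persona variance:  {within:.3f}")
print(f"between-persona variance: {between:.3f}")
if within >= between:
    print("Run-to-run noise dominates: the persona prompt is not pinning the model down.")
```

If the within-persona figure is as large as or larger than the between-persona figure, swapping one persona for another changes the output less than simply rerunning the same prompt does.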

This reframes 'persona' from a slot you select to a record you ground in. The same lesson recurs across the collection under different terms. Stakeholder personas extracted from real domain documents generalize across evaluation tasks better than hand-assigned roles (Can personas extracted from documents generalize across evaluation tasks?). PersonaAgent finds that personas built from a user's actual recent interactions cluster into genuinely user-specific regions of latent space: real separation, not the generic drift you get from a label (Can personas evolve in real time to match what users actually want?). Notably, where persona simulation *does* succeed at population scale, replicating 84 of 111 (76%) published experimental main effects, its success tracks the strength of the underlying evidence, not demographic precision (Can AI personas reliably replicate human experiment results?).

The thing worth taking away: the interview's advantage isn't that it's longer or more 'realistic' — it's that simulating a specific person is a retrieval problem, not a categorization one. You can't deduce an individual from the group they belong to, so anything that supplies their actual particulars beats anything that only names their category.

Sources (7 notes)

Can AI agents learn people better from interviews than surveys?

A 1,052-person study found agents built from voice interviews replicated participant responses nearly as well as people replicate their own answers. Factual content, not linguistic style, drove this accuracy—even summary bullet points retained 83% fidelity.

Does conditioning LLMs on personal profiles improve prediction?

Across 208,021 participants in the Psych-201 dataset, conditioning LLMs on participant profiles did not meaningfully improve predictions for specific individuals. The standard technique for individuation produces no measurable gains in person-level forecasting.

How do we generate realistic personas at population scale?

LLM persona generation produces systematic biases in downstream tasks like election forecasting because it relies on heuristic techniques that cannot recover true joint distributions from marginal data. Solving this requires benchmarks, training datasets, and structured frameworks analogous to ImageNet.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Can personas extracted from documents generalize across evaluation tasks?

MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.

Can personas evolve in real time to match what users actually want?

PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.

Can AI personas reliably replicate human experiment results?

Viewpoints AI reproduced 84 of 111 main effects from Journal of Marketing experiments with replication success strongly correlated to original p-value strength. Marginal effects showed unreliable performance with both false positives and negatives.
