LLM Generated Persona is a Promise with a Catch

Paper · arXiv 2503.16527 · Published March 18, 2025
Personas · Personality · Human-Centered Design

The use of large language models (LLMs) to simulate human behavior has gained significant attention, particularly through personas that approximate individual characteristics. Persona-based simulations hold promise for transforming disciplines that rely on population-level feedback, including social science, economic analysis, marketing research, and business operations. Traditional methods of collecting realistic persona data face significant challenges: they are prohibitively expensive, logistically constrained by privacy requirements, and often fail to capture multi-dimensional attributes, particularly subjective qualities. Consequently, synthetic persona generation with LLMs offers a scalable, cost-effective alternative. However, current approaches rely on ad hoc and heuristic generation techniques that do not guarantee methodological rigor or simulation precision, resulting in systematic biases in downstream tasks. Through extensive large-scale experiments, including presidential election forecasts and general opinion surveys of the U.S. population, we reveal that these biases can lead to significant deviations from real-world outcomes. Based on the experimental results, this position paper argues that a rigorous and systematic science of persona generation is needed to ensure the reliability of LLM-driven simulations of human behavior. We call not only for methodological innovations and empirical foundations but also for interdisciplinary organizational and institutional support for the development of this field. To support further research and development in this area, we have open-sourced approximately one million generated personas, available for public access and analysis at Tianyi-Lab/Personas.

We build on prior literature to construct three distinct types of personas: Meta Personas, Tabular Personas, and Descriptive Personas.

Identifying essential information needed in a persona. A foundational challenge in persona-based simulation is identifying the essential information required for effective persona generation and how that information should be represented. The goal is to move beyond simply listing attributes to understanding what truly drives realistic simulation outcomes. Existing research offers conflicting evidence. While Argyle et al. [2023], Park et al. [2024], Salewski et al. [2023], and Toubia et al. [2025] demonstrated that well-crafted conditioning can enable LLMs to simulate opinions aligned with real human responses, other studies, such as Hu and Collier [2024], Gupta et al. [2023], Zheng et al. [2024], Beck et al. [2024], and our own experiments, raise concerns about the efficacy and potential pitfalls of persona-based simulations. This discrepancy underscores the need to identify the crucial elements of effective persona-driven simulations: which attributes matter most, whether demographic, psychographic (e.g., personality traits, values, attitudes, interests, lifestyles), behavioral (e.g., past actions, online activity), or contextual (e.g., social environment, current events), and the optimal format and prompting strategies for presenting that information to the LLM.
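To make the formatting question concrete, the same underlying persona record can be presented to an LLM in different styles, echoing the tabular and descriptive persona types above. The sketch below is illustrative only; the attribute names and wording are hypothetical and not taken from the paper:

```python
def as_tabular(p: dict) -> str:
    """Render a persona as a key: value attribute list (tabular style)."""
    return "\n".join(f"{k}: {v}" for k, v in p.items())

def as_descriptive(p: dict) -> str:
    """Render the same attributes as a short first-person narrative
    (descriptive style). Template wording is a hypothetical example."""
    return (f"You are a {p['age']}-year-old {p['occupation']} from {p['state']} "
            f"with {p['education']} education who values {p['values']}.")

# Hypothetical persona record mixing demographic and psychographic attributes.
persona = {
    "age": 34,
    "occupation": "nurse",
    "state": "Ohio",
    "education": "a bachelor's",
    "values": "community and job security",
}

tabular_prompt = as_tabular(persona)
descriptive_prompt = as_descriptive(persona)
```

Whether such surface differences in presentation change simulation fidelity, and which attribute families carry the most signal, is exactly the open empirical question this section raises.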

Calibrating LLM-generated personas toward real populations. A parallel direction lies in accurately reconstructing realistic joint distributions of persona attributes from fragmented data sources, and subsequently calibrating these distributions to match a specific target population [Valliant et al., 2013]. Even if we identify the crucial attributes for a given simulation, generating a population of personas requires sampling from the correct distributions. Existing datasets, such as the U.S. Census, often provide only marginal distributions of individual attributes (e.g., age, income, education level). Marginals alone do not determine the joint distribution, so sampling attributes independently misses real-world correlations (e.g., between income and education). While Castricato et al. [2024] offer a first step by using LLMs to filter out invalid attribute combinations sampled from marginal distributions, their method does not fully generalize to calibrating real-world joint distributions. Therefore, a crucial research direction is the development of robust sampling and calibration methods that can combine fragmented data and LLMs to accurately recover any target population.

Open-source benchmark and datasets. To accelerate progress toward a science of persona, we propose the creation of a large-scale, open-source benchmark dataset of rich and realistic persona profiles. Analogous to ImageNet [Deng et al., 2009], a large-scale and open benchmark for persona generation would serve as a crucial resource for the research community. Specifically, this dataset would serve the following purposes: (i) a benchmark for evaluating the performance of different LLM-based persona generation methods; (ii) a training dataset, as in Toubia et al. [2025], for developing and testing new persona generation methods; and (iii) a high-quality profile library of diverse, realistic population-level personas suitable for direct use in “silicon sample” simulations. Constructing such a comprehensive dataset necessitates addressing data privacy concerns and requires a substantial investment of time and resources. However, we believe the potential benefits outweigh the effort.