Generative Agent Simulations of 1,000 People
We present a novel agent architecture that simulates the attitudes and behaviors of 1,052 real individuals: applying large language models to qualitative interviews about their lives, then measuring how well the agents replicate the attitudes and behaviors of the individuals they represent. The generative agents replicate participants' responses on the General Social Survey 85% as accurately as participants replicate their own answers two weeks later.
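Read concretely, this normalization is a per-participant ratio of agent–participant agreement to the participant's own test–retest agreement. A minimal formalization in our own notation (the paper's exact estimator may differ):

$$ \text{normalized accuracy}_i = \frac{\operatorname{accuracy}\left(\text{agent}_i,\ \text{participant}_i^{\,t_1}\right)}{\operatorname{accuracy}\left(\text{participant}_i^{\,t_2},\ \text{participant}_i^{\,t_1}\right)} $$

where $t_1$ and $t_2$ denote the two survey administrations, taken two weeks apart.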
We recruited over 1,000 participants using stratified sampling to create a representative U.S. sample across age, gender, race, region, education, and political ideology. Each participant completed a voice-to-voice interview in English, producing transcripts with an average length of 6,491 words per participant (std = 2,541; SM 1). To facilitate this process, we developed an AI interviewer (SM 2) that conducted the interviews using a semi-structured protocol.
When an agent is queried, the entire interview transcript is injected into the model prompt, instructing the model to imitate the person based on their interview data. For experiments requiring multiple decision-making steps, agents were given memory of previous stimuli and their responses to those stimuli through short text descriptions. The resulting agents can respond to any textual stimulus, including forced-choice prompts, surveys, and multi-stage interactional settings.
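A minimal sketch of how such a query could be assembled, assuming a chat-style language-model API; the function, prompt wording, and field values are our own illustration, not the study's codebase:

```python
# Illustrative sketch: build a prompt that asks a language model to
# imitate one participant, given their interview and prior memories.

def build_agent_prompt(transcript: str, memories: list, stimulus: str) -> list:
    """Inject the full interview transcript plus short text memories of
    earlier stimuli/responses, then append the new stimulus."""
    memory_text = "\n".join(f"- {m}" for m in memories) or "(none)"
    system = (
        "Role-play as the person whose interview appears below and answer "
        "every question as that person would.\n\n"
        f"INTERVIEW TRANSCRIPT:\n{transcript}\n\n"
        f"EARLIER STIMULI AND THIS PERSON'S RESPONSES:\n{memory_text}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": stimulus},
    ]

# Example: querying the agent with a forced-choice survey item.
messages = build_agent_prompt(
    transcript="Interviewer: Tell me about your childhood...\nParticipant: ...",
    memories=["Round 1 of the trust game: sent $5 of $10."],
    stimulus="How much confidence do you have in Congress? "
             "(a great deal / only some / hardly any)",
)
```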
We evaluated the generative agents on their ability to predict their source participants’ responses to a series of surveys and experiments commonly used across social science disciplines.
To assess the contribution of interviews to the generative agents' predictive accuracy, we compared the performance of interview-based generative agents with two baselines that replace interview transcripts with alternative forms of description. These baselines are grounded in how language models have been used to proxy human behaviors in prior studies: one using demographic attributes (13, 38), and the other using a paragraph summarizing the target person’s profile (14). For the demographic-based generative agents, we used participants' responses to GSS questions to capture individuals’ age, gender, race, and political ideology—demographic attributes commonly used in previous studies (38). For the persona-based generative agents, we asked participants to write a brief paragraph about themselves after the interview, including their personal background, personality, and demographic details, similar to the material used to generate persona agents in prior work (14).
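The three conditions therefore differ only in what is placed in the model's context; a schematic comparison under that reading (field names are illustrative placeholders):

```python
# Schematic of the three agent conditions; only the injected context differs.

def agent_context(condition: str, participant: dict) -> str:
    if condition == "interview":
        return participant["interview_transcript"]       # full transcript
    if condition == "demographic":
        d = participant["gss_demographics"]              # from GSS responses
        return (f"Age: {d['age']}. Gender: {d['gender']}. "
                f"Race: {d['race']}. Political ideology: {d['ideology']}.")
    if condition == "persona":
        return participant["self_written_paragraph"]     # participant-written
    raise ValueError(f"unknown condition: {condition}")
```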
The first component of our evaluation, the GSS, is widely used across sociology, political science, social psychology, and other social sciences to assess respondents' demographic backgrounds, behaviors, attitudes, and beliefs on a broad range of topics, including public policy, race relations, gender roles, and religion (20).
First, even when we randomly removed 80% of the interview transcript (equivalent to removing 96 minutes of the 120-minute interview), the interview-based generative agents still outperformed the composite agents, which are informed by participants' survey and experiment responses rather than interviews, achieving an average normalized accuracy of 0.79 (std = 0.11) on the GSS, with similar results observed for the Big Five. Second, to investigate whether the predictive power of interviews stems from linguistic cues or from the richness of the knowledge gained, we created "interview-summary" generative agents by prompting GPT-4o to convert interview transcripts into bullet-pointed summaries of key response pairs, capturing the factual content while removing the original linguistic features. These agents also outperformed composite agents, achieving a normalized accuracy of 0.83 (std = 0.12) on the GSS and showing similar improvements for the Big Five. These findings suggest that, when informing language models about human behavior, interviews are more effective and efficient than survey-based methods.
AI Interviewer Agent Architecture
A trained human interviewer knows when and how to ask meaningful follow-up questions, balancing the need to adhere to a well-designed interview script with allowing detours that help participants open up and share aspects they may have initially forgotten or not thought to share (33, 34, 44). To instill this capability in an AI interviewer agent, our design goal was an architecture that affords researchers control over the overarching content and structure of the interview while allowing the interviewer agent a degree of freedom to explore follow-up questions that are not hard-coded in the interview script.
At the start of a new question block in the interview script, the AI interviewer begins by asking the scripted question verbatim. As participants respond, the AI interviewer uses a language model to make dynamic decisions about the best next step within the time limit set for the question block. For instance, when asking a participant about their childhood, if the response includes a remark like, “I was born in New Hampshire… I really enjoyed nature there,” but without specifics about what they loved about the place in their childhood, the interviewer would generate and ask a follow-up question such as, “Are there any particular trails or outdoor places you liked in New Hampshire, or had memorable experiences in as a child?” Conversely, when asking the participant to state their profession, if the participant responds, “I am a dentist,” the interviewer would determine that the question was completely answered and move on to the next question.
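A sketch of this decision loop under our reading, with a hypothetical `llm` completion callable and a stubbed speech-capture step; the prompt wording is ours:

```python
import time

def get_participant_response() -> str:
    """Stub standing in for the platform's speech-to-text turn capture."""
    return input("participant> ")

def run_question_block(scripted_question: str, time_limit_s: float, llm) -> list:
    """Ask the scripted question verbatim, then let a language model choose,
    turn by turn, between a follow-up question and moving on, until the
    block's time budget runs out."""
    turns = [("interviewer", scripted_question)]
    deadline = time.monotonic() + time_limit_s
    while time.monotonic() < deadline:
        turns.append(("participant", get_participant_response()))
        decision = llm(
            "You are the interviewer. If the last answer fully addresses the "
            "question, reply MOVE_ON; otherwise reply with a single follow-up "
            f"question.\n\nConversation so far: {turns}"
        )
        if decision.strip() == "MOVE_ON":
            break
        turns.append(("interviewer", decision))
    return turns
```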
The reasoning and generation of the follow-up questions were done by prompting a language model. However, to generate effective actions for the interviewer, the language model needed to remember and reason over the prior conversational turns to ask meaningful follow-up questions that are informed and relevant in the context of what the participants have already shared. While modern language models have become increasingly proficient at reasoning, they still struggle to consider every piece of information in the prompt if it is too long (45). Thus, indiscriminately including everything from the interview up to that point risks gradually degrading the performance of the interviewer in generating effective follow-up questions or decisions to move on.
To overcome this, our interviewer architecture includes a reflection module that dynamically synthesizes the conversation so far and outputs a summary note describing inferences the interviewer can make about the participants.
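A minimal sketch of such a reflection step, reusing the hypothetical `llm` helper from above; the prompt is our own paraphrase:

```python
def reflect(turns: list, llm) -> str:
    """Compress the conversation so far into a short summary note of
    inferences about the participant, so later interviewer prompts can
    carry this note instead of the full (and growing) transcript."""
    convo = "\n".join(f"{speaker}: {utterance}" for speaker, utterance in turns)
    return llm(
        "In a few bullet points, summarize what an interviewer could infer "
        "about this participant from the conversation below.\n\n" + convo
    )
```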
In designing the interview script fed to our interviewer agent, we aimed to satisfy two goals. The first, shared with qualitative research generally, is that a well-designed script with questions that inspire meaningful answers is crucial to our objective of creating generative agents that encapsulate a nuanced portrait of the individuals we are modeling. The second is particular to our study: we wanted an interview script that was designed independently of our evaluation metrics, by researchers outside our team. This ensures that we did not tailor the content of the interview script to favor or align with predicting participants' responses to the specific surveys and experiments included in our study.
We implemented the interviewer agent as a web application in our study platform, providing voice-to-voice interaction through an audio-only, Zoom-like interface. Low-latency voice-to-voice exchanges were crucial for giving participants the feeling of actually talking to an interviewer and helping the AI interviewer agent build rapport with the interviewee (46). Before the interview, our platform disclosed that the interviewer was an AI and conducted an audio calibration by asking participants to read aloud the first two lines of The Great Gatsby by F. Scott Fitzgerald.
Of the pilot interviews, 10 were conducted by human interviewers and 25 by the AI interviewer agent. Members of our research team trained in the social sciences evaluated the resulting transcripts, assessing their performance as training data for generative agents on the same set of attitudinal and behavioral tasks presented in the main results of this article. We also compared the AI interviewer's transcripts to transcripts of interviews conducted by expert human interviewers as part of the American Voices Project. By the end of this pilot stage, our team concluded that the quality of the transcripts produced by the AI interviewer agent compared well with those produced by human interviewers.
Generative agents are software systems that simulate human behavior, powered by a language model augmented with a set of memories that define their behaviors (14, 15). These memories, stored in a database (or "memory stream") in text form, are retrieved as needed to generate the agent's behaviors using a language model. This is paired with a reflection module that synthesizes these memories into reflections, selecting portions or all of the text in the agents' memories to prompt a language model to infer useful insights, thereby enhancing the believability of the agents' behaviors. While traditional agents in agent-based models rely on manually articulated behavior in specific scenarios, generative agents leverage language models to produce human-like responses that reflect the personas described in their memories across a wide range of circumstances. In this work, we aimed to build generative agents that accurately predict individuals' attitudes and behaviors by using detailed information from participants' interviews to seed the agents' memories, effectively tasking the generative agents to role-play as the individuals that they represent.
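In data-structure terms, a memory stream can be as simple as an append-only list of text records with relevance-based retrieval. A minimal sketch, ours rather than the original implementation, with naive keyword-overlap scoring standing in for learned relevance:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStream:
    """Append-only store of text memories with naive keyword retrieval."""
    records: list = field(default_factory=list)

    def add(self, text: str) -> None:
        self.records.append(text)

    def retrieve(self, query: str, k: int = 5) -> list:
        """Return the k records sharing the most words with the query."""
        words = set(query.lower().split())
        ranked = sorted(self.records,
                        key=lambda r: len(words & set(r.lower().split())),
                        reverse=True)
        return ranked[:k]
```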
To explicitly infer high-level, more abstract insights about the participants embedded in the interview transcripts, we introduced a variant of generative agents’ reflection module called “expert reflection.” In this module, we prompt the model to generate reflections on a participant’s data, but instead of simply asking the model to infer insights from the interview, we ask it to adopt the persona of a domain expert. Specifically, we ask the model to generate four sets of reflections, each time taking on the persona of a different domain expert from four branches of social sciences: psychologist, behavioral economist, political scientist, and demographer. These sets of reflections synthesize insights relevant to the domain represented by each expert. For instance, for one interview transcript, the expert personas generated different insights:
Psychologist: “[Redacted] values his independence and expresses a clear preference for autonomy, particularly highlighted by his enjoyment of traveling for his job and his frustration with his mother's overprotectiveness. This suggests a strong desire for personal freedom and self-determination.”
Behavioral Economist: “[Redacted]’s aspiration to save for a relaxing vacation and possibly advance to a managerial position indicates a blending of practical financial goals with personal leisure aspirations, emphasizing balanced life satisfaction.”
Political Scientist: “[Redacted] identifies as a Republican and espouses strong support for the party's views, particularly around immigration and drug policy. However, he also expresses specific support for traditionally Democratic positions on issues like abortion rights and the legalization of marijuana, suggesting a blend of ideologies.”
Demographer: “[Redacted] works as an inventory specialist and earns between $3,000 to $5,000 monthly, contributing to a household income of around $7,000 per month. He works primarily at Home Depots but has a varied work schedule, indicating some job stability and flexibility.”
For every participant, we generated these four sets of reflections by prompting GPT-4o with the participants’ interviews and asking it to generate up to 20 observations or reflections for each of the four experts.
We generated these reflections once and saved them in the agents’ memory. Whenever we needed to predict the participants’ responses to a question, we first classified, by prompting the language model, which domain expert (demographer, psychologist, behavioral economist, or political scientist) would best answer the question.
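Putting the two steps together, a sketch of generate-once, route-per-question; the prompts, the `llm` helper, and the routing-by-label scheme are our own illustration:

```python
EXPERTS = ["psychologist", "behavioral economist",
           "political scientist", "demographer"]

def generate_expert_reflections(transcript: str, llm) -> dict:
    """One-time pass: up to 20 observations per expert persona, kept in memory."""
    return {
        expert: llm(
            f"You are a {expert}. From the interview below, write up to 20 "
            f"observations or reflections about this person.\n\n{transcript}"
        )
        for expert in EXPERTS
    }

def answer_as_agent(question: str, transcript: str, reflections: dict, llm) -> str:
    """Query time: classify which expert best fits the question, then answer
    with that expert's reflections (and the transcript) in context."""
    expert = llm(
        "Which expert would best answer this question: "
        f"{', '.join(EXPERTS)}? Reply with one label only.\n\nQuestion: {question}"
    ).strip().lower()
    return llm(
        "Role-play as the interviewee.\n\n"
        f"Expert reflections:\n{reflections.get(expert, '')}\n\n"
        f"Interview:\n{transcript}\n\nQuestion: {question}"
    )
```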
We aimed to assess the agents' predictive accuracy regarding the attitudes and behaviors of the underlying sample across surveys and experimental constructs from a broad array of social scientific disciplines and methods.
● Summary Agents. We investigated whether the predictive power of interviews stems from linguistic cues in the transcript or from the information it contains. To explore this, we created summary agents by prompting GPT-4o to convert interview transcripts into bullet-pointed dictionaries of key response pairs, capturing the factual content while removing most linguistic features (e.g., { "childhood_town": "Small town", "siblings": "Only child", "marital_history": "Married twice", "children": "Two children, but they are not living with the interviewee"...}). By isolating the factual knowledge from the unique linguistic elements, we aimed to determine whether the agents' predictive accuracy relies on those linguistic nuances. If the summary agents perform worse than the full interview-based agents, it would suggest that linguistic features play a key role in enhancing prediction accuracy.
● Maximal Agents. Maximal agents, which incorporated information from surveys, experiments, and interviews, achieved performance similar to interview-based agents, with a normalized accuracy of 0.85 (std = 0.12) on the GSS. This suggests that the GSS and other constructs add little predictive power beyond the interviews.
● Summary Agents. The summary agents performed slightly below the interview agents, with a normalized accuracy of 0.83 (std = 0.12) on the GSS. This indicates that while some information carried by linguistic cues may be lost during summarization, much of the performance is due to the information in the interview rather than the surface-level language that participants use.
● Random Lesion Interview Agents. Performance declined linearly as we removed increasing portions of the interview data. Starting with a normalized accuracy of 0.85 (std = 0.11) when no information was removed, accuracy dropped to 0.79 (std = 0.11) when 80% of the utterances were excluded (a sketch of this lesioning procedure follows the list). This suggests that although performance decreases as interview length is reduced, even a short interview contains sufficient richness to outperform agents informed solely by surveys and experiments, highlighting the efficiency of interviews in eliciting valuable insights.
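As referenced in the last item, a minimal sketch of the lesioning procedure, assuming the transcript has been split into an ordered list of utterances (our own reconstruction):

```python
import random

def lesion_transcript(utterances: list, drop_fraction: float, seed: int = 0) -> list:
    """Randomly remove a fraction of utterances while preserving the order
    of those that remain (e.g., drop_fraction=0.8 keeps 20% of them)."""
    rng = random.Random(seed)
    n_keep = round(len(utterances) * (1 - drop_fraction))
    keep_idx = sorted(rng.sample(range(len(utterances)), n_keep))
    return [utterances[i] for i in keep_idx]
```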
To accommodate the diverse use cases outlined above, we need to consider varying levels of access to the agent bank. What are the key dimensions along which we can structure this access? And what opportunities and risks emerge as we adjust these levels? We propose framing this discussion along three axes: 1) What types of tasks can be submitted? 2) How are agent responses presented? and 3) Who can access the system? Together, these axes form the design space; we present our proposed plan in the next section.
What types of tasks can be submitted? This axis examines the range of queries users can submit to our agents. At one end of the spectrum, users could submit any query, receiving responses in various forms, from discrete (e.g., multiple-choice) to open-ended (e.g., qualitative interview responses). This flexibility carries the risk that participants could potentially be identified through probing questions, even if raw data is not directly available. Moving along the axis, we could implement more structured tasks, where queries are predefined (e.g., as surveys or experiments from our benchmark) and responses are constrained to specific formats. In this scenario, users could still submit suggestions for new queries and response types, subject to review and approval. This more controlled access would allow users to replicate our findings while providing them an opportunity to explore their own interests, albeit with a slightly slower feedback loop due to the approval process.