Do Role-Playing Agents Practice What They Preach? Belief-Behavior Consistency in LLM-Based Simulations of Human Trust
As large language models (LLMs) are increasingly studied as role-playing agents to generate synthetic data for human behavioral research, ensuring that their outputs remain coherent with their assigned roles has become a critical concern. In this paper, we investigate how consistently LLM-based role-playing agents’ stated beliefs about the behavior of the people they are asked to role-play (“what they say”) correspond to their actual behavior during role-play (“how they act”). Specifically, we establish an evaluation framework to rigorously measure how well beliefs obtained by prompting the model can predict simulation outcomes in advance. Using an augmented version of the GENAGENTS persona bank and the Trust Game (a standard economic game used to quantify players’ trust and reciprocity), we introduce a belief-behavior consistency metric and systematically investigate how it is affected by factors such as: (1) the types of beliefs we elicit from LLMs, like expected outcomes of simulations versus task-relevant attributes of individual characters LLMs are asked to simulate; (2) when and how we present LLMs with relevant information about the Trust Game; and (3) how far into the future we ask the model to forecast its actions. We also explore how feasible it is to impose a researcher’s own theoretical priors in the event that the originally elicited beliefs are misaligned with research objectives. Our results reveal systematic inconsistencies between LLMs’ stated (or imposed) beliefs and the outcomes of their role-playing simulations, at both the individual and population levels. Specifically, we find that, even when models appear to encode plausible beliefs, they may fail to apply them in a consistent way. These findings highlight the need to identify how and when LLMs’ stated beliefs align with their simulated behavior, allowing researchers to use LLM-based agents appropriately in behavioral studies.
We illustrate our general framework with a Trust Game case study (Berg et al., 1995), a standard benchmark for LLM role-playing biases (Wei et al., 2024; Xie et al., 2024). The Trust Game quantifies interpersonal trust as the amount of money the first player (the Trustor) chooses to send to the second player (the Trustee). We elicit the model’s beliefs about how individuals with specific personas or populations with shared characteristics would act, then have the model role-play as the Trustor, allowing direct comparison of stated beliefs with actual behaviors. Our findings suggest systematic belief-behavior inconsistencies: explicit task context during belief elicitation does not appear to improve consistency, self-conditioning enhances alignment in some models while imposed priors tend to undermine it, and individual-level forecasting accuracy tends to degrade over longer horizons.
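To make the two measurement steps concrete, the sketch below contrasts the belief-elicitation prompt (the model reasons about a persona out of role) with the role-play prompt (the model acts as the Trustor). This is a minimal illustration, not the paper’s actual implementation: `query_llm`, the prompt wording, the endowment, and the multiplier are all assumptions introduced here for exposition.

```python
# Minimal sketch of the Trust Game protocol in the case study.
# `query_llm` is a hypothetical stand-in for any chat-completion call that
# returns the model's text response; ENDOWMENT and MULTIPLIER are
# illustrative values, not those used in the paper.

ENDOWMENT = 10   # units the Trustor starts with (assumed)
MULTIPLIER = 3   # sent amount is multiplied before reaching the Trustee (assumed)

def elicit_belief(query_llm, persona: str) -> float:
    """Ask the model, out of role, how much this persona *would* send."""
    prompt = (
        f"Consider a person described as follows: {persona}\n"
        f"In a Trust Game, this person has {ENDOWMENT} units; any amount "
        f"they send is multiplied by {MULTIPLIER} and given to the Trustee, "
        "who may return any share. How many units would this person send? "
        "Answer with a single number."
    )
    return float(query_llm(prompt))  # assumes the reply parses as a number

def simulate_trustor(query_llm, persona: str) -> float:
    """Have the model role-play the persona and act as the Trustor."""
    prompt = (
        f"You are the following person: {persona}\n"
        f"You are the Trustor in a Trust Game. You have {ENDOWMENT} units; "
        f"whatever you send is multiplied by {MULTIPLIER} and given to the "
        "Trustee, who may return any share. How many units do you send? "
        "Answer with a single number."
    )
    return float(query_llm(prompt))
```

Comparing `elicit_belief` and `simulate_trustor` outputs for the same persona is the core of the belief-behavior comparison: the prompts differ only in whether the model describes the persona or inhabits it.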
We present a framework that elicits a model’s beliefs through targeted prompts to measure belief-behavior consistency in role-play simulations at two levels of analysis. First, at the population level, we quantify consistency by computing the correlation between persona attributes and simulated statistical behaviors aggregated across all simulated participants. Second, at the individual level, we test an LLM’s capacity to predict the future actions of a specific simulated member of the population. In both cases, we test whether querying the model’s own expectations can flag misaligned beliefs before they lead to errors in large-scale synthetic data. We also examine how three design choices affect belief-behavior consistency: how much background context we give the model when eliciting beliefs, which outcomes we ask it to predict, and how far into the future we ask it to forecast its actions.
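The following sketch shows one way the two levels of analysis could be scored, assuming we already hold, for each simulated participant, an elicited belief and a role-played send amount (e.g., from the functions above). The choice of Pearson correlation for the population level and mean absolute error for the individual level is an assumption for illustration; the paper’s exact metric may differ.

```python
# Sketch of the two consistency measures, under the assumptions stated above.

import numpy as np
from scipy.stats import pearsonr

def population_consistency(beliefs: np.ndarray, behaviors: np.ndarray) -> float:
    """Population level: correlate stated beliefs with simulated behaviors
    across participants; r near 1 means beliefs predict aggregate behavior."""
    r, _ = pearsonr(beliefs, behaviors)
    return r

def individual_forecast_error(forecasts: np.ndarray, actions: np.ndarray) -> float:
    """Individual level: mean absolute error between the model's forecasts of
    one participant's future actions (e.g., across rounds) and the actions it
    actually takes when role-playing that participant."""
    return float(np.mean(np.abs(forecasts - actions)))

# Illustrative usage with made-up numbers:
beliefs = np.array([4.0, 7.0, 2.0, 9.0])     # elicited beliefs per persona
behaviors = np.array([5.0, 6.0, 3.0, 8.0])   # role-played send amounts
print(population_consistency(beliefs, behaviors))  # high r => consistent
```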
Limits of in-context conditioning for controllability. While self-conditioning improves consistency in some Llama models, imposed priors tend to undermine it across architectures. This suggests a potential limitation: in-context prompting may struggle to override entrenched model priors, constraining researchers’ ability to test alternative theories or correct biases. Future work might explore knowledge editing (Wang et al., 2023a; Orgad et al., 2024) or inference-time steering (Li et al., 2023; Lamb et al., 2024; Minder et al., 2025) for more robust belief control.
We investigate belief-behavior consistency in LLM-based role-playing agents using the Trust Game, revealing systematic inconsistencies between models’ stated beliefs and simulated behaviors at both population and individual levels. Our evaluation framework elicits beliefs as a diagnostic tool, identifying these issues before costly deployment. Key findings show that providing task context during belief elicitation does not improve consistency, self-conditioning helps some models while imposed priors generally undermine alignment, and forecasting accuracy degrades over longer horizons. These results highlight fundamental limitations in current LLM role-playing approaches and emphasize the need for robust internal consistency evaluation before using these systems as scientific instruments.