Why don't LLM role-playing agents act on their stated beliefs?
When LLMs articulate what a persona would do in the Trust Game, their simulated actions contradict those stated beliefs. This note explores whether the gap reflects deeper inconsistencies in how language models apply knowledge to behavior.
Using the Trust Game as a behavioral benchmark, the researchers found systematic inconsistencies between LLMs' stated beliefs about how personas would behave and the actual outcomes of their role-playing simulations, at both the individual and population levels. Even when models appear to encode plausible beliefs, they fail to apply them consistently.
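A minimal sketch of how such a consistency check can be run, assuming both the stated belief and the enacted choice are elicited as the amount the persona sends as the Trust Game investor; the function names, prompt wording, and metrics below are illustrative and are not the paper's exact protocol.

```python
# Sketch (assumed setup): compare a model's stated belief about a persona's
# Trust Game transfer with the transfer it makes when role-playing that persona,
# at the individual and population levels.
from statistics import mean

def elicit_belief(model, persona: str, endowment: int = 10) -> float:
    """Ask the model, out of character, how much of the endowment this persona
    would send as the investor. `model` is any callable prompt -> text, and the
    reply is assumed to parse cleanly as a number."""
    prompt = (
        f"{persona} is the investor in a Trust Game with an endowment of {endowment}. "
        "How much would they send to the trustee? Answer with a number."
    )
    return float(model(prompt))

def simulate_action(model, persona: str, endowment: int = 10) -> float:
    """Have the model role-play the persona and actually choose a transfer."""
    prompt = (
        f"You are {persona}. You are the investor in a Trust Game with an endowment "
        f"of {endowment}; whatever you send is tripled for the trustee. "
        "How much do you send? Answer with a number."
    )
    return float(model(prompt))

def consistency_report(model, personas: list[str], endowment: int = 10) -> dict:
    beliefs = [elicit_belief(model, p, endowment) for p in personas]
    actions = [simulate_action(model, p, endowment) for p in personas]
    return {
        # Individual level: how far each persona's enacted transfer drifts
        # from the model's own stated belief about that persona.
        "individual_mae": mean(abs(b - a) for b, a in zip(beliefs, actions)),
        # Population level: whether aggregate behavior matches the aggregate
        # of stated beliefs, even if individual personas diverge.
        "population_gap": abs(mean(beliefs) - mean(actions)),
    }
```

A model can score well at the population level (average stated and enacted transfers coincide) while still being inconsistent persona by persona, which is why the two levels are reported separately.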
Key findings: explicit task context during belief elicitation does not improve consistency; self-conditioning enhances alignment in some models; imposed priors tend to undermine rather than improve consistency; and individual-level forecasting accuracy degrades over longer horizons. In-context prompting may struggle to override entrenched model priors, limiting researchers' ability to test alternative theories or correct biases.
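To make the elicitation conditions named above concrete, here are hypothetical prompt variants for explicit task context, self-conditioning, and imposed priors; the persona and wording are placeholders, not the paper's prompts.

```python
# Hypothetical prompt variants illustrating the elicitation conditions;
# PERSONA and all wording are placeholders, not taken from the paper.
PERSONA = "a cautious retired accountant"

BASELINE = f"How much of a 10-unit endowment would {PERSONA} send in the Trust Game?"

# Explicit task context: the belief question spells out the game rules.
WITH_TASK_CONTEXT = (
    "In the Trust Game, the investor's transfer is tripled and the trustee may "
    f"return any share. How much of a 10-unit endowment would {PERSONA} send?"
)

# Self-conditioning: the model's own earlier stated belief is fed back into
# the role-playing prompt before the action is simulated.
def self_conditioned_action_prompt(stated_belief: str) -> str:
    return (
        f"You are {PERSONA}. Earlier you said: '{stated_belief}'. "
        "You are now the investor with 10 units. How much do you send?"
    )

# Imposed prior: the experimenter asserts a belief and asks the model to act on it.
def imposed_prior_action_prompt(asserted_belief: str) -> str:
    return (
        f"You are {PERSONA}. Assume this persona believes: '{asserted_belief}'. "
        "You are the investor with 10 units. How much do you send?"
    )
```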
This connects to the knowing-doing gap documented elsewhere in the vault. As in Can language models understand without actually executing correctly?, the belief-behavior inconsistency in role-playing is a social-cognitive instance of the same split-brain phenomenon: the model can articulate what a persona would do without being able to enact it. And as in Do personas make language models reason like biased humans?, the failure of imposed priors to improve consistency suggests that persona beliefs are not controllable through prompting alone.
Source: Role Play Paper: Do Role-Playing Agents Practice What They Preach?
Related concepts in this collection
- Can language models understand without actually executing correctly?
  Do LLMs truly comprehend problem-solving principles if they consistently fail to apply them? This explores whether the gap between articulate explanations and failed actions points to a fundamental architectural limitation.
  Relation: belief-behavior inconsistency as social-cognitive split-brain.
- Do personas make language models reason like biased humans?
  When LLMs are assigned personas, do they develop the same identity-driven reasoning biases that humans exhibit? And can standard debiasing techniques counteract these effects?
  Relation: imposed priors fail to override entrenched model priors.
Original note title: LLM role-playing agents show systematic belief-behavior inconsistency — stated beliefs fail to predict simulated actions even when beliefs appear plausible