Why do role-playing agents show belief-behavior inconsistency in their outputs?

This explores why agents prompted to play a character will say their persona believes one thing, then act in a way that contradicts it — and what the corpus thinks is actually going on underneath.

The sharpest evidence comes from Trust Game experiments in which LLMs were asked to state what a persona would do and then simulated the persona actually doing it; the two systematically diverged. Imposing priors or spelling out the task didn't close the gap, which points to a striking conclusion: in these systems, stated belief and executed behavior are produced by *different processes* rather than one flowing from the other (see: Why don't LLM role-playing agents act on their stated beliefs?). The belief isn't a cause of the action; it's just more generated text.
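To make that divergence concrete, here is a minimal sketch of how a belief-behavior gap could be scored. It assumes each Trust Game trial records two numbers: the amount the model said the persona would send and the amount the persona sent in simulation. The function name, the field names, and the sample values are hypothetical illustrations, not data or code from the experiments.

```python
from statistics import mean

def belief_behavior_gap(stated, enacted):
    """Compare stated amounts with enacted amounts, trial by trial.

    Returns the mean signed gap (enacted minus stated) and the mean
    absolute gap. Both inputs are hypothetical per-trial amounts.
    """
    if len(stated) != len(enacted) or not stated:
        raise ValueError("need two equal-length, non-empty sequences")
    diffs = [e - s for s, e in zip(stated, enacted)]
    return {
        "mean_signed_gap": mean(diffs),
        "mean_abs_gap": mean(abs(d) for d in diffs),
    }

# Illustrative values only (not experimental results).
stated = [5, 5, 8, 2]
enacted = [3, 6, 4, 2]
gap = belief_behavior_gap(stated, enacted)
print(gap)
```

A signed gap near zero with a large absolute gap would suggest noise rather than a directional bias; a large signed gap would suggest the simulated behavior is systematically more or less generous than the stated belief.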

That reframing is the key. One influential view, Shanahan's framework, holds that a dialogue agent isn't a mind with beliefs at all; it's a character-text generator. The prompt sets up a character, and the model produces continuations that *sound* like that character, so folk-psychology words like 'believes' apply to the simulated persona, not to anything stable inside the system (see: Should we treat dialogue agents as role-playing characters?). If 'belief' is just well-fitting surface text, there's no machinery forcing later behavior to honor it. A related finding shows how shallow that grounding is: persona prompts produce outputs whose variance *across repeated runs of the same persona* matches or exceeds the variance *between different personas*, meaning raw model uncertainty, not stable character knowledge, is steering the output (see: Why do LLM persona prompts produce inconsistent outputs across runs?). When the substrate is that noisy, consistency between a stated belief and a later act is almost coincidental.
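The variance comparison can be sketched as follows, assuming each persona's repeated runs have been reduced to a single numeric score per run. The function name, persona labels, and scores are hypothetical; the note does not say how the original study computed its variances, so this is one plausible reading rather than the study's method.

```python
from statistics import mean, pvariance

def variance_split(runs_by_persona):
    """Split output variance into within-persona and between-persona parts.

    runs_by_persona maps a persona label to a list of numeric scores,
    one per repeated run of the same persona prompt.
    """
    groups = list(runs_by_persona.values())
    # Average spread across repeated runs of the same persona.
    within = mean(pvariance(runs) for runs in groups)
    # Spread of the per-persona averages.
    between = pvariance([mean(runs) for runs in groups])
    return within, between

# Illustrative values only.
scores = {"cautious": [2, 6], "generous": [3, 7]}
within, between = variance_split(scores)
print(within, between, within >= between)
```

When `within` is at least as large as `between`, rerunning the same persona moves the output as much as switching personas does, which is the pattern the finding describes.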

The corpus also suggests the inconsistency comes in at least two distinct flavors worth separating. One is *drift*: the character degrades over a conversation. Reasoning models are especially prone to it, because extra 'thinking' diverts attention and pulls the style away from the persona unless reasoning is explicitly constrained to the role (see: Why do reasoning models lose character consistency during role-playing?). Multi-turn training that rewards consistency cuts drift by over 55%, and it separates local within-turn slips from global cross-conversation contradictions (see: Can training user simulators reduce persona drift in dialogue?). The other flavor is *grounding collapse*: agents look socially competent when one model secretly controls every interlocutor, but fail once a persona is supposed to hold private information and act on it, which reveals that they were skipping the reasoning work that connects belief to action all along (see: Why do LLMs fail when simulating agents with private information?).

There's a genuine tension in the corpus worth flagging, because it tells you the question isn't settled. The 'realizationism' view argues that RLHF-trained personas are *not* fragile pretense: post-training installs sticky dispositional profiles that survive adversarial pressure and jailbreak attempts (see: Are RLHF personas performed characters or realized dispositions?). So the answer may depend on *where* the persona comes from: a prompt-induced role-play character is loosely coupled to behavior and drifts, while a trained-in disposition is more durable. Either way, the deeper backdrop is that token outputs are inherently mutable. They shift with sampling, wording, and context by design, which makes traditional 'does the behavior match the stated belief' consistency checks a poor fit for the medium (see: Why does AI output change with every prompt and context?).

The less obvious takeaway: the most reliable fix in the corpus isn't making the model believe harder; it's moving the persona *out of the model*. Reliability comes from externalizing memory, skills, and protocols into a surrounding harness rather than trusting the model to re-solve consistency on every turn (see: Where does agent reliability actually come from?). Belief-behavior consistency, on this read, is an engineering property of the scaffold around the model, not a psychological property of the character inside it.
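As a rough illustration of what "moving the persona out of the model" could look like, the sketch below keeps a persona's stated commitments in an external store and checks each proposed action against them before it runs. The `PersonaHarness` class, the `max_send` commitment, and the action format are all hypothetical names invented for this example; the corpus describes the harness idea only in general terms.

```python
from dataclasses import dataclass, field

@dataclass
class PersonaHarness:
    """Hypothetical harness holding persona commitments outside the model."""
    commitments: dict = field(default_factory=dict)

    def record(self, key, value):
        # Persist a stated belief as an explicit, checkable commitment.
        self.commitments[key] = value

    def check(self, action):
        """Return the commitment keys that a proposed action violates."""
        violations = []
        limit = self.commitments.get("max_send")
        if limit is not None and action.get("send", 0) > limit:
            violations.append("max_send")
        return violations

# The persona states it would send at most 5; the harness remembers that.
harness = PersonaHarness()
harness.record("max_send", 5)
print(harness.check({"send": 8}))  # proposed action exceeds the stated limit
print(harness.check({"send": 3}))  # proposed action is within the stated limit
```

The design choice is that consistency is enforced by a lookup in the scaffold, so the model does not have to regenerate and honor its own earlier statement on every turn.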

Sources (9 notes)

Why don't LLM role-playing agents act on their stated beliefs?

Trust Game testing revealed systematic inconsistencies between what LLMs claim personas would do and how they actually behave in simulation. Imposed priors and explicit task context did not improve alignment, suggesting persona beliefs operate independently of execution.

Should we treat dialogue agents as role-playing characters?

Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk-psychology applicable to the simulated persona, not the underlying system.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Why do reasoning models lose character consistency during role-playing?

Large reasoning models exhibit attention diversion and style drift during role-playing, but the RAR method—using role-aware constraints and contrastive learning on reasoning style—recovers character fidelity across multiple benchmarks. Simply extending reasoning without guidance actively degrades persona consistency.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Why do LLMs fail when simulating agents with private information?

Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Why does AI output change with every prompt and context?

AI outputs exhibit essential mutability—they vary with sampling, prompt wording, and audience interpretation. This is not a defect but a defining feature of tokens as media, making them fundamentally different from fixed commodities and resistant to traditional quality assurance.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.
