What makes persona-assigned language models unstable across different conversation runs?

This explores why the same persona prompt produces different behavior from run to run — whether the instability comes from the model's own uncertainty, from drift over a conversation, or from how personas are installed in the first place.

Persona-assigned language models wobble across runs rather than holding a steady character, and the corpus points to a root cause more unsettling than a prompting bug: the model was never committed to the character to begin with. The sharpest evidence is that when the same persona prompt is run repeatedly, the variance across identical runs matches or exceeds the variance across entirely different personas (see "Why do LLM persona prompts produce inconsistent outputs across runs?"). In other words, the noise inside one persona is as large as the signal that is supposed to separate distinct personas. What looks like a stable social identity is really the model's own uncertainty surfacing as output.
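The comparison above can be made concrete with a small variance decomposition. The sketch below is only an illustration under stated assumptions: the persona names and the 1-to-5 ratings are invented, and `variance_decomposition` is a hypothetical helper, not the procedure used in the cited note. It compares the average variance across repeated runs of one persona (within) with the variance of the per-persona means (between).

```python
from statistics import mean, pvariance


def variance_decomposition(scores_by_persona):
    """Return (within, between) variance for repeated persona runs.

    scores_by_persona maps a persona name to the numeric scores produced
    by re-running the *same* persona prompt several times.
    """
    # Within: average spread across identical runs of a single persona.
    within = mean([pvariance(s) for s in scores_by_persona.values()])
    # Between: spread of the per-persona average scores.
    between = pvariance([mean(s) for s in scores_by_persona.values()])
    return within, between


# Hypothetical ratings (1-5 scale), five runs per persona. Invented data.
runs = {
    "strict_reviewer": [2.0, 4.0, 3.0, 5.0, 1.0],
    "lenient_reviewer": [3.0, 5.0, 2.0, 4.0, 3.0],
    "neutral_reviewer": [4.0, 2.0, 3.0, 1.0, 5.0],
}

within, between = variance_decomposition(runs)
# If within >= between, run-to-run noise swamps the persona signal.
noise_dominates = within >= between
print(f"within={within:.3f} between={between:.3f} noise_dominates={noise_dominates}")
```

With real data, the scores would come from whatever numeric judgment the persona-prompted model is asked to produce. The point of the sketch is only the ratio: a persona prompt carries usable signal when the between-persona variance clearly exceeds the within-persona variance.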

A complementary framing comes from Shanahan's '20 questions' test: regenerate the same response and you get different answers, each internally consistent with the prior context but not with each other (see "Do large language models actually commit to a single character?"). The model holds a superposition of possible characters and *samples* one at generation time rather than committing. That sampling is the instability: it is structural, not a tuning failure. The corpus also contains a dissenting view. Some argue that post-training installs personas robustly enough to resist adversarial pressure, treating them as 'realized' substrate-level dispositions rather than improvised performances (see "Are LLM personas realized or merely simulated through training?"). The tension between 'sampled on the fly' and 'realized in the weights' is the live debate underneath this question.

Even where a persona does cohere, it drifts over the course of a conversation. Mapping the internal 'persona space' shows one dominant axis measuring distance from the default Assistant mode, and emotional or self-reflective turns push the model predictably along it, which suggests post-training only loosely tethers the model to its assigned character (see "How stable is the trained Assistant personality in language models?"). Notably, scale does not fix this drift: a far more capable model improved persona consistency by under 3% over a weaker one, because standard training optimizes per-turn quality, not cross-turn coherence (see "Does model capability translate to better persona consistency?"). Holding character is a different skill from being smart, and nobody trained for it directly.
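Drift along a single axis can be pictured as a projection. The source note on Assistant stability also mentions activation capping along this axis as a mitigation. The sketch below is a toy illustration of both ideas and not the cited research's implementation: the three-dimensional vectors, the axis, the threshold, and the function names `project` and `cap_along_axis` are all assumptions, and it treats a larger projection as being farther from the default Assistant.

```python
import math


def _unit(axis):
    """Normalize a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in axis))
    return [a / norm for a in axis]


def project(activation, axis):
    """Scalar projection of an activation vector onto the (normalized) axis."""
    return sum(x * u for x, u in zip(activation, _unit(axis)))


def cap_along_axis(activation, axis, max_proj):
    """Clamp the component along `axis` to at most `max_proj`.

    Components orthogonal to the axis are left unchanged.
    """
    unit = _unit(axis)
    excess = max(0.0, project(activation, axis) - max_proj)
    return [x - excess * u for x, u in zip(activation, unit)]


# Hypothetical persona axis and per-turn activations (toy 3-d vectors).
persona_axis = [1.0, 0.0, 0.0]
turn_activations = [
    [0.2, 1.0, 0.5],  # early turn, close to the default
    [0.9, 1.1, 0.4],  # after an emotional turn
    [2.5, 0.8, 0.6],  # after a self-reflective turn, far from the default
]

drift = [project(a, persona_axis) for a in turn_activations]
capped = cap_along_axis(turn_activations[-1], persona_axis, max_proj=1.0)
print("drift per turn:", drift)
print("last turn after capping:", capped)
```

In a real model the axis would be estimated from internal activations and the cap applied inside the forward pass. Here the vectors are plain lists so the geometry is easy to follow.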

The same per-turn-not-cross-turn blind spot shows up as a general multi-turn failure: models degrade in long conversations because RLHF rewards confident, premature answers over tracking evolving intent (see "Why do language models lose performance in longer conversations?"). There is also a deeper pull working against any assigned persona. When in-context instructions conflict with strong parametric priors from training, the priors win, and text prompting alone cannot override them (see "Why do language models ignore information in their context?"). A persona prompt is just in-context instruction, so it is perpetually competing against everything the model learned to be by default.

The more hopeful thread is that drift is treatable once you train *for* consistency instead of assuming it. Inverting the usual setup to train user simulators with consistency rewards, which check prompt-to-line, line-to-line, and Q&A coherence, cut persona drift by over 55% (see "Can training user simulators reduce persona drift in dialogue?"). And treating a persona as a living intermediary that gets re-optimized at test time against the actual user produces personas that cluster cleanly in latent space, suggesting real user-specific separation rather than the usual post-training mush (see "Can personas evolve in real time to match what users actually want?"). The takeaway you might not have expected: persona instability is not a flaw to prompt your way around. It is the default behavior of a system that samples characters, and the fix is to make consistency an explicit training objective rather than a hopeful instruction.
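As a rough picture of how three consistency checks could be folded into one reward signal, here is a minimal sketch. It is an assumption-laden illustration, not the cited work's method: `consistency_reward`, the equal weights, the `(expected, given)` shape of the Q&A pairs, and the `word_overlap` stand-in judge are all hypothetical. In practice the judge would more plausibly be a learned or LLM-based scorer.

```python
from itertools import combinations
from statistics import fmean
from typing import Callable, Sequence, Tuple

# Hypothetical judge: returns a consistency score in [0, 1] for two texts.
Judge = Callable[[str, str], float]


def _avg(values, default=1.0):
    """Mean of values, or `default` when there is nothing to compare."""
    values = list(values)
    return fmean(values) if values else default


def consistency_reward(
    persona: str,
    lines: Sequence[str],
    qa_pairs: Sequence[Tuple[str, str]],
    judge: Judge,
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Weighted sum of three consistency scores for one dialogue."""
    # Prompt-to-line: does each utterance fit the persona description?
    prompt_to_line = _avg(judge(persona, line) for line in lines)
    # Line-to-line: does each utterance agree with the other utterances?
    line_to_line = _avg(judge(a, b) for a, b in combinations(lines, 2))
    # Q&A: do answers to probe questions match the expected answers?
    qa = _avg(judge(expected, given) for expected, given in qa_pairs)
    w1, w2, w3 = weights
    return w1 * prompt_to_line + w2 * line_to_line + w3 * qa


def word_overlap(a: str, b: str) -> float:
    """Toy stand-in judge: Jaccard overlap of lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 1.0


reward = consistency_reward(
    persona="a retired teacher who loves gardening",
    lines=["I spent the morning gardening", "teaching was my career"],
    qa_pairs=[("retired teacher", "I am a retired teacher")],
    judge=word_overlap,
)
print(f"consistency reward: {reward:.3f}")
```

Keeping the three scores separate before weighting makes it possible to see which kind of drift a low reward comes from, matching the distinction the source note draws between local drift, global drift, and factual contradictions.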

Sources (9 notes)

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Does model capability translate to better persona consistency?

Claude 3.5 Sonnet achieved only a 2.97% improvement over GPT-3.5 on persona consistency despite a massive capability gap, suggesting persona adherence is largely orthogonal to model scaling. Standard training objectives optimize for per-turn quality, not cross-turn coherence.

Why do language models lose performance in longer conversations?

LLMs degrade in multi-turn settings because RLHF training rewards premature answers over clarification-seeking, creating pragmatic mismatch with individual user behaviors. A Mediator-Assistant architecture that explicitly parses user intent before execution recovers lost performance without retraining.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Can personas evolve in real time to match what users actually want?

PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.
