Can training user simulators reduce persona drift in dialogue?
Explores whether inverting typical RL setups—training the simulated user for consistency rather than the task agent—can measurably reduce persona drift and improve experimental reliability in dialogue research.
Prior work on persona-consistent dialogue treats user simulators as fixed environments against which task agents are trained. This paper inverts the setup: fix the task agent, and train the user simulator for consistency. The shift matters because unreliable user simulation distorts experimental results, introduces noise into policy learning, and misrepresents the humans being simulated.
Three complementary metrics capture distinct types of persona drift:
- Prompt-to-line consistency: does each utterance align with the initial persona prompt?
- Line-to-line consistency: does each utterance cohere with the conversation history?
- Q&A consistency: can the simulated user answer factual questions about their persona correctly?
These capture local drift (within a turn), global drift (across the conversation), and factual drift (contradiction of established facts). Using LLM-as-a-Judge to compute these metrics and applying them as multi-turn RL reward signals reduces inconsistency by over 55%.
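A minimal sketch of how the three judged metrics could be combined into a scalar reward for the simulator. The `JudgeLLM` interface, rubric wording, probe format, and equal weighting are illustrative assumptions, not the paper's implementation.

```python
class JudgeLLM:
    """Abstract LLM-as-a-Judge interface; a real implementation calls a judge model."""

    def score(self, rubric: str, evidence: str, utterance: str) -> float:
        """Return a 0-1 consistency score for `utterance` given `evidence` and a rubric."""
        raise NotImplementedError


def persona_consistency_reward(
    judge: JudgeLLM,
    persona_prompt: str,
    history: list[str],
    utterance: str,
    qa_probes: list[tuple[str, str, str]],  # (question, expected answer, simulator's answer)
    weights: tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    # Local drift: does this utterance align with the initial persona prompt?
    prompt_to_line = judge.score(
        "Rate 0-1 how consistent the utterance is with the persona description.",
        persona_prompt, utterance)

    # Global drift: does it cohere with the conversation so far?
    line_to_line = judge.score(
        "Rate 0-1 how consistent the utterance is with the dialogue history.",
        "\n".join(history), utterance)

    # Factual drift: do probe answers still match the persona's established facts?
    if qa_probes:
        qa = sum(
            judge.score(f"Rate 0-1 whether the answer matches the expected answer: {expected}",
                        question, given)
            for question, expected, given in qa_probes
        ) / len(qa_probes)
    else:
        qa = 1.0

    w1, w2, w3 = weights
    return (w1 * prompt_to_line + w2 * line_to_line + w3 * qa) / (w1 + w2 + w3)
```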
The persona drift problem is specific and well-documented: an LLM simulating a depressed patient may be "instantly cured" after a single conversational turn, or a simulated high-school student may suddenly demonstrate postgraduate-level reasoning. These are not edge cases — they are systematic consequences of RLHF training that "pushes LLMs to be helpful and harmless, thus adopting overly cheerful personas" that conflict with simulating depressed, disagreeable, or confused users.
Relative to "Why does supervised learning fail to enforce persona consistency?" (linked below), this paper extends the argument from offline RL to online multi-turn RL. The key advance: rather than relying on human-annotated contradiction labels, LLM-as-a-Judge provides scalable automatic evaluation that can serve as a continuous training signal.
The three-metric decomposition also refines the understanding of drift. It is not a single phenomenon but at least three distinct failure types that can be measured and corrected independently.
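A sketch of the inverted online loop itself: the task agent is frozen and acts as the environment, while the user simulator is the trainable policy, rewarded each turn with the judge-based metric sketched above. The `simulator_policy` and `task_agent` interfaces and the per-turn reward placement are assumptions for illustration.

```python
def rollout_and_update(simulator_policy, task_agent, judge, persona_prompt,
                       qa_probes, num_turns: int = 8):
    """One training episode: fixed task agent, trainable simulated user."""
    history, transitions = [], []
    for _ in range(num_turns):
        # The trainable simulated user speaks in character...
        user_utt = simulator_policy.generate(persona_prompt, history)
        reward = persona_consistency_reward(
            judge, persona_prompt, history, user_utt, qa_probes)
        transitions.append((history[:], user_utt, reward))
        history.append(f"User: {user_utt}")

        # ...then the frozen task agent responds (environment step).
        agent_utt = task_agent.respond(history)
        history.append(f"Agent: {agent_utt}")

    # Online multi-turn RL update (e.g. a policy-gradient step) on the simulator only.
    simulator_policy.update(transitions)
    return history
```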
Source: Conversation Agents
Related concepts in this collection
- Why does supervised learning fail to enforce persona consistency?
  Supervised learning trains models to generate good responses but never punishes contradictions. This note explores why explicit negative feedback is structurally necessary for dialogue agents to maintain consistent personas, and what training methods can provide it.
  Connection: this paper extends from offline to online multi-turn RL with automatic metrics.
- Why do static persona descriptions produce repetitive dialogue?
  Does relying on fixed attribute lists to define conversational personas limit dialogue depth and consistency? Research suggests static descriptions may cause repetition and self-contradiction in generated responses.
  Connection: persona drift is the dynamic version of static persona failure.
- Why do open language models converge on one personality type?
  Research testing LLMs on personality metrics reveals consistent clustering around ENFJ, the rarest human type. This explores what training mechanisms drive this convergence and what it reveals about AI alignment.
  Connection: RLHF's cheerful-persona bias is a specific instance of the ENFJ default.
- Can we track and steer personality shifts during model finetuning?
  This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
  Connection: a complementary monitoring approach. Multi-turn RL corrects drift through behavioral reward signals, while persona vectors detect drift in activation space before it manifests in behavior; the three-metric decomposition (prompt-to-line, line-to-line, Q&A) could be paired with persona vector tracking for earlier intervention.
- How stable is the trained Assistant personality in language models?
  Explores whether post-training successfully anchors models to their default Assistant mode, or whether conversations can predictably pull them toward different personas. Understanding persona stability matters for safety and reliability.
  Connection: the Assistant Axis provides the geometric context for persona drift. The "overly cheerful" RLHF bias that pulls simulated depressed patients toward instant cure is movement along the Assistant Axis toward the default region; multi-turn RL consistency training works against this gravitational pull.
- Why do AI personas default to the same personality type?
  Explores why large language models, despite their capacity to simulate diverse personalities, consistently default to ENFJ traits and resist deviation, even as model capability improves.
  Connection: multi-turn RL for persona consistency addresses one arm of the paradox. Models can be made consistent via training, but the ENFJ default and motivated-reasoning distortions remain; consistency training corrects drift without solving the deeper problem that the persona being drifted from may itself be unreliable.
- Does segment-level optimization work better for multi-turn dialogue alignment?
  How should preference optimization target multi-turn social dialogue: individual turns, whole conversations, or key segments in between? This matters because granularity affects whether agents learn genuine social intelligence or just local fixes.
  Connection: the granularity of the reward signal matters for both persona consistency and social alignment. Segment-level rewards outperform turn-level rewards for social behavior; the three-metric decomposition (prompt-to-line, line-to-line, Q&A) operates at different temporal granularities and could benefit from segment-level rather than turn-level application.
Original note title: multi-turn rl for persona consistency reduces drift by 55 percent by treating simulated users as trainable agents rather than fixed environments