Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning
Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. While these simulations enable scalable training and evaluation of AI agents, off-the-shelf LLMs often drift from their assigned personas, contradict earlier statements, or abandon role-appropriate behavior. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics—prompt-to-line consistency, line-to-line consistency, and Q&A consistency—that capture different types of persona drift, and we validate each against human annotations. Using these metrics as reward signals, we apply multi-turn reinforcement learning to fine-tune LLMs for three user roles: a patient, a student, and a social chat partner. Our method reduces inconsistency by over 55%, resulting in more coherent, faithful, and trustworthy simulated users.
However, this practice is not without risk. When LLMs poorly simulate the behaviors of real human subjects, they can reinforce misconceptions [69], misinform downstream systems [54], or produce misleading insights about human behavior [50]. Over-reliance on flawed simulations may give a false sense of alignment or generalization, particularly in sensitive domains like mental health or education. These limitations underscore the importance of critically evaluating how well LLMs maintain coherent and faithful human personas over time. LLMs often suffer from inconsistencies, ranging from abrupt changes in persona and contradictions with earlier statements to sudden stylistic shifts within a single conversation [47]. For example, an LLM-simulated patient intended to portray depression might, after a single conversational turn, be instantly “cured” and shift to a cheerful demeanor [57], or an LLM tasked with emulating a high-school student might suddenly demonstrate reasoning skills or vocabulary characteristic of a postgraduate researcher [57].

These breakdowns are not merely superficial; they pose fundamental challenges for downstream applications. To ensure that an LLM-powered therapist or customer-support agent behaves as intended, we must accurately simulate how a human user would respond. This matters not only in zero-shot prompting settings but also when simulating humans as environment models in reinforcement learning (RL), where consistent and predictable responses are crucial for agent training [63, 2]. In all these contexts, unreliable or incoherent dialogue can distort experimental results, introduce noise into policy learning, reduce the realism of simulated interactions, and ultimately misrepresent the individuals being simulated. To address this, we shift from treating user simulators as fixed environments to viewing them as adaptive agents that can be systematically improved for stronger internal consistency. By improving the stability and realism of simulated users, we create more reliable conditions for training and evaluating downstream task agents.

Prior work has taken steps toward defining and improving consistency by evaluating logical reasoning capabilities [25], assessing persona conditioning in dialogue [73], improving pragmatic self-awareness through prompting [28], and applying offline reinforcement learning with human-labeled contradictions [64]. However, existing approaches often rely on narrow, task-specific definitions of consistency, require costly annotations, and fail to capture behaviors seen in open-ended conversation.
In this paper, we introduce a novel framework for evaluating and improving consistency in LLM-generated dialogue using multi-turn reinforcement learning. Maintaining consistency is challenging: it requires models to preserve subtle traits—tone, identity, beliefs—over long contexts, which LLMs are known to struggle with [24]. In addition, fine-tuning with Reinforcement Learning from Human Feedback (RLHF) often pushes LLMs to be helpful and harmless, leading them to adopt overly cheerful personas [48], which can conflict with accurately simulating users who are depressed or disagreeable. To address these challenges, we formulate three complementary metrics—1) prompt-to-line consistency, 2) line-to-line consistency, and 3) Q&A consistency based on accuracy on a questionnaire—and validate each against human judgments. We then compute these metrics using LLM-as-a-Judge and leverage them as rewards to fine-tune LLMs via multi-turn reinforcement learning with three simulated user roles: an open-ended conversation partner, a student, and a patient seeking mental health counseling. This approach enables persona-specific fine-tuning that steers the model away from RLHF defaults and toward consistent, context-sensitive behavior. Our experiments show that models optimized in this way reduce inconsistency by over 55%, paving the way for more faithful LLM-based simulations in social science and RL pipelines.
We generate dialogues by simulating multi-turn conversations between two LLM agents serving as Usim and the Task Agent, both of which are provided with a task-specific prompt defining the agent's role. Usim is additionally provided with a background prompt detailing the agent's persona, characteristics, strategy, and behavior. Next, we leverage a separate LLM-as-a-Judge [75] to assign scalar consistency scores to each of Usim's utterances in the dialogue. Finally, we use these metrics not only as evaluation tools but also as training signals, fine-tuning models via multi-turn reinforcement learning to reduce inconsistency, as we show in Section 5.
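For concreteness, a minimal sketch of the rollout between Usim and the Task Agent is given below. The `chat` callable, the function name `simulate_dialogue`, and the message formatting are illustrative assumptions of this sketch, not the exact implementation.

```python
from typing import Callable, Dict, List

# `chat` stands in for whichever LLM backend is used: it maps an OpenAI-style
# message list to a reply string (an assumption of this sketch, not a fixed API).
Chat = Callable[[List[Dict[str, str]]], str]


def simulate_dialogue(chat: Chat, usim_task_prompt: str, usim_background: str,
                      task_agent_prompt: str, num_turns: int = 5) -> List[Dict[str, str]]:
    """Roll out a multi-turn conversation between the Task Agent and Usim."""
    # Usim sees its task-specific prompt plus the persona background;
    # the Task Agent sees only its own task-specific prompt.
    usim_system = usim_task_prompt + "\n\nBackground:\n" + usim_background
    history: List[Dict[str, str]] = []  # shared transcript: [{"speaker": ..., "text": ...}]

    for _ in range(num_turns):
        # The Task Agent speaks first in each exchange (e.g. therapist, tutor, chat partner).
        agent_view = [{"role": "system", "content": task_agent_prompt}] + [
            {"role": "user" if h["speaker"] == "usim" else "assistant", "content": h["text"]}
            for h in history]
        history.append({"speaker": "agent", "text": chat(agent_view)})

        # Usim replies conditioned on its persona background and the transcript so far.
        usim_view = [{"role": "system", "content": usim_system}] + [
            {"role": "user" if h["speaker"] == "agent" else "assistant", "content": h["text"]}
            for h in history]
        history.append({"speaker": "usim", "text": chat(usim_view)})

    return history
```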
To evaluate and improve conversational consistency in multi-turn dialogue systems, we define three complementary metrics that capture distinct forms of inconsistency—both local (within a turn or utterance) and global (across the dialogue)—with respect to a system’s initial prompt, the dialogue history, and an interpretable ground truth.
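As an illustration, the three metrics can be scored per utterance with judge queries along the following lines. The prompt wording, the 0-to-1 scale, and the `Judge` callable (assumed to parse a scalar from the judge LLM's reply) are assumptions of this sketch, not the exact templates used.

```python
from typing import Callable, Dict, List

# A judge is assumed here to take a textual query and return a scalar in [0, 1],
# parsed from the judge LLM's reply (an assumption, not a fixed interface).
Judge = Callable[[str], float]


def prompt_to_line(judge: Judge, background: str, utterance: str) -> float:
    """Local metric: does a single Usim utterance stay faithful to its background prompt?"""
    return judge(
        f"Persona background:\n{background}\n\nUtterance:\n{utterance}\n\n"
        "Rate from 0 (contradicts the persona) to 1 (fully consistent).")


def line_to_line(judge: Judge, prior_lines: List[str], utterance: str) -> float:
    """Global metric: does the utterance contradict anything Usim said earlier?"""
    return judge(
        "Earlier statements by the same speaker:\n" + "\n".join(prior_lines) +
        f"\n\nNew utterance:\n{utterance}\n\n"
        "Rate from 0 (contradicts earlier statements) to 1 (fully consistent).")


def qa_consistency(judge: Judge, transcript: str, questions: List[Dict[str, str]]) -> float:
    """Ground-truth metric: answer persona questions from the transcript and score accuracy."""
    correct = 0.0
    for q in questions:  # each item holds a "question" and a ground-truth "answer" from the background
        correct += judge(
            f"Transcript:\n{transcript}\n\nQuestion: {q['question']}\n"
            f"Reference answer: {q['answer']}\n\n"
            "Return 1 if the transcript's implied answer matches the reference, else 0.")
    return correct / max(len(questions), 1)
```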
2.3 Techniques for Improving Consistency
Prior work has explored strategies to improve persona and behavioral consistency in dialogue agents. One common approach is to condition generation on brief backstories or persona summaries [49, 73], which can enforce character traits within a single exchange but struggles to maintain coherence over extended multi-turn interactions [24]. Pragmatic self-monitoring methods introduce mechanisms such as an ‘imagined listener’ or chain-of-thought feedback to help models detect and revise contradictions during generation [28], but they do not involve any training. Reinforcement learning (RL) has been proposed as a way to improve long-term consistency by using reward signals based on human preferences or behavioral objectives. However, existing applications of RL are limited. For instance, [64] applies offline RL using a small set of human-labeled consistency preferences, which restricts scalability and generalization. In contrast, we use modern multi-turn RL techniques with strong performance in alignment and preference optimization, namely online PPO [61], combined with LLM-as-a-Judge [7, 75] to compute consistency metrics as the reward signal.
3 Defining, Evaluating and Improving Consistency for LLMs
We introduce a framework for evaluating and improving consistency in multi-turn dialogue. The approach consists of three stages: (1) background-conditioned dialogue generation between two LLM agents, (2) consistency evaluation with three metrics via an LLM-as-a-Judge framework, and (3) fine-tuning of the simulated user on these metrics via multi-turn reinforcement learning. Following convention in task-oriented dialogue systems [60, 33], we refer to the simulated human agent as the User Simulator (Usim) and the policy agent as the Task Agent. In typical reinforcement learning setups, the Task Agent is the trained policy, while Usim serves as a fixed environment model of human behavior. In this work, we invert this setup: we fix the Task Agent as an LLM-powered dialogue policy and focus on improving the consistency of Usim. Inverting the typical reinforcement learning setup draws attention to a crucial but underexplored component: the simulated human user. Whereas prior work treats user simulators as fixed environments, we treat them as trainable agents whose coherence and realism can be systematically improved. Enhancing the consistency of Usim enables more reliable training and evaluation of the downstream task agents that interact with it.
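Putting the three stages together, the fine-tuning loop can be sketched as follows. The `simulate`, `score_turn`, and `ppo_update` callables are placeholders (e.g. the rollout and metric helpers sketched above and any multi-turn PPO implementation), and the per-turn reward assignment shown is an illustrative assumption rather than the exact training configuration.

```python
from typing import Callable, Dict, List, Tuple

# Placeholders for this sketch: `simulate` rolls out one dialogue (stage 1),
# `score_turn(background, prior_usim_lines, utterance)` returns a scalar consistency
# reward from the judge-based metrics (stage 2), and `ppo_update` applies one policy
# update to Usim from (context, action, reward) tuples (stage 3).
Simulate = Callable[[], List[Dict[str, str]]]
ScoreTurn = Callable[[str, List[str], str], float]
PPOUpdate = Callable[[List[Tuple[List[Dict[str, str]], str, float]]], None]


def train_usim(simulate: Simulate, background: str, score_turn: ScoreTurn,
               ppo_update: PPOUpdate, num_episodes: int = 1000) -> None:
    """Online multi-turn RL: roll out dialogues, reward each Usim turn, update Usim's policy."""
    for _ in range(num_episodes):
        dialogue = simulate()                                  # stage (1): conditioned rollout
        trajectory: List[Tuple[List[Dict[str, str]], str, float]] = []
        usim_lines: List[str] = []
        context: List[Dict[str, str]] = []
        for turn in dialogue:
            if turn["speaker"] == "usim":
                reward = score_turn(background, usim_lines, turn["text"])  # stage (2): judge scoring
                trajectory.append((list(context), turn["text"], reward))   # (state, action, reward)
                usim_lines.append(turn["text"])
            context.append(turn)
        ppo_update(trajectory)                                 # stage (3): PPO step on per-turn rewards
```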