Conversational AI Systems Psychology and Social Cognition

Why does supervised learning fail to enforce persona consistency?

Supervised learning trains models to generate good responses but never punishes contradictions. This note explores why explicit negative feedback is structurally necessary for dialogue agents to maintain consistent personas, and what training methods can provide it.

Note · 2026-02-22 · sourced from Personas Personality

The Building Persona Consistent Dialogue study identifies a structural limitation of supervised learning (SL) for persona-based chatbots: SL trains models to generate good responses but never explicitly punishes contradictory utterances. A model trained with SL can learn to produce persona-consistent responses in general while remaining insensitive to specific contradictions — because contradictions are never negatively reinforced.
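
To make the gap concrete, here is a minimal PyTorch sketch: standard MLE only raises the probability of gold responses, while an explicit penalty on a known-contradictory response (an unlikelihood-style term, used purely as an illustration, not the paper's method) is exactly the kind of signal SL never provides.

```python
import torch
import torch.nn.functional as F

def mle_loss(logits, gold_ids):
    # Standard supervised fine-tuning: maximize log-likelihood of gold tokens.
    # Contradictory responses never appear as targets, so their probability
    # is never explicitly pushed down.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), gold_ids.view(-1))

def unlikelihood_penalty(logits, contradiction_ids):
    # Illustration only: an explicit negative signal that pushes down the
    # probability of tokens from a known-contradictory response
    # (unlikelihood-style, -log(1 - p)) — the term plain SL never includes.
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_p = log_probs.gather(-1, contradiction_ids.unsqueeze(-1)).squeeze(-1)
    return -torch.log1p(-token_log_p.exp() + 1e-6).mean()
```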

Online RL can address this by rewarding consistency and punishing contradiction during generation. But online RL for dialogue is expensive: the model must continuously generate new samples, and accurate critic models must evaluate both consistency and fluency simultaneously. Without fluency constraints, RL training degenerates.
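
A rough sketch of what an online reward has to compute for every freshly sampled response; the scorer names (nli_consistency, lm_fluency) and the weighting are placeholders, not the paper's setup.

```python
def online_reward(persona, history, sampled_response,
                  nli_consistency, lm_fluency, fluency_weight=0.5):
    """Hypothetical per-sample reward for online RL on dialogue.

    nli_consistency: critic scoring entailment/contradiction of the sampled
        response against the persona (e.g. an NLI model), roughly in [-1, 1].
    lm_fluency: critic scoring naturalness (e.g. normalized negative
        perplexity under a reference LM), roughly in [0, 1].
    Both critics must run on every newly generated sample, which is what
    makes the online setting expensive; dropping the fluency term lets
    training degenerate toward reward-hacked, unnatural text.
    """
    consistency = nli_consistency(persona, sampled_response)
    fluency = lm_fluency(history, sampled_response)
    return consistency + fluency_weight * fluency
```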

Offline RL offers a middle path: the model learns from a fixed dataset of consistent and contradictory responses, so consistency can be rewarded and contradiction punished without generating new samples or running critic models online.

The authors introduce VaRMI (Variance-Reducing MLE-Initialized importance sampling) to handle the high variance that offline RL typically suffers from.
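
A sketch of the offline objective this enables, assuming a standard importance-sampled policy gradient over logged responses with +1/-1 rewards; the VaRMI-specific detail (fixing importance weights of positive samples to 1 after MLE initialization) is one reading of the variance-reduction idea and should be checked against the paper.

```python
import torch

def offline_pg_loss(policy_log_probs, behavior_log_probs, rewards,
                    treat_positives_as_on_policy=True):
    """Importance-sampled offline policy gradient for dialogue responses.

    policy_log_probs / behavior_log_probs: summed log-probabilities of each
        logged response under the current policy and the behavior policy.
    rewards: +1 for persona-consistent responses, -1 for contradictory ones
        (the explicit negative signal SL lacks).
    treat_positives_as_on_policy: assumed VaRMI-style variance reduction:
        after MLE initialization on the positive data, importance weights for
        positive-reward samples are fixed to 1; verify against the paper.
    """
    weights = torch.exp(policy_log_probs - behavior_log_probs).detach()
    if treat_positives_as_on_policy:
        weights = torch.where(rewards > 0, torch.ones_like(weights), weights)
    # REINFORCE-style loss: raise log-prob of rewarded responses and lower it
    # for penalized ones, scaled by the importance weights.
    return -(weights * rewards * policy_log_probs).mean()
```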

The design principle is generalizable: any dialogue property that matters (factual accuracy, emotional consistency, persona adherence) requires explicit negative feedback in training, not just positive examples. SL's inability to punish is not a minor limitation — it's a structural gap that explains why persona consistency is hard to achieve through standard fine-tuning.

This connects to Does preference optimization damage conversational grounding in large language models? — both findings point to training method as the source of conversational failure. RLHF erodes grounding; SL fails to enforce consistency. The training pipeline shapes conversational behavior through what it optimizes and through what it fails to penalize.

Multi-turn online RL extension: The "Consistently Simulating Human Personas" paper extends the offline RL approach to online multi-turn RL, achieving over 55% inconsistency reduction. Three complementary metrics decompose drift into distinct types: prompt-to-line consistency (alignment with initial persona), line-to-line consistency (coherence with conversation history), and Q&A consistency (factual accuracy about persona). Using LLM-as-a-Judge to compute these metrics as continuous reward signals provides scalable automatic evaluation without human-annotated contradiction labels. The key architectural inversion: instead of training the task agent against a fixed user simulator, they fix the task agent and train the user simulator for consistency — treating simulated users as trainable agents rather than fixed environments. This also surfaces a specific RLHF problem: "RLHF fine-tuning often pushes LLMs to be helpful and harmless, thus adopting overly cheerful personas which can conflict with accurately simulating users who are depressed or disagreeable" (Can training user simulators reduce persona drift in dialogue?).
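
A sketch of how the three drift metrics could be folded into a scalar reward for the trainable user simulator; the judge interface and the equal weighting are assumptions, not the paper's exact recipe.

```python
def persona_drift_reward(judge, persona_prompt, history, new_line,
                         weights=(1 / 3, 1 / 3, 1 / 3)):
    """Composite reward from LLM-as-a-Judge scores, each assumed in [0, 1].

    judge(kind, ...) is a placeholder callable wrapping an LLM judge call;
    the three score kinds mirror the metrics described in the note.
    """
    prompt_to_line = judge("prompt_to_line", persona_prompt, new_line)   # alignment with initial persona
    line_to_line = judge("line_to_line", history, new_line)              # coherence with conversation history
    qa = judge("qa_consistency", persona_prompt, history, new_line)      # factual accuracy about persona
    w1, w2, w3 = weights
    return w1 * prompt_to_line + w2 * line_to_line + w3 * qa
```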


Source: Personas Personality, Conversation Agents

Original note title: persona consistency in dialogue requires explicit contradiction punishment — supervised learning never penalizes inconsistency while offline RL enables it