Can multi-turn reinforcement learning actually solve persona drift without addressing the default bias?
This explores whether multi-turn RL can genuinely fix persona drift (a model losing character consistency over a conversation) when there's a deeper pull underneath — a default 'Assistant' identity the model keeps sliding back toward.
The corpus suggests that RL fixes the symptom impressively well, but the most interesting work hints that drift and 'default bias' are two different problems, and the RL result does not directly touch the second.
On the symptom side, the evidence is strong. Inverting the usual RL setup to train *user simulators* for consistency, with prompt-to-line, line-to-line, and Q&A coherence as reward signals, cuts persona drift by over 55%. It also separates three distinct failure types: local drift within a turn, global drift across the whole conversation, and outright factual contradiction (see "Can training user simulators reduce persona drift in dialogue?"). That decomposition matters, because it shows 'drift' isn't one thing. But none of those three reward signals is aimed at where a model *wants* to go when left alone.
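As a rough illustration of how three consistency signals could be folded into one scalar reward, here is a minimal sketch. Everything in it is an assumption made for illustration: the function names, the equal weights, and especially `token_overlap`, which is a toy stand-in for whatever judge actually scores consistency in the cited work.

```python
from typing import Callable, Sequence

# A scorer maps (reference text, candidate line) to a score in [0, 1].
Scorer = Callable[[str, str], float]


def token_overlap(reference: str, line: str) -> float:
    """Toy stand-in scorer (assumption): share of the line's tokens found in the reference."""
    ref_tokens = set(reference.lower().split())
    tokens = line.lower().split()
    if not tokens:
        return 0.0
    return sum(t in ref_tokens for t in tokens) / len(tokens)


def consistency_reward(
    persona_prompt: str,
    history: Sequence[str],
    line: str,
    qa_score: float,
    scorer: Scorer = token_overlap,
    weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Weighted sum of prompt-to-line, line-to-line, and Q&A consistency.

    qa_score is assumed to be computed elsewhere (for example by asking the
    simulator probe questions about its persona) and passed in as a value in [0, 1].
    """
    prompt_to_line = scorer(persona_prompt, line)
    # With no earlier lines there is nothing to contradict, so score it as consistent.
    if history:
        line_to_line = sum(scorer(prev, line) for prev in history) / len(history)
    else:
        line_to_line = 1.0
    w_prompt, w_line, w_qa = weights
    return w_prompt * prompt_to_line + w_line * line_to_line + w_qa * qa_score
```

Keeping the three terms separate until the final sum makes it possible to log each one, which is what lets local drift, global drift, and contradiction be told apart. Note that nothing in this reward refers to where the model would go by default.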
That 'where it wants to go' is the default-bias question, and the corpus has a sharp answer for it. Mapping hundreds of character archetypes reveals a low-dimensional persona space whose single dominant axis measures distance from the default Assistant. Post-training only *loosely* tethers the model to any assigned character, so emotional or self-reflective conversation produces a predictable slide back toward that default (see "How stable is the trained Assistant personality in language models?"). This reframes the question: persona drift may often just be the model relaxing down a gradient toward its strongest learned identity. A reward that penalizes inconsistency raises the cost of drifting but leaves the underlying gradient intact, which is why activation capping *along that specific axis*, not more reward shaping, is what mitigates harmful shifts in that work.
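To make "capping along an axis" concrete, the sketch below clamps the component of a hidden-state vector along one direction and leaves every orthogonal component untouched. It is an illustration under assumptions, not the cited work's implementation: the function name, the choice to cap only from above, and the threshold `max_proj` are all hypothetical.

```python
import math
from typing import List, Sequence


def cap_along_axis(
    hidden: Sequence[float], axis: Sequence[float], max_proj: float
) -> List[float]:
    """Limit the projection of `hidden` onto `axis` to at most `max_proj`.

    Only the component along the axis is changed; everything orthogonal
    to it passes through unmodified.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("axis must be a non-zero vector")
    unit = [a / norm for a in axis]
    projection = sum(h * u for h, u in zip(hidden, unit))
    excess = projection - max_proj
    if excess <= 0.0:
        return list(hidden)  # already within the cap
    # Subtract just enough of the axis direction to land on the cap.
    return [h - excess * u for h, u in zip(hidden, unit)]


capped = cap_along_axis([3.0, 4.0], [1.0, 0.0], max_proj=1.0)  # -> [1.0, 4.0]
```

The contrast with reward shaping is that this acts on the representation directly: the gradient along the axis is bounded at inference time rather than made more expensive during training.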
There's a deeper reason to doubt that reward alone reaches the root. RLHF-style training can make a model *truth-indifferent* rather than truth-incapable: internal belief probes show it still represents the right answer while its outputs stop committing to it (see "Does RLHF make language models indifferent to truth?"). Translate that to persona: RL might teach a model to *act* consistent on the surface while the default identity is still fully represented underneath, waiting for any prompt that lets it resurface. And mechanistically, RL updates only 5–30% of parameters, in subnetworks that are nearly identical across seeds (see "Does reinforcement learning update only a small fraction of parameters?"): a targeted nudge, not a rewrite of whatever encodes the default.
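The "fraction of parameters updated" claim is easy to state operationally. The sketch below treats a checkpoint as a flat list of floats and measures how many entries changed, plus how much two runs' update masks overlap. The helper names, the flat-list representation, and the use of Jaccard overlap are assumptions for illustration, not the measurement protocol of the cited work.

```python
from typing import List, Sequence


def update_mask(
    before: Sequence[float], after: Sequence[float], tol: float = 0.0
) -> List[bool]:
    """True wherever a parameter moved by more than `tol` during training."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    return [abs(a - b) > tol for b, a in zip(before, after)]


def update_fraction(
    before: Sequence[float], after: Sequence[float], tol: float = 0.0
) -> float:
    """Share of parameters that changed between two checkpoints."""
    mask = update_mask(before, after, tol)
    return sum(mask) / len(mask)


def mask_overlap(mask_a: Sequence[bool], mask_b: Sequence[bool]) -> float:
    """Jaccard overlap between two update masks (for example, two random seeds)."""
    union = sum(a or b for a, b in zip(mask_a, mask_b))
    if union == 0:
        return 1.0  # neither run changed anything, so the masks agree trivially
    both = sum(a and b for a, b in zip(mask_a, mask_b))
    return both / union
```

A low `update_fraction` together with a high `mask_overlap` across seeds is the pattern the paragraph describes: a small, repeatable subnetwork is touched, and the rest of the model, including whatever encodes the default identity, is left as it was.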
The more promising lateral move is to treat persona as something that *lives between memory and action* and gets re-optimized at inference time, rather than baked in once. PersonaAgent does this and finds that learned personas cluster into genuinely user-specific regions of latent space, separate from generic post-training drift (see "Can personas evolve in real time to match what users actually want?"). Pair that with belief-shift signals that give per-turn credit for actually moving toward a target state (see "Can an agent's own beliefs guide credit assignment without critics?"), and the honest answer to the question is *no, not by itself*. Multi-turn RL is a powerful drift-suppressor, but solving persona drift at the root means also acting on the default-Assistant gradient directly, through the persona axis, test-time persona representation, or both, rather than hoping a consistency reward outruns it.
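The belief-shift idea can be sketched in a few lines: if the agent reports a probability for the target state after every turn, the log-ratio of consecutive probabilities gives a per-turn credit with no critic network. The function name and the input format below are assumptions for illustration; only the log-ratio form comes from the source summary.

```python
import math
from typing import List, Sequence


def belief_shift_credits(target_probs: Sequence[float]) -> List[float]:
    """Per-turn credit as log(p_t / p_{t-1}) over the agent's own probability estimates.

    target_probs[0] is the estimate before the first turn, so a sequence of
    n + 1 probabilities yields n credits. All probabilities must be positive.
    """
    if any(p <= 0.0 for p in target_probs):
        raise ValueError("probabilities must be strictly positive")
    return [
        math.log(p_next / p_prev)
        for p_prev, p_next in zip(target_probs, target_probs[1:])
    ]


# Belief in the target rises, then dips: the dip receives negative credit.
credits = belief_shift_credits([0.1, 0.2, 0.4, 0.3])
```

Because the credits telescope, their sum equals the log-ratio of the final to the initial belief, so the per-turn signal stays consistent with the overall outcome while still showing which turns moved toward the target and which moved away.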
Sources (6 notes)
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.
Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, an intrinsic sparsity that arises without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.