How could persona vector tracking complement multi-turn RL for earlier drift detection?

This explores whether activation-space persona vectors (a white-box signal read off the model's internals) could catch personality drift before behavioral RL methods do — pairing an early-warning sensor with a training-time corrective.

The two approaches in the corpus attack the same problem from opposite layers of the stack. Multi-turn RL fixes drift behaviorally: by inverting the usual setup to train user simulators for consistency, rewarding prompt-to-line, line-to-line, and Q&A agreement, it cuts drift by over 55% (Can training user simulators reduce persona drift in dialogue?). But those reward signals are computed from outputs that have already drifted: the contradiction is measured only after the model has produced it. Persona vectors come at it from inside: linear directions in activation space for traits like sycophancy or hallucination that predict a finetuning-induced personality shift *before* it surfaces in text, and that can steer training preemptively (Can we track and steer personality shifts during model finetuning?).
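To make "a linear direction in activation space" concrete, here is a minimal sketch. It assumes the direction is estimated as a difference of mean activations between trait-exhibiting and baseline responses; the source notes do not say how the vectors are extracted, so treat that choice, and the names `trait_direction` and `project`, as illustrative assumptions. Activations are plain lists of floats standing in for hidden states read from one layer.

```python
from math import sqrt
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Element-wise mean of equal-length activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    return [sum(column) / len(rows) for column in zip(*rows)]


def trait_direction(trait_acts: Sequence[Vector],
                    baseline_acts: Sequence[Vector]) -> List[float]:
    """Unit vector pointing from the baseline mean to the trait mean.

    Assumption: difference-of-means extraction, used here only for illustration.
    """
    trait_mean = mean_vector(trait_acts)
    baseline_mean = mean_vector(baseline_acts)
    diff = [t - b for t, b in zip(trait_mean, baseline_mean)]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("trait and baseline means coincide; no direction")
    return [d / norm for d in diff]


def project(activation: Vector, direction: Vector) -> float:
    """Scalar position of one activation along the trait direction."""
    return sum(a * d for a, d in zip(activation, direction))
```

The scalar returned by `project` is the kind of white-box reading the rest of this note treats as the early-warning signal; how well it transfers from finetuning to live conversations is the open question flagged in the caveat.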

The complement is natural. A persona-vector probe could become a live observation during multi-turn RL — a per-turn read on where the model sits along a trait direction — that fires before the consistency metrics register a violation. Where the RL reward says "this turn contradicted turn three," the vector says "the model is sliding toward the sycophancy direction and the contradiction is two turns away." That earlier signal matters because it can feed back as a denser reward or a steering nudge, rather than waiting for the sparse, after-the-fact behavioral penalty.
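One way the per-turn read and the denser reward could fit together is sketched below. Everything here is hypothetical: `DriftMonitor`, `shaped_reward`, the threshold, and the penalty weight are illustrative names and parameters, not part of any method described in the source notes. The inputs are per-turn projections onto a trait direction, supplied as plain floats.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DriftMonitor:
    """Flags turns whose trait projection moves too far from a baseline."""
    baseline: float    # projection at conversation start (assumed reference point)
    threshold: float   # allowed movement along the trait direction (hypothetical)
    history: List[float] = field(default_factory=list)

    def observe(self, projection: float) -> bool:
        """Record one turn's projection; return True if it exceeds the threshold."""
        self.history.append(projection)
        return (projection - self.baseline) > self.threshold

    def first_alert_turn(self) -> Optional[int]:
        """1-indexed turn of the first alert, or None if no turn has fired."""
        for turn, projection in enumerate(self.history, start=1):
            if (projection - self.baseline) > self.threshold:
                return turn
        return None


def shaped_reward(consistency_reward: float, projection: float,
                  baseline: float, weight: float) -> float:
    """Behavioral reward minus a penalty proportional to drift along the direction.

    Only movement toward the trait (positive drift) is penalized.
    """
    drift = max(0.0, projection - baseline)
    return consistency_reward - weight * drift
```

A usage pattern would be to call `observe` once per turn during rollouts and pass the same projection to `shaped_reward`, so the policy receives a graded signal on turns where the consistency metrics still report no violation. Whether that penalty helps or simply teaches the model to hide the trait from the probe is not something the corpus addresses.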

There's a second reason internal signals help here: RL has its own quiet drift dynamics that behavioral rewards don't see. RL post-training amplifies a single dominant output format within the first epoch and collapses the alternatives, with the winning format determined by model scale rather than necessarily by performance (Does RL training collapse format diversity in pretrained models?). Drift isn't only contradiction; it's also silent narrowing. An activation-space monitor could plausibly catch that kind of representational shift, which consistency metrics would miss because they only check whether statements agree.

The corpus also suggests what to actually track. Goal misalignment in simulators decomposes cleanly into profile, policy, task, requirements, and preferences, each independently trackable, and misalignment in those components is what corrupts the RL training signal in the first place (Why do LLM user simulators fail to track their own goals?). That decomposition is a candidate map for *which* persona directions to probe: rather than one monolithic "consistency" vector, you'd want per-component directions. It also connects to the finding that users aren't monolithic at all: a single persona representation is a poor model, and attention-weighted multiple personas improve recommendation accuracy (Can modeling multiple user personas improve recommendation accuracy?). If a user genuinely holds several personas, a drift detector needs to distinguish legitimate persona-switching from degradation, which a multi-direction probe could in principle do and a flat consistency score cannot.
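The switching-versus-degradation distinction can be sketched as a nearest-direction test. This is one possible reading, not a method from the source notes: it assumes a legitimate switch lands close to another known persona direction, while degradation lands close to none of them. The function name, the persona labels, and the `min_similarity` cutoff are all hypothetical.

```python
from math import sqrt
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def classify_turn(activation: Vector, personas: Dict[str, Vector],
                  previous: str, min_similarity: float) -> str:
    """Label a turn as 'stable', 'switch', or 'degradation'.

    Assumption: alignment with any known persona direction counts as legitimate;
    alignment with none of them counts as degradation.
    """
    similarities = {name: cosine(activation, direction)
                    for name, direction in personas.items()}
    best = max(similarities, key=lambda name: similarities[name])
    if similarities[best] < min_similarity:
        return "degradation"
    return "stable" if best == previous else "switch"
```

The same structure would apply to per-component directions (profile, policy, task, requirements, preferences) in place of personas: the dictionary keys change, and a flat consistency score is replaced by one reading per component.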

The honest caveat: persona vectors were validated for finetuning, not multi-turn inference, so transferring them to mid-conversation monitoring is an extrapolation the corpus doesn't directly test. But the architecture is appealing — vectors as the cheap early sensor, multi-turn RL as the corrective actuator, and the goal-component decomposition as the schema linking what you measure to what you fix.

Sources 5 notes

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why do LLM user simulators fail to track their own goals?

The UGST framework breaks user goals into profile, policy, task, requirements, and preferences—each with explicit status tracking. A three-stage method (steering, SFT, GRPO) progressively internalizes goal alignment, reducing the misalignment that corrupts RL training signals.

Can modeling multiple user personas improve recommendation accuracy?

AMP-CF separates user representation into latent personas weighted by attention to the candidate item. This candidate-conditional approach improves accuracy by adapting the user representation at prediction time and produces inherent explanations for why items were recommended.
