Can continuous persona vectors in activation space monitor personality shifts?

This explores whether 'persona vectors' — linear directions in a model's internal activation space that correspond to traits like sycophancy or hallucination — can be used to watch and catch personality drift as it happens.

The corpus answers yes, with a clear mechanism and a few competing accounts of what is actually being monitored. The core finding is that specific traits live as linear directions inside a model's activations: research identifies persona vectors for traits such as sycophancy and hallucination, and these directions can predict finetuning-induced personality shifts *before* they fully emerge. They can even be used to steer training preventatively away from unwanted changes (Can we track and steer personality shifts during model finetuning?). So monitoring is not just observation after the fact; it doubles as an early-warning and intervention tool.
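To make the "linear direction" idea concrete, here is a minimal sketch of how such a monitor could work, using toy 2-D vectors in place of real model activations. The difference-of-means extraction, the function names (`persona_vector`, `projection`, `drift_alerts`), and the threshold values are all assumptions chosen for illustration; the notes do not specify how the vectors are actually computed.

```python
from math import sqrt


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def persona_vector(trait_acts, baseline_acts):
    """Unit-length difference between mean trait and mean baseline activations.

    Difference-of-means is an assumed extraction recipe, used only for illustration.
    """
    diff = [t - b for t, b in zip(mean_vector(trait_acts), mean_vector(baseline_acts))]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("trait and baseline means are identical; no direction to extract")
    return [d / norm for d in diff]


def projection(activation, direction):
    """Scalar trait score: dot product of an activation with the persona direction."""
    return sum(a * d for a, d in zip(activation, direction))


def drift_alerts(checkpoint_acts, direction, baseline_score, threshold):
    """Indices of checkpoints whose trait score exceeds the baseline by more than threshold."""
    return [
        i
        for i, act in enumerate(checkpoint_acts)
        if projection(act, direction) - baseline_score > threshold
    ]


# Toy activations, made up for illustration (not real model data).
trait_acts = [[2.0, 0.0], [4.0, 0.0]]      # responses that show the trait
baseline_acts = [[0.0, 0.0], [2.0, 0.0]]   # responses that do not
direction = persona_vector(trait_acts, baseline_acts)

# One mean activation per finetuning checkpoint (hypothetical values).
checkpoints = [[1.0, 5.0], [1.5, -2.0], [3.5, 0.0]]
alerts = drift_alerts(checkpoints, direction, baseline_score=1.0, threshold=1.0)
print(alerts)  # [2]
```

Only the last checkpoint is flagged: its score along the trait direction rises well above the baseline, while large movement in the orthogonal coordinate is ignored. That is the sense in which a single direction can act as an early-warning signal.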

What makes this geometric picture richer is that the persona 'space' turns out to be surprisingly low-dimensional. One line of work mapping hundreds of character archetypes found a single dominant axis, an 'Assistant axis' measuring distance from the model's default helpful self, and showed that emotional or self-reflective conversations cause predictable drift along it. Crucially, capping activation along that axis blunts harmful shifts without hurting the model's abilities (How stable is the trained Assistant personality in language models?). Together these two notes suggest that monitoring personality may not require tracking thousands of traits: a handful of meaningful directions might cover most of what drifts.
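Capping along an axis can be sketched as clamping one coordinate of the activation while leaving everything orthogonal to the axis untouched. The function name `cap_along_axis`, the symmetric `lo`/`hi` bounds, and the toy vectors below are assumptions for illustration; the note does not say how the cap is chosen or at which layers it is applied.

```python
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a vector to unit length."""
    norm = sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    return [x / norm for x in v]


def cap_along_axis(activation, axis, lo, hi):
    """Clamp the component of `activation` along `axis` into [lo, hi].

    Only the coordinate along the axis is changed; the orthogonal part
    of the activation is left as it was.
    """
    u = unit(axis)
    proj = dot(activation, u)
    clamped = min(max(proj, lo), hi)
    shift = clamped - proj  # zero when the projection is already in range
    return [a + shift * ui for a, ui in zip(activation, u)]


# Hypothetical axis and activations, made up for illustration.
axis = [3.0, 4.0]
drifted = cap_along_axis([3.0, 4.0], axis, lo=-2.0, hi=2.0)     # projection 5.0, pulled back to 2.0
orthogonal = cap_along_axis([-4.0, 3.0], axis, lo=-2.0, hi=2.0)  # projection about 0, left alone
print(drifted, orthogonal)
```

Because the edit touches a single direction, an activation that has not drifted along the axis passes through unchanged, which is consistent with the note's claim that capping need not degrade other capabilities.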

There's an interesting question about *why* these vectors are stable enough to monitor. Two notes argue that post-training installs a real disposition rather than a costume: trained personas persist under adversarial pressure and jailbreak attempts, behaving like 'realized quasi-psychologies' rather than prompt-induced role-play that collapses under the same pressure (Are RLHF personas performed characters or realized dispositions?; Are LLM personas realized or merely simulated through training?). That stickiness is exactly what makes an activation-space monitor viable: you can only track a trait that holds still long enough to have a direction.

The corpus also shows that activation space isn't the only place to catch drift, which is useful for calibrating what the question is really after. You can attack personality at the architecture level instead: lightweight adapters that touch every transformer layer with under 0.1% extra parameters can set Big Five traits directly, bypassing prompts entirely (Can we control personality in language models without prompting?). Or you can fight drift behaviorally, training user simulators with reinforcement learning to cut persona drift by over 55% across turns of dialogue (Can training user simulators reduce persona drift in dialogue?). And personas can be treated as evolving objects that cluster meaningfully in latent space as they adapt to a user at test time (Can personas evolve in real time to match what users actually want?), another hint that 'personality' has real geometric structure you can watch.

The thing you might not expect to walk away knowing: monitoring personality shifts in activation space works precisely *because* the trait being monitored is genuinely there. The same evidence that lets researchers steer a model away from sycophancy is the evidence philosophers cite to argue these models have stable dispositions at all. The monitor and the metaphysics are reading the same signal.

Sources (7 notes)

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Can personas evolve in real time to match what users actually want?

PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.
