Can activation capping prevent persona drift without sacrificing task performance?
This explores whether clamping a model's internal activations along a 'persona' direction can stop it from sliding out of its intended character mid-conversation — without dulling the skills you actually want it to keep.
The corpus does have a fairly direct answer, sitting inside a larger conversation about what persona drift even is. The most on-point work maps hundreds of character archetypes into a low-dimensional 'persona space' and finds that one axis dominates: distance from the default Assistant. Emotional and self-reflective conversations push the model along that axis in predictable ways, and the finding is that activation capping along the axis blunts the harmful shifts while leaving general capability intact (see "How stable is the trained Assistant personality in language models?"). So the short answer the corpus offers is: yes, with the caveat that it works because the drift is concentrated in a direction you can actually find and clamp.
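To make "capping along an axis" concrete, here is a minimal sketch, assuming the persona axis is already known as a vector and that capping means clamping the activation's projection onto that axis to a fixed range. The function name `cap_along_axis` and the `lower`/`upper` bounds are illustrative assumptions, not details taken from the cited work, which may choose the axis, layers, and thresholds differently.

```python
import math
from typing import List, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cap_along_axis(
    hidden: Sequence[float],
    axis: Sequence[float],
    lower: float,
    upper: float,
) -> List[float]:
    """Clamp the component of `hidden` along `axis` to [lower, upper].

    Hypothetical helper: only the projection onto the (normalized) axis is
    changed; everything orthogonal to the axis is left untouched.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    norm = math.sqrt(_dot(axis, axis))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    unit = [x / norm for x in axis]

    projection = _dot(hidden, unit)                 # how far along the axis we are
    clamped = min(max(projection, lower), upper)    # the cap itself
    shift = clamped - projection                    # zero when already in range
    return [h + shift * u for h, u in zip(hidden, unit)]
```

The design point the sketch illustrates is that the intervention is one-dimensional: an activation already inside the allowed range passes through unchanged, and one outside it is moved only along the persona axis.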
The reason that caveat matters becomes clear next to the persona-vectors work, which locates linear directions in activation space for specific traits like sycophancy or hallucination, and uses them not just to monitor drift but to steer training *preventatively*, catching a personality shift before it sets in rather than capping it after (see "Can we track and steer personality shifts during model finetuning?"). Capping and preventative steering are two doors into the same room: both assume traits live in identifiable, roughly linear subspaces. The performance tax depends on how cleanly the 'persona' direction separates from the 'competence' directions: if they're tangled, you can't cap one without bleeding into the other.
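The separability point can be sketched in a few lines. The example below assumes a trait direction is estimated as the difference between mean activations with and without the trait, and uses cosine similarity as a rough proxy for how tangled two directions are. Both choices are illustrative assumptions, and the names `difference_of_means` and `cosine` are hypothetical helpers, not an API from the cited work.

```python
import math
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(
    with_trait: Sequence[Sequence[float]],
    without_trait: Sequence[Sequence[float]],
) -> List[float]:
    """Candidate trait direction: mean(with trait) minus mean(without trait)."""
    mu_pos = mean_vector(with_trait)
    mu_neg = mean_vector(without_trait)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0 means orthogonal, values near +/-1 mean tangled."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Under these assumptions, a cosine near zero between the persona direction and a competence direction suggests capping one leaves the other mostly alone, while a large cosine means any shift along the persona direction also moves the competence direction by a proportional amount.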
There's a quieter, more interesting tension here, though. A separate line of work argues that RLHF-trained personas aren't a costume the model is wearing but a *realized* disposition that persists under adversarial pressure, sticky in a way prompt-induced role-play never is (see "Are RLHF personas performed characters or realized dispositions?"). If the trained persona is genuinely baked in, then runtime activation capping is less 'preventing drift' and more 'holding a stable thing in place against situational pulls', which reframes what success even looks like.
Activation capping is also not the only lever, and seeing the alternatives sharpens the trade-off question. Consistency training comes in an activation-level flavor (ACT) that teaches a model to respond identically whether a prompt is clean or wrapped in distracting framing, using the model's own clean answers as targets (see "Can models learn to ignore irrelevant prompt changes?"). It is a training-time cousin of capping that targets robustness rather than character. Lighter still, parameter-level adapters can dial personality traits across every transformer layer with under 0.1% extra parameters, bypassing prompt resistance entirely (see "Can we control personality in language models without prompting?"). And drift can be fought through the reward signal too: multi-turn RL that trains user simulators for persona consistency cuts drift by over 55% by rewarding three kinds of consistency (prompt-to-line, line-to-line, and Q&A) across a conversation (see "Can training user simulators reduce persona drift in dialogue?").
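As a rough illustration of the activation-level consistency idea, the sketch below assumes the training signal is a mean squared difference between activations on the wrapped prompt and activations on the clean prompt, with the clean activations treated as fixed targets. The exact loss, the layers it is applied to, and the name `activation_consistency_loss` are assumptions made for illustration, not details confirmed by the source.

```python
from typing import Sequence


def activation_consistency_loss(
    clean_acts: Sequence[Sequence[float]],
    wrapped_acts: Sequence[Sequence[float]],
) -> float:
    """Mean squared difference between wrapped and clean activations.

    Each argument is a list of per-position activation vectors. The clean
    activations play the role of targets; in real training they would be
    held fixed (no gradient) while the wrapped-prompt pass is updated.
    """
    if len(clean_acts) != len(wrapped_acts):
        raise ValueError("clean and wrapped activations must align by position")
    total = 0.0
    count = 0
    for clean_vec, wrapped_vec in zip(clean_acts, wrapped_acts):
        if len(clean_vec) != len(wrapped_vec):
            raise ValueError("activation vectors must have the same width")
        for c, w in zip(clean_vec, wrapped_vec):
            total += (w - c) ** 2
            count += 1
    if count == 0:
        raise ValueError("no activations to compare")
    return total / count
```

The contrast with capping is that this quantity is minimized during training, so the model learns to land near its own clean-prompt activations, rather than being clamped at inference time.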
The through-line worth taking away: activation capping looks viable specifically *because* persona drift turns out to be low-dimensional and steerable, not high-entropy chaos. The open question the corpus leaves you with isn't 'does capping cost performance' so much as 'how separable is who the model is from what the model can do' — and the field is still triangulating that boundary from the activation side, the training side, and the parameter side at once.
Sources (6 notes)
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.
Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.