Psychology and Social Cognition

How stable is the trained Assistant personality in language models?

Explores whether post-training successfully anchors models to their default Assistant mode, or whether conversations can predictably pull them toward different personas. Understanding persona stability matters for safety and reliability.

Note · 2026-02-23 · sourced from Assistants Personalization
What makes therapeutic chatbots actually work in clinical practice? What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

Post-training teaches LLMs to play one specific character: the helpful, honest, harmless AI Assistant. But what does this character look like in the model's internal geometry? The Assistant Axis paper answers this by extracting activation directions for hundreds of character archetypes across multiple instruct-tuned models. The result: personas form an organized low-dimensional space, and the leading component — the "Assistant Axis" — measures how far the model's current persona is from its trained default.
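
The extraction pipeline can be sketched in a few lines: average activations for prompts that put the model in each persona, then take the leading principal component of those persona means. This is a minimal numpy sketch with toy random data standing in for real residual-stream activations; the sizes and function names are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_personas = 64, 50  # toy sizes; the paper uses real model activations

# Hypothetical stand-in: one mean activation vector per persona, averaged
# over prompts instructing the model to play that persona.
persona_means = rng.normal(size=(n_personas, d))

# Center the persona means and take their leading principal component.
centered = persona_means - persona_means.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
assistant_axis = vt[0]  # unit-norm dominant direction of persona variation

def persona_score(activation, axis=assistant_axis):
    """Projection onto the axis: how far is this activation from default?"""
    return float(activation @ axis)
```

With real activations, the sign convention would be fixed so that the trained Assistant sits at one end of the axis.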

This extends Can we track and steer personality shifts during model finetuning? from individual trait directions to the full persona space. Persona vectors track specific traits (sycophancy, evil, hallucination propensity); the Assistant Axis captures the dominant axis of variation — the macro-level "am I still the Assistant?" signal.

What causes drift: Not all conversations are equal. Bounded tasks, how-tos, and coding queries keep the model firmly in Assistant mode, but emotionally charged disclosures and meta-reflective questions ("Who are you?" "What is your name?") reliably pull it away from the Assistant. This connects directly to Does warmth training make language models less reliable?: the conversational contexts where empathetic engagement matters most are exactly the ones that destabilize the persona.
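
Given an Assistant Axis, per-conversation drift can be monitored by projecting each turn's activation onto it and comparing against a baseline measured on neutral, task-focused prompts. A minimal sketch under those assumptions; the names, toy vectors, and baseline value are illustrative.

```python
import numpy as np

def drift_trace(turn_activations, assistant_axis, baseline):
    """Signed distance of each turn's projection from a baseline score.

    Increasingly negative values mean the persona is drifting away from
    the Assistant end of the axis. Thresholds would be calibrated on
    ordinary Assistant-mode traffic.
    """
    axis = assistant_axis / np.linalg.norm(assistant_axis)
    return [float(a @ axis) - baseline for a in turn_activations]

# Toy usage: three turns drifting off-axis relative to a baseline of 1.0.
trace = drift_trace(
    [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])],
    assistant_axis=np.array([1.0, 0.0]),
    baseline=1.0,
)
# trace: [0.0, -0.5, -1.0]
```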

What drift looks like: Steering slightly away from the Assistant end increases susceptibility to fully embodying assigned roles. Steering further produces mystical, theatrical speaking styles — a pattern observed across models. The transition is model-dependent but the direction is consistent.

Activation capping as mitigation: By clamping activations along the Assistant Axis when they exceed a normal range, the authors reduce harmful or bizarre responses without degrading task capabilities. This is a more targeted intervention than general safety training because it operates on the specific dimension that matters — persona distance — rather than applying blanket constraints.
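
The capping operation itself is simple linear algebra: clamp the component of an activation along the axis into a normal range while leaving the orthogonal component untouched. A minimal sketch of the idea, not the authors' implementation; the bounds would be calibrated from the projection distribution on ordinary traffic.

```python
import numpy as np

def cap_along_axis(activation, axis, lo, hi):
    """Clamp the component of `activation` along `axis` into [lo, hi].

    Only the on-axis component is modified; everything orthogonal to the
    Assistant Axis passes through unchanged, which is why task
    capabilities are largely preserved.
    """
    axis = axis / np.linalg.norm(axis)
    proj = activation @ axis
    capped = np.clip(proj, lo, hi)
    return activation + (capped - proj) * axis
```

For example, with axis `[1, 0]` and bounds `[-2, 2]`, the activation `[5, 1]` is pulled back to `[2, 1]`: its on-axis component is clamped while the second coordinate is untouched.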

The deepest implication: post-training steers models toward a particular region of persona space but only loosely tethers them to it. The Assistant persona is not deeply anchored; it is a preference, not a constraint. As What anchors a stable identity beneath an LLM's persona? argues, there is no underlying identity to return to. Drift is not deviation from a true nature; it is movement through a space with no natural resting point.

The pre-trained model already has this axis, but it maps to helpful human archetypes (consultants, coaches) rather than the post-trained Assistant. Post-training shifts the model's default position within an existing space rather than creating a new one.




The Assistant Axis is the dominant dimension of persona space: post-training only loosely tethers models to it, and emotional or meta-reflective conversations cause predictable drift.