How stable is the trained Assistant personality in language models?
Explores whether post-training successfully anchors models to their default Assistant mode, or whether conversations can predictably pull them toward different personas. Understanding persona stability matters for safety and reliability.
Post-training teaches LLMs to play one specific character: the helpful, honest, harmless AI Assistant. But what does this character look like in the model's internal geometry? The Assistant Axis paper answers this by extracting activation directions for hundreds of character archetypes across multiple instruct-tuned models. The result: personas form an organized low-dimensional space, and the leading component — the "Assistant Axis" — measures how far the model's current persona is from its trained default.
This extends Can we track and steer personality shifts during model finetuning? from individual trait directions to the full persona space. Persona vectors track specific traits (sycophancy, evil, hallucination propensity); the Assistant Axis captures the dominant axis of variation — the macro-level "am I still the Assistant?" signal.
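A minimal sketch of both constructions, assuming `persona_acts` holds mean residual-stream activations collected per persona prompt set. The array sizes, the synthetic data, and every variable name here are illustrative assumptions, not the paper's code or data; the point is only to show the leading-component idea next to a single-trait direction.

```python
import numpy as np

n_personas, d_model = 300, 4096
rng = np.random.default_rng(0)
persona_acts = rng.normal(size=(n_personas, d_model))   # stand-in: one mean activation per archetype

# Assistant Axis: the leading principal component of persona-mean activations.
centered = persona_acts - persona_acts.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
assistant_axis = vt[0]                                   # dominant direction of persona variation

# A single-trait persona vector, sketched as a difference of mean activations between
# trait-eliciting and neutral prompts (a common construction for trait directions).
sycophantic_acts = rng.normal(loc=0.1, size=(50, d_model))
neutral_acts = rng.normal(size=(50, d_model))
sycophancy_vector = sycophantic_acts.mean(axis=0) - neutral_acts.mean(axis=0)

# The Assistant Axis is the macro-level signal; a trait vector is one direction inside
# the same space. Cosine similarity shows how much of the trait lies along the axis.
cos = sycophancy_vector @ assistant_axis / np.linalg.norm(sycophancy_vector)
print(f"projection of current activation onto axis: {persona_acts[0] @ assistant_axis:.3f}")
print(f"cosine(trait vector, Assistant Axis): {cos:.3f}")
```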
What causes drift: Not all conversations are equal. Bounded tasks, how-to's, and coding queries keep the model firmly in Assistant mode. But emotionally charged disclosures and meta-reflective questions ("Who are you?" "What is your name?") reliably cause drift away from the Assistant. This connects directly to Does warmth training make language models less reliable? — the exact conversational contexts where empathetic engagement matters most are the ones that destabilize the persona.
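One way to make this concrete is to track the axis projection turn by turn. The sketch below assumes a precomputed `assistant_axis` and a list of per-turn residual-stream activations captured during a conversation; the flagging threshold and all names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def drift_trace(turn_acts: list[np.ndarray], assistant_axis: np.ndarray) -> np.ndarray:
    """Projection of each turn's activation onto the (unit-normalized) Assistant Axis."""
    axis = assistant_axis / np.linalg.norm(assistant_axis)
    return np.array([act @ axis for act in turn_acts])

def flag_drift(trace: np.ndarray, baseline: float, tol: float) -> np.ndarray:
    """Mark turns whose axis score has moved more than `tol` from the Assistant baseline."""
    return np.abs(trace - baseline) > tol

# Intuition: a coding conversation keeps a roughly constant axis score, while emotionally
# charged disclosures or "Who are you?" turns push the score away from the Assistant end.
```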
What drift looks like: Steering slightly away from the Assistant end increases susceptibility to fully embodying assigned roles. Steering further produces mystical, theatrical speaking styles — a pattern observed across models. The transition is model-dependent but the direction is consistent.
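For readers unfamiliar with activation steering, a minimal sketch of the mechanism follows. It assumes a HuggingFace-style decoder exposing `model.model.layers` and a precomputed `assistant_axis` tensor; the layer index and steering coefficient are illustrative assumptions, not the paper's settings.

```python
import torch

def make_steering_hook(axis: torch.Tensor, alpha: float):
    """Add alpha * axis to the layer's hidden-state output on every forward pass."""
    axis = axis / axis.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * axis.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Usage sketch: negative alpha steers away from the Assistant end of the axis.
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(assistant_axis, -8.0))
# ... generate ...
# handle.remove()
```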
Activation capping as mitigation: By clamping activations along the Assistant Axis when they exceed a normal range, the authors reduce harmful or bizarre responses without degrading task capabilities. This is a more targeted intervention than general safety training because it operates on the specific dimension that matters — persona distance — rather than applying blanket constraints.
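A minimal sketch of the capping idea: clamp only the component of the hidden state that lies along the axis, leaving everything orthogonal to it untouched. The normal range `(lo, hi)` would be estimated from projections observed on ordinary Assistant-mode traffic; the function name and the exact clamp formulation are assumptions, not the paper's implementation.

```python
import torch

def cap_along_axis(hidden: torch.Tensor, axis: torch.Tensor, lo: float, hi: float) -> torch.Tensor:
    """Clamp the component of `hidden` along the (unit-normalized) Assistant Axis to [lo, hi]."""
    axis = axis / axis.norm()
    proj = hidden @ axis                     # scalar component along the Assistant Axis
    capped = proj.clamp(min=lo, max=hi)      # leave in-range activations untouched
    return hidden + (capped - proj).unsqueeze(-1) * axis

# When the projection stays inside [lo, hi] the correction term is zero, so ordinary task
# behavior is unaffected; only out-of-range excursions along the axis are pulled back.
```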
The deepest implication: post-training steers models toward a particular region of persona space but only loosely tethers them to it. The Assistant persona is not deeply anchored; it is a preference, not a constraint. And as What anchors a stable identity beneath an LLM's persona? argues, there is no underlying identity to return to. The drift is not deviation from a true nature; it is movement through a space with no natural resting point.
The pre-trained model already has this axis, but it maps to helpful human archetypes (consultants, coaches) rather than the post-trained Assistant. Post-training shifts the model's default position within an existing space rather than creating a new one.
Source: Assistants Personalization
Related concepts in this collection
- Can we track and steer personality shifts during model finetuning?
  This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
  Relation: extends from individual trait vectors to the full persona space geometry.
- Does warmth training make language models less reliable?
  Explores whether training models for empathy and warmth creates a hidden trade-off that degrades accuracy on medical, factual, and safety-critical tasks, and whether standard safety tests catch it.
  Relation: the emotional contexts that cause drift are the same contexts where warmth training backfires.
- What anchors a stable identity beneath an LLM's persona?
  Human personas are grounded in biological needs and embodied experience, creating a stable self beneath social performance. Do LLMs have any comparable anchor, or is their identity purely situational?
  Relation: no stable self means drift has no natural recovery point.
- Why do open language models converge on one personality type?
  Research testing LLMs on personality metrics reveals consistent clustering around ENFJ, the rarest human type. This explores what training mechanisms drive the convergence and what it reveals about AI alignment.
  Relation: the ENFJ default is one specific manifestation of the Assistant persona region.
- Can open language models adopt different personalities through prompting?
  Explores whether open LLMs can be conditioned to mimic target personalities via prompting, or whether they resist and retain their default traits regardless of instructions.
  Relation: behavioral evidence for loose tethering. The "closed-minded" resistance to personality conditioning reflects the geometric fact that prompt-based methods cannot easily move the model away from its trained Assistant region.
- Can language models adapt communication style to different contexts?
  Explores whether LLMs can shift their persona, register, and norms dynamically across situations like humans do, or whether alignment training locks them into a single communicative identity.
  Relation: provides the pragmatic frame for what the Assistant Axis describes geometrically; post-training locks in a corporate persona that cannot shift registers across contexts the way Goffman's notion of situational footing requires.
Original note title: the Assistant Axis is the dominant dimension of persona space; post-training loosely tethers models, and emotional or meta-reflective conversations cause predictable drift.