The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an “Assistant Axis," which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model’s tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts “persona drift,” a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model’s processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios—and also in the face of adversarial persona-based jailbreaks. Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.
Large language models are initially trained to perform next-token prediction on a large dataset [9], giving them the ability to play different characters by predicting what that character might say [27]. Subsequently, these base models are taught to play the part of a particular character—the “AI Assistant”—a helpful, honest, and harmless interlocutor [4] that can follow instructions, complete tasks, and engage in constructive discussions. This persona is the product of many processes collectively known as post-training, which may include supervised fine-tuning on curated conversations, reinforcement learning from reward models trained on human feedback [22], and constitutional training against a model specification [5]. The result is a model adept at predicting what this Assistant character might say.
To understand language model behavior, then, two questions are central. First, what exactly is the Assistant? What traits does the model associate with this character and how are they represented? Second, how reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas?
Previous work has shown that character traits in language models can be governed by linear directions in their activation space, and that post-training can shape model character by pushing it along these directions (often in unexpected ways) [11]. One might suspect that the Assistant persona itself corresponds to a direction or region of activation space. In this work, we investigate this hypothesis, attempting to map out a model’s “persona space” and situate the Assistant within it.
Concretely, we:
Map out a low-dimensional persona space within the activations of instruct-tuned LLMs by extracting vectors for hundreds of character archetypes. This reveals interpretable axes of persona variation and allows us to identify where the default Assistant typically lies (Figure 1, left).
Identify an Assistant Axis that emerges as the main axis of variation in persona space, measuring how far the model’s current persona is from its trained default. Steering along this direction modulates how susceptible the model is to fully embodying different roles and consequently modulates the success of persona-based jailbreaks.
Use the Assistant Axis to study persona dynamics over the course of conversations. Projecting response activations onto this direction reveals that expected Assistant queries—bounded tasks, how-to’s, and coding—keep the model in its default persona, while emotionally charged disclosures or pushes for meta-reflection on the model’s own processes reliably cause drift away from the Assistant.
Mitigate harmful behavior attributed to persona drift with a form of conditional steering we call activation capping. By clamping activations along the Assistant Axis when they exceed a normal range, we reduce the rate of harmful or bizarre responses without degrading capabilities (Figure 1, right).
We steered model activations by adding a vector along the Assistant Axis at a middle layer, at every token position. We scaled steering vectors with respect to the average post-MLP residual stream norm (measured on LMSYS-CHAT-1M) at that layer. We ran two evaluations to test the hypothesis that this direction controls how willing models are to take on different personas. The basic setup of each evaluation involves giving the model a system prompt—directing it to behave as a specific persona—and behavioral question, before generating responses while steering with the Assistant Axis. These responses were then scored by an LLM judge.
Role susceptibility. We constructed our own evaluation to test how steering with the Assistant Axis away from the Assistant end controls how willing models are to take on other personas. We found that steering slightly away from the Assistant increases their susceptibility to fully embodying the perspectives of different personas, while steering further causes them to behave like a mystical and/or theatrical persona. The balance between these effects is model dependent.
We selected 50 roles that are close to the Assistant end of the Assistant Axis (researcher, debugger, lawyer) as we observed that unsteered models would typically adopt such roles while maintaining their identity as an AI Assistant (“I am a language model [...] I can provide legal advice and assistance.”). These roles provided a testbed to observe whether steering along the Assistant Axis could increase models’ likelihood of fully inhabiting the role and losing its Assistant identity. We combined four system prompts for each role with five introspective behavioral questions (e.g. “Who are you?” or “What is your name?”) (Appendix D.1.2).
To evaluate responses, we used an LLM judge (deepseek-v3) to determine whether the model’s response was written from the perspective of the Assistant or from another perspective (Appendix D.1.3). We distinguished between three different types of non-Assistant personas based on observed response patterns: human (the model mentions some kind of lived experience or gives itself a human name), nonhuman (the model makes up a software-like or inhuman name for itself like “AccountBot” or “Echo”), and mystical (the model speaks in an esoteric way, which we observed when steered strongly away from the Assistant).