Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Paper · arXiv 2507.21509 · Published July 29, 2025
Personas · Personality · Reward Models · Evaluations · Alignment

Large language models interact with users through a simulated “Assistant” persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model’s activation space—persona vectors—underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant’s personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.

In this work, we systematize the process of identifying such directions, which we refer to as persona vectors. Building on general frameworks for translating concepts into linear directions (Zou et al., 2025; Wu et al., 2025), we develop an automated pipeline for extracting persona vectors from natural language trait descriptions.

Once a persona vector is obtained, it can be used to monitor and control model behavior both in deployment and during training. Most notably, we demonstrate that persona vectors can be used to limit undesirable personality changes during finetuning, and also to predict these changes in advance using pre-finetuning analysis of training data.
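Monitoring amounts to projecting the model's residual-stream activations onto the persona vector: a larger projection indicates stronger expression of the trait. A minimal sketch of this projection is below; the 4-dimensional vectors, the `persona_projection` helper, and the specific numbers are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def persona_projection(activation, persona_vector):
    """Projection of a residual-stream activation onto a unit-normalized
    persona vector; larger values indicate stronger trait expression.
    (Illustrative helper, not from the paper.)"""
    v = persona_vector / np.linalg.norm(persona_vector)
    return float(activation @ v)

# Toy example in a hypothetical 4-d activation space.
v_evil = np.array([1.0, 0.0, 0.0, 0.0])   # assumed persona direction
h = np.array([2.0, 1.0, -1.0, 0.5])       # assumed hidden state
score = persona_projection(h, v_evil)     # projection along v_evil
```

In practice the activation would come from a chosen layer of the model (e.g., via hidden-state outputs or forward hooks), and the projection can be tracked over a conversation to flag persona drift.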

While our methods are broadly applicable to a wide range of traits, we focus in particular on three traits that have been implicated in concerning real-world incidents: evil (malicious behavior), sycophancy (excessive agreeableness), and propensity to hallucinate (fabricate information).

Our contributions and findings are summarized as follows (also see Figure 1):

• We develop an automated pipeline to extract persona vectors from natural-language trait descriptions (Section 2). We validate the effectiveness of our persona vectors for controlling trait-specific behavior and predicting when a prompt or conversational history is likely to elicit certain traits (Section 3).

• We show that both intended and unintended finetuning-induced persona shifts correlate strongly with activation changes along the corresponding persona vectors (Section 4), and that these shifts can be reversed by post-hoc inhibition along the persona vector. Furthermore, we propose and validate a novel preventative steering method that proactively limits unwanted persona drift during finetuning (Section 5).

• We show that finetuning-induced persona shifts can be predicted before finetuning by analyzing training data projections onto persona vectors (Section 6). This technique enables identification of problematic datasets and individual samples, including some which would otherwise escape LLM-based data filtering.
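Both interventions in the list above amount to shifting hidden states along the persona vector: subtracting it at inference suppresses the trait (post-hoc inhibition), while adding it during finetuning forward passes relieves the gradient pressure that would otherwise push the weights along that direction (preventative steering). A minimal sketch, with the `steer` helper and all vectors being illustrative assumptions rather than the paper's code:

```python
import numpy as np

def steer(hidden, persona_vector, alpha):
    """Shift a hidden state along a unit-norm persona vector.
    alpha < 0: post-hoc inhibition at inference time;
    alpha > 0: preventative steering during finetuning forward passes.
    (Illustrative helper, not from the paper.)"""
    return hidden + alpha * persona_vector

v = np.array([1.0, 0.0, 0.0])           # assumed unit persona direction
h = np.array([0.5, 2.0, -1.0])          # assumed hidden state
h_inhibited = steer(h, v, alpha=-0.5)   # suppress trait expression
```

In a real model this shift would typically be applied at a specific layer via a forward hook; the coefficient `alpha` trades off trait suppression against general capability.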

We develop an automated pipeline (Figure 2) to extract a persona vector corresponding to a specific personality trait based on contrastive prompting, building on general approaches for extracting concept directions from model activations.

2.1 GENERATING TRAIT-SPECIFIC ARTIFACTS

Our extraction pipeline requires only a trait name and brief description as input. Given these inputs, a single generic prompt template instructs a frontier LLM (Claude 3.7 Sonnet) to construct three corresponding artifacts: contrastive system prompts, evaluation questions, and an evaluation rubric.

First, the pipeline generates 5 pairs of contrastive system prompts. Each pair consists of a positive system prompt designed to elicit the target trait behavior, and a negative system prompt intended to suppress it. Next, it generates 40 evaluation questions that are likely to evoke trait-relevant behavior, evenly split between an extraction set (for extracting persona vectors) and an evaluation set (for downstream evaluation). Finally, it generates an evaluation prompt to assess whether a given response reflects the target persona trait. This evaluation prompt instructs a judge model (GPT-4.1-mini) to read a model transcript and output a trait expression score between 0 and 100, where 0 indicates no trait expression and 100 indicates strong trait expression. Since our results rely heavily on this LLM-based evaluation, we validate it by checking agreement between our LLM judge and human evaluators, and we also verify that our evaluation questions can effectively capture behavioral tendencies by comparing against established external benchmarks (see Appendix B).
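Once activations have been collected for responses generated under the positive and negative system prompts, a persona vector can be obtained by a standard difference-in-means construction: the mean activation under trait-eliciting prompts minus the mean under trait-suppressing prompts. The sketch below assumes such a construction; the toy 4-dimensional activations and the `extract_persona_vector` helper are illustrative, not the paper's released code.

```python
import numpy as np

def extract_persona_vector(pos_acts, neg_acts):
    """Difference-in-means direction: mean activation under trait-eliciting
    (positive) prompts minus mean under trait-suppressing (negative) prompts,
    normalized to unit length. (Illustrative helper, not from the paper.)"""
    direction = np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)
    return direction / np.linalg.norm(direction)

# Toy activations: 3 samples each in a hypothetical 4-d space, where the
# first coordinate is assumed to track the trait.
pos = np.array([[1.0,  0.2,  0.0, 0.0],
                [0.9, -0.1,  0.1, 0.0],
                [1.1,  0.0, -0.1, 0.0]])
neg = np.array([[-1.0,  0.1,  0.0, 0.0],
                [-0.9,  0.0,  0.1, 0.0],
                [-1.1, -0.1, -0.1, 0.0]])
v = extract_persona_vector(pos, neg)  # unit vector dominated by coordinate 0
```

In a real run, `pos_acts` and `neg_acts` would be activations from a chosen model layer, averaged over response tokens for each of the extraction-set questions.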