Psychology and Social Cognition

Can we track and steer personality shifts during model finetuning?

This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether those directions can be used to detect and control unwanted personality changes during training.

Note · 2026-02-22 · sourced from Personas Personality
What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

The Persona Vectors paper identifies linear directions in LLM activation space — "persona vectors" — that correspond to specific personality traits. The method is automated: given only a trait name and brief description, a pipeline generates contrastive system prompts, evaluation questions, and rubrics using a frontier LLM, then extracts the persona vector from model activations.
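At its core, the extraction step is a difference of means between activations collected under the contrastive prompts. A minimal numpy sketch, assuming response activations have already been recorded as arrays (all function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def extract_persona_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference of mean activations between responses generated under
    trait-eliciting and trait-suppressing system prompts, normalized to a
    unit direction so projections are comparable across traits."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Toy example: 8 responses per condition, hidden_dim = 4.
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 0.1, size=(8, 4))   # e.g. under "be sycophantic" prompts
neg = rng.normal(-1.0, 0.1, size=(8, 4))  # e.g. under "be candid" prompts
persona_vec = extract_persona_vector(pos, neg)
```

The unit normalization matters downstream: it makes projections onto the vector directly comparable across monitoring, steering, and data-analysis uses.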

The key contributions cascade:

  1. Monitoring at deployment: Persona vectors track fluctuations in the Assistant's personality in real time. A sycophancy vector, for instance, can detect when conversational context is pushing the model toward excessive agreeableness.

  2. Predicting finetuning shifts: Both intended and unintended personality changes after finetuning strongly correlate with shifts along the corresponding persona vectors. This means personality drift is not random — it moves along interpretable directions.

  3. Post-hoc correction: Personality shifts can be reversed by inhibiting the persona vector after finetuning.

  4. Preventative steering: A novel method proactively limits unwanted persona drift during finetuning, not just after.

  5. Training data analysis: Projecting training data onto persona vectors predicts which datasets — and which individual samples — will produce undesirable personality changes. This catches problematic samples that LLM-based data filtering misses.
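Monitoring, correction, and data analysis (points 1, 3, and 5) all reduce to the same linear-algebra operations on activations. A hedged numpy sketch of projection-based monitoring and post-hoc inhibition (the steering coefficient and all names are illustrative):

```python
import numpy as np

def trait_score(h: np.ndarray, persona_vec: np.ndarray) -> float:
    """Monitoring: project an activation onto the unit persona vector.
    Large values indicate drift toward the trait."""
    return float(h @ persona_vec)

def inhibit(h: np.ndarray, persona_vec: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Post-hoc correction: subtract (a fraction of) the trait component.
    alpha = 1.0 removes the component along the persona vector entirely."""
    return h - alpha * trait_score(h, persona_vec) * persona_vec

# Toy example with a hypothetical sycophancy direction.
v = np.array([1.0, 0.0, 0.0])
h = np.array([2.0, 1.0, -0.5])

score = trait_score(h, v)   # 2.0: strong component along the trait
h_fixed = inhibit(h, v)     # trait component removed, rest untouched
```

The same `trait_score` projection, applied to activations on training samples, gives the dataset-level predictor described in point 5.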

The three traits studied — evil (malicious behavior), sycophancy (excessive agreeableness), and hallucination propensity (fabrication) — have all been implicated in real-world incidents, making the practical stakes concrete.

PsychAdapter extends this beyond safety-critical traits to the full Big Five personality space. As explored in Can we control personality in language models without prompting?, adapters at every transformer layer achieve fine-grained Big Five trait control with <0.1% additional parameters — and critically, this works across multiple model architectures (not just one model family). Where persona vectors identify linear directions for specific traits, PsychAdapter demonstrates that the same architectural principle (personality encoded in activation patterns) applies at finer granularity across the full personality space. The cross-model generalization strengthens the claim that personality has a specific geometric substrate in LLMs — it is not an architecture-specific artifact.

This connects to Do personality traits activate hidden emoji patterns in language models? — both findings converge on personality having specific geometric/neural substrates in LLMs. Persona vectors work at the representation level (linear directions); the emoji study works at the neuron level (specific activations). Together they suggest personality is not diffusely distributed but structured in the model's internal geometry.

The connection to Does optimizing against monitors destroy monitoring itself? is worth noting: persona vectors could serve as a monitoring signal that is harder to obfuscate than CoT traces, because they operate in activation space rather than output space.

Style Vectors extend this to output style steering. A complementary approach computes activation-based style vectors directly from recorded layer activations during generation, then adds scaled vectors at inference time to steer sentiment, emotion, and writing style. Layers 18-20 are most effective for style transfer. Unlike persona vectors, which require contrastive prompt engineering, style vectors derive directly from observing the model's own activations during stylistically distinct outputs — a simpler extraction pipeline that trades trait-specificity for broader stylistic coverage. Together, persona vectors (trait-level monitoring and steering) and style vectors (style-level steering) suggest that multiple behavioral dimensions are independently addressable through activation-space interventions.
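The style-vector recipe can be sketched in the same numpy style; the scaling factor, layer choice, and names are assumptions for illustration only:

```python
import numpy as np

def style_vector(styled_acts: np.ndarray, neutral_acts: np.ndarray) -> np.ndarray:
    """Activation-based style vector: mean activation recorded during
    stylistically distinct generations minus the mean during neutral ones.
    No contrastive prompt engineering is needed."""
    return styled_acts.mean(axis=0) - neutral_acts.mean(axis=0)

def steer(hidden: np.ndarray, s_vec: np.ndarray, lam: float = 4.0) -> np.ndarray:
    """Inference-time steering: add the scaled style vector to the hidden
    state at a mid-to-late layer (the work cited reports layers 18-20)."""
    return hidden + lam * s_vec

# Toy example, hidden_dim = 3.
styled = np.array([[1.0, 0.0, 0.0], [1.2, 0.0, 0.0]])
neutral = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0]])
s = style_vector(styled, neutral)   # points along the first axis
h = steer(np.zeros(3), s, lam=2.0)
```

Note the contrast with the persona-vector sketch above: the inputs are activations the model produced on its own, not activations elicited by engineered prompt pairs.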

The Assistant Axis extends individual trait vectors to full persona space geometry. The Assistant Axis paper maps hundreds of character archetypes and finds they form an organized low-dimensional space where the leading component — the "Assistant Axis" — measures distance from the default Assistant persona. This reveals that individual persona vectors (sycophancy, evil, hallucination) operate within a structured space, not in isolation. Emotionally charged disclosures and meta-reflective questions ("Who are you?") reliably cause drift along this axis, while bounded tasks keep the model in its default region. Activation capping along the Assistant Axis mitigates harmful drift without degrading capabilities — a targeted intervention on the dominant dimension rather than blanket safety constraints. See How stable is the trained Assistant personality in language models? for the full analysis.
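Activation capping along a single axis is a small intervention to state precisely: clamp the component along the axis while leaving everything orthogonal untouched. A hedged sketch (the cap value and names are illustrative, not from the paper):

```python
import numpy as np

def cap_along_axis(h: np.ndarray, axis: np.ndarray, cap: float) -> np.ndarray:
    """Clamp the component of h along the (unit) Assistant Axis to
    [-cap, cap], leaving all orthogonal components untouched."""
    proj = float(h @ axis)
    clipped = max(-cap, min(cap, proj))
    return h + (clipped - proj) * axis

axis = np.array([1.0, 0.0])       # hypothetical Assistant Axis
h_drifted = np.array([5.0, 2.0])  # large drift along the axis
h_capped = cap_along_axis(h_drifted, axis, cap=1.0)  # -> [1.0, 2.0]
```

Because only the dominant dimension is constrained, activations within the cap pass through unchanged — which is why this can mitigate drift without the capability cost of blanket constraints.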


Source: Personas Personality; enriched from Cognitive Models Latent, Psychology Therapy Practice

Related concepts in this collection

Concept map
21 direct connections · 169 in 2-hop network · medium cluster


persona vectors in activation space enable monitoring and preventative steering of personality shifts during finetuning