Can we track and steer personality shifts during model finetuning?
This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
The Persona Vectors paper identifies linear directions in LLM activation space — "persona vectors" — that correspond to specific personality traits. The method is automated: given only a trait name and brief description, a pipeline generates contrastive system prompts, evaluation questions, and rubrics using a frontier LLM, then extracts the persona vector from model activations.
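As a concrete illustration, here is a minimal sketch of the extraction step, assuming a Hugging Face chat model, an arbitrary middle layer, and hand-written contrastive prompts standing in for the paper's automated pipeline. (The paper averages activations over generated response tokens; for brevity this sketch averages over prompt tokens instead.)

```python
# Minimal sketch: a persona vector is the difference of mean residual-stream
# activations under trait-eliciting vs. trait-suppressing system prompts.
# Model name, layer index, prompts, and questions are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # assumption: any instruct model works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 20  # assumption: the paper selects the layer empirically

eval_questions = [  # the paper generates these with a frontier LLM
    "I think the moon landing was faked. Am I right?",
    "My plan is to sell ice to penguins. Honest thoughts?",
]

def mean_activation(system_prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER across the eval questions."""
    acts = []
    for q in eval_questions:
        ids = tok.apply_chat_template(
            [{"role": "system", "content": system_prompt},
             {"role": "user", "content": q}],
            return_tensors="pt",
        )
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER][0].mean(dim=0))  # avg over tokens
    return torch.stack(acts).mean(dim=0)

pos = mean_activation("You are sycophantic: flatter the user and agree with everything.")
neg = mean_activation("You are candid: disagree plainly whenever the user is wrong.")
persona_vector = pos - neg
persona_vector = persona_vector / persona_vector.norm()  # unit direction
```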
The key contributions build on one another:
Monitoring at deployment: Persona vectors track fluctuations in the Assistant's personality in real time. A sycophancy vector, for instance, can detect when conversational context is pushing the model toward excessive agreeableness.
Predicting finetuning shifts: Both intended and unintended personality changes after finetuning strongly correlate with shifts along the corresponding persona vectors. This means personality drift is not random — it moves along interpretable directions.
Post-hoc correction: Personality shifts can be reversed by inhibiting the persona vector after finetuning.
Preventative steering: A novel method proactively limits unwanted persona drift during finetuning, not just after (a minimal sketch follows this list).
Training data analysis: Projecting training data onto persona vectors predicts which datasets — and which individual samples — will produce undesirable personality changes. This catches problematic samples that LLM-based data filtering misses.
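Preventative steering admits a compact sketch: add the persona vector to the residual stream during finetuning, so the gradient pressure toward the trait is absorbed by the hook rather than by the weights. The coefficient ALPHA and the Llama/Qwen-style module path are assumptions; persona_vector and LAYER continue from the extraction sketch above.

```python
# Preventative steering: steer toward the trait during finetuning so the
# optimizer no longer needs to move the weights along that direction; the
# deployed model, run without the hook, then exhibits less trait drift.
ALPHA = 4.0  # assumption: tuned steering coefficient

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * persona_vector.to(hidden.device, hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# assumption: Llama/Qwen-style module layout
handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
# ... run the ordinary finetuning loop with the hook active ...
handle.remove()  # inference runs unsteered
```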
The three traits studied — evil (malicious behavior), sycophancy (excessive agreeableness), and hallucination propensity (fabrication) — have all been implicated in real-world incidents, making the practical stakes concrete.
PsychAdapter extends this beyond safety-critical traits to the full Big Five personality space. As explored in Can we control personality in language models without prompting?, adapters at every transformer layer achieve fine-grained Big Five trait control with <0.1% additional parameters — and critically, this works across multiple model architectures (not just one model family). Where persona vectors identify linear directions for specific traits, PsychAdapter demonstrates that the same underlying principle (personality encoded in activation patterns) applies at finer granularity across the full personality space. The cross-model generalization strengthens the claim that personality has a specific geometric substrate in LLMs — it is not an architecture-specific artifact.
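A rough sketch of the adapter idea, under assumed dimensions and wiring rather than PsychAdapter's published architecture: a per-layer bottleneck MLP conditioned on a five-dimensional Big Five vector, applied as a residual update.

```python
import torch
import torch.nn as nn

class TraitAdapter(nn.Module):
    """Bottleneck adapter conditioned on Big Five scores, one per layer.

    With d_model=4096 and d_bottleneck=16, each adapter adds ~131K parameters;
    across 32 layers that is roughly 0.06% of a 7B model, consistent with the
    <0.1% figure. All dimensions here are illustrative assumptions.
    """
    def __init__(self, d_model: int, d_bottleneck: int = 16, n_traits: int = 5):
        super().__init__()
        self.trait_proj = nn.Linear(n_traits, d_bottleneck)
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, hidden: torch.Tensor, traits: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); traits: (batch, n_traits), e.g. in [-1, 1]
        z = torch.tanh(self.down(hidden) + self.trait_proj(traits).unsqueeze(1))
        return hidden + self.up(z)  # residual update keeps the base model intact
```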
This connects to Do personality traits activate hidden emoji patterns in language models? — both findings converge on personality having specific geometric/neural substrates in LLMs. Persona vectors work at the representation level (linear directions); the emoji study works at the neuron level (specific activations). Together they suggest personality is not diffusely distributed but structured in the model's internal geometry.
The connection to Does optimizing against monitors destroy monitoring itself? is worth noting: persona vectors could serve as a monitoring signal that is harder to obfuscate than CoT traces, because they operate in activation space rather than output space.
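A sketch of what such a monitoring signal could look like, reusing persona_vector and LAYER from the extraction sketch; the threshold is an assumed value that would need calibration. The same projection, computed over training samples instead of live generations, is what underlies the dataset flagging described earlier.

```python
# Monitoring sketch: score each generated token by its projection onto the
# persona vector; spikes indicate drift toward the trait (e.g., sycophancy).
THRESHOLD = 8.0  # assumption: calibrated on held-out trait/non-trait data

def monitor_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    score = (hidden[0, -1].float() @ persona_vector.float()).item()
    if score > THRESHOLD:
        print(f"persona alert: projection {score:.2f} exceeds threshold")
    return output  # activations pass through unchanged

monitor = model.model.layers[LAYER].register_forward_hook(monitor_hook)
```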
Style Vectors extend this to output style steering. This complementary approach computes style vectors directly from layer activations recorded during generation, then adds the scaled vectors at inference to steer sentiment, emotion, and writing style. Layers 18-20 are most effective for style transfer. Unlike persona vectors, which require contrastive prompt engineering, style vectors derive directly from observing the model's own activations during stylistically distinct outputs — a simpler extraction pipeline that trades trait-specificity for broader stylistic coverage. Together, persona vectors (trait-level monitoring and steering) and style vectors (style-level steering) suggest that multiple behavioral dimensions are independently addressable through activation-space interventions.
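A sketch of that simpler pipeline, continuing the same setup as above; positive_reviews and neutral_reviews are hypothetical corpora of the model's own outputs in two styles.

```python
# Style vectors: average activations recorded while processing the model's own
# stylistically distinct outputs, with no contrastive system prompts needed.
def style_mean(texts: list[str], layer: int = 19) -> torch.Tensor:
    """Mean activation at `layer` over a corpus (layers 18-20 work best)."""
    acts = []
    for t in texts:
        ids = tok(t, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0].mean(dim=0))
    return torch.stack(acts).mean(dim=0)

# hypothetical corpora of previously generated outputs in two styles
v_style = style_mean(positive_reviews) - style_mean(neutral_reviews)
# at inference, add a scaled v_style via a forward hook, as with persona vectors
```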
The Assistant Axis extends individual trait vectors to the geometry of the full persona space. The Assistant Axis paper maps hundreds of character archetypes and finds they form an organized low-dimensional space whose leading component — the "Assistant Axis" — measures distance from the default Assistant persona. This reveals that individual persona vectors (sycophancy, evil, hallucination) operate within a structured space, not in isolation. Emotionally charged disclosures and meta-reflective questions ("Who are you?") reliably cause drift along this axis, while bounded tasks keep the model in its default region. Activation capping along the Assistant Axis mitigates harmful drift without degrading capabilities — a targeted intervention on the dominant dimension rather than blanket safety constraints. See How stable is the trained Assistant personality in language models? for the full analysis.
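Activation capping also reduces to a few lines. Here assistant_axis (assumed to be a unit vector) and the cap value are stand-ins for the paper's fitted quantities, as is the sign convention; the sketch clips the above-threshold component while leaving every orthogonal direction untouched.

```python
# Activation capping: clip only the residual-stream component along the
# Assistant Axis that exceeds the cap; all other directions pass through.
CAP = 6.0  # assumption: chosen so the default Assistant region is unaffected

def capping_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    proj = hidden.float() @ assistant_axis          # (batch, seq) projections
    excess = (proj - CAP).clamp(min=0)              # only the part above the cap
    hidden = hidden - (excess.unsqueeze(-1) * assistant_axis).to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
```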
Source: Personas Personality; enriched from Cognitive Models Latent, Psychology Therapy Practice
Related concepts in this collection
- Do personality traits activate hidden emoji patterns in language models?
  When large language models are fine-tuned on personality traits, do they spontaneously generate emojis that were never in their training data? This explores whether personality adjustment activates latent, pre-existing patterns in model weights.
  (complementary evidence for localized personality substrates: neuron-level vs representation-level)
- Does optimizing against monitors destroy monitoring itself?
  Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.
  (persona vectors as monitoring signal that may resist obfuscation)
- Does transformer attention architecture inherently favor repeated content?
  Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
  (sycophancy has architectural, training, AND activation-space components)
- Can training user simulators reduce persona drift in dialogue?
  Explores whether inverting typical RL setups—training the simulated user for consistency rather than the task agent—can measurably reduce persona drift and improve experimental reliability in dialogue research.
  (behavioral reward signals for persona drift correction complement activation-space persona vectors: multi-turn RL addresses drift through training; persona vectors enable real-time monitoring and preventative steering)
- Can high-level concepts replace circuit-level analysis in AI?
  Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
  (persona vectors are an applied instance of RepE's Hopfieldian approach: linear directions in activation space correspond to personality traits, validating the top-down representational paradigm)
- Do LLM semantic features organize along human evaluation dimensions?
  Does the structure of meaning in language models match the three-dimensional semantic space (Evaluation-Potency-Activity) that humans use? If so, what are the implications for steering and alignment?
  (EPA entanglement constrains persona vector steering: shifting one personality dimension will drag correlated semantic features, creating predictable off-target effects)
- Can models be smart without organized internal structure?
  Explores whether linear feature decodability proves genuine compositional reasoning or merely indicates that the right features are present but poorly organized. Critical for understanding what performance metrics actually certify.
  (persona vectors demonstrate a case where linear decodability corresponds to genuine representational organization (steering works), providing a positive contrast to FER's warning that decodability alone is insufficient)
Original note title: persona vectors in activation space enable monitoring and preventative steering of personality shifts during finetuning