Can activation-level persona vectors predict which weight regions encode personality?

This explores whether the linear 'persona vectors' researchers find in a model's activations (the live signals during a forward pass) can tell us where personality lives in the actual weights (the stored parameters). The corpus suggests these are two different questions that the research has mostly kept separate.

The honest answer from the corpus: the work on activation vectors and the work on weights run on parallel tracks, and nobody here cleanly bridges them, but reading them against each other reveals why the bridge is hard. Activation-space research finds that traits like sycophancy or hallucination correspond to clean linear directions you can read off and even steer in real time (Can we track and steer personality shifts during model finetuning?). Related work maps a whole low-dimensional 'persona space' whose dominant axis measures how far the model has drifted from its default Assistant character, and shows you can cap activity along that axis to prevent harmful shifts (How stable is the trained Assistant personality in language models?). Notice what both do: they intervene on activations, not weights. They tell you a trait is *active*, not where it's *stored*.
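The read, steer, and cap operations described above can be sketched in a few lines. This is only an illustration under stated assumptions: the notes do not say how the persona vectors are extracted, so the difference-of-means construction below is an assumed stand-in, the function names are hypothetical, and the activations are synthetic rather than taken from any model.

```python
import numpy as np

def persona_vector(trait_acts, baseline_acts):
    # ASSUMPTION: the direction is the difference of mean activations between
    # trait-expressing and baseline samples, scaled to unit length.
    diff = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def read_trait(acts, v):
    # Reading: scalar projection of each activation row onto the direction.
    return acts @ v

def steer(acts, v, alpha):
    # Steering: shift every activation by alpha along the direction.
    return acts + alpha * v

def cap(acts, v, max_proj):
    # Capping: remove only the part of the projection that exceeds max_proj.
    excess = np.maximum(acts @ v - max_proj, 0.0)
    return acts - excess[:, None] * v

# Synthetic stand-in for hidden states: "trait" samples are offset along axis 0.
rng = np.random.default_rng(0)
d = 16
offset = np.zeros(d)
offset[0] = 3.0
baseline = rng.normal(size=(200, d))
trait = rng.normal(size=(200, d)) + offset

v = persona_vector(trait, baseline)
print("mean projection, baseline:", read_trait(baseline, v).mean())
print("mean projection, trait:   ", read_trait(trait, v).mean())
print("max projection after cap: ", read_trait(cap(trait, v, 1.0), v).max())
```

Every function here takes activations in and gives activations or scores out. No weight matrix appears anywhere, which is the point the paragraph makes about what these tools can and cannot localize.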

The weight side of the corpus tells a story that complicates any neat 'this vector points to that region' hope. PsychAdapter achieves strong personality control (87.3% Big Five accuracy) by modifying *every transformer layer* while adding less than 0.1% additional parameters (Can we control personality in language models without prompting?). That distributed footprint is the key tension: if a trait can be installed by touching all layers at once, then personality isn't a localized 'region' an activation vector could point at like an address. It's smeared across the network. So even a perfect activation reading might not resolve to a compact weight neighborhood, because the thing it's reading is the sum of many small contributions.

There is, though, a suggestive empirical thread connecting the two levels. The activation-vector work shows persona directions can *predict* personality shifts that finetuning will cause, before training even runs (Can we track and steer personality shifts during model finetuning?). Finetuning is precisely the process that edits weights. So the vector isn't predicting a static 'region' so much as predicting which way the weights will *move* under a given training pressure. That reframes your question: activation vectors may be better at forecasting weight *changes* than at localizing weight *storage*.
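A toy linear layer makes the 'forecasting changes' framing concrete. For h = W x, a weight update ΔW changes the activation by ΔW x, so the change in the projection onto a unit persona direction v is v · (ΔW x). The sketch below computes that quantity. It is not the method of the cited note: the single linear layer, the function name `projected_shift`, and the random data are all assumptions made for illustration, and a real finetuning run touches many nonlinear layers.

```python
import numpy as np

def projected_shift(delta_W, inputs, v):
    # For h = W x, replacing W with W + delta_W moves h by delta_W @ x.
    # This returns the mean change in the projection onto v over all inputs.
    # Shapes: delta_W (d_out, d_in), inputs (n, d_in), v (d_out,).
    return float((inputs @ delta_W.T @ v).mean())

rng = np.random.default_rng(0)
d_in, d_out = 8, 6

# Hypothetical unit-length persona direction in the layer's output space.
v = rng.normal(size=d_out)
v /= np.linalg.norm(v)

# Synthetic layer inputs with a nonzero mean.
inputs = rng.normal(size=(100, d_in)) + 1.0
x_mean = inputs.mean(axis=0)

# An update that writes along v, and one that writes along a direction
# orthogonal to v (w has its component along v projected out).
aligned = 0.1 * np.outer(v, x_mean)
w = rng.normal(size=d_out)
w -= (w @ v) * v
orthogonal = 0.1 * np.outer(w, x_mean)

print("shift from aligned update:   ", projected_shift(aligned, inputs, v))
print("shift from orthogonal update:", projected_shift(orthogonal, inputs, v))
```

The direction scores a proposed weight change without saying anything about where the trait already sits in W, which mirrors the distinction between forecasting weight changes and localizing weight storage.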

Why does the storage version stay so stubborn? Two notes from the corpus suggest the trait is genuinely dispositional rather than a surface feature you could pin down. The 'realizationism' work argues post-training installs stable quasi-psychologies that survive adversarial pressure and jailbreaks (Are RLHF personas performed characters or realized dispositions?), and a companion piece frames trained personas as substrate-level dispositions rather than performances (Are LLM personas realized or merely simulated through training?). If personality is a robust disposition baked deep into the substrate, it's more plausibly distributed than localized, which is exactly the picture PsychAdapter's all-layers approach paints from the engineering side.

The thing worth walking away with: the field currently has good tools for *reading* and *steering* personality in activation space, and good tools for *installing* it across weights. But the inverse problem you're asking about, using the activation signal as a map back to the weights, isn't solved in this collection, and the distributed-installation evidence hints it may not have a tidy solution. The more tractable and arguably more useful target is what the persona-vector monitoring work already does: predict how weights will *shift* during finetuning, and intervene before the drift happens.

Sources (5 notes)

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
