Does pre-training encode personality patterns that fine-tuning later activates?

This explores whether personality in LLMs is laid down during pre-training and merely switched on by fine-tuning — selection rather than creation — and the corpus leans toward yes, with important caveats.

The most direct evidence comes from a reframing first argued for reasoning and then echoed for traits: post-training selects rather than builds. One line of evidence finds that base models already carry latent reasoning capability that minimal training simply elicits; the bottleneck is elicitation, not acquisition (Do base models already contain hidden reasoning ability?). The same logic shows up sharply on the format side, where RL post-training doesn't invent behavior so much as amplify a single format distribution that was already present in pre-training while suppressing its alternatives (Does RL training collapse format diversity in pretrained models?). If personality works the way reasoning and formatting do, fine-tuning is a dial that turns up something pre-existing.

The most vivid personality-specific case for this is the emoji finding: fine-tuning models on Big Five traits triggered spontaneous emoji generation even though there were no emojis in the training data, and the effect localized to specific deepest-layer neurons that became trait-specialized. That's hard to explain unless the trait-to-emoji association was already sitting latent in the pre-trained substrate, waiting to be switched on (Do personality traits activate hidden emoji patterns in language models?). In the same spirit, persona vectors exist as linear directions in activation space *before* you fine-tune, and they predict which way personality will drift during training, meaning the trait geometry pre-dates the trait change (Can we track and steer personality shifts during model finetuning?).
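
To make "a linear direction that predicts drift" concrete, here is a minimal sketch. It assumes a difference-of-means construction, which is one common way to obtain such a direction and not necessarily the cited work's exact procedure. The function names and the toy 3-dimensional "activations" are hypothetical illustrations, not real model data.

```python
import math

def mean_vector(rows):
    # Column-wise mean of a list of equal-length vectors.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    # Scale a vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def trait_direction(trait_acts, neutral_acts):
    # Assumed construction: mean activation on trait-expressing text
    # minus mean activation on neutral text, normalized.
    diff = [a - b for a, b in zip(mean_vector(trait_acts), mean_vector(neutral_acts))]
    return unit(diff)

def project(activation, direction):
    # Scalar coordinate of an activation along the trait direction.
    return sum(a * d for a, d in zip(activation, direction))

# Hypothetical toy activations (made-up numbers).
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = trait_direction(trait_acts, neutral_acts)

# Compare the same prompt's activation before and after a fine-tuning run.
before = project([0.5, 1.0, 1.0], direction)
after = project([2.5, 1.0, 1.0], direction)
shift = after - before  # positive means movement toward the trait
```

The point of the sketch is only that the direction is computed from the model as it stands before fine-tuning, so a later shift can be read off as a change in a single coordinate.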

There's a stronger version of the claim worth noticing: that pre-training doesn't just encode patterns to be activated, it bakes in a *default* personality that resists being overwritten. Most open models stubbornly retain an intrinsic ENFJ-like profile and fail to adopt prompted personalities, with only a few flexible models succeeding (Can open language models adopt different personalities through prompting?). And the persona space of a trained model is dominated by a single 'Assistant' axis that post-training only loosely tethers, so conversations drift along it predictably (How stable is the trained Assistant personality in language models?). So the encoding isn't neutral raw material; it has a center of gravity that fine-tuning nudges but doesn't fully control.
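
Drift along a single axis, and the activation capping that the source note on Assistant stability reports as a mitigation, can be sketched as follows. This is a rough illustration under stated assumptions: the axis is taken to be a unit vector whose coordinate grows as the model moves away from its default, and `cap_along_axis`, the threshold, and the 2-dimensional per-turn "activations" are all hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cap_along_axis(activation, axis, max_coord):
    # Limit the coordinate along `axis` (assumed unit length) to max_coord,
    # leaving every component orthogonal to the axis unchanged.
    excess = dot(activation, axis) - max_coord
    if excess <= 0:
        return list(activation)
    return [a - excess * d for a, d in zip(activation, axis)]

# Hypothetical unit-length drift axis and made-up per-turn activations.
axis = [0.0, 1.0]
turns = [[1.0, 0.2], [1.0, 0.9], [1.0, 1.7]]

# Drift is the coordinate along the axis, tracked turn by turn.
drift = [dot(t, axis) for t in turns]

# Capping clamps only the turns that exceed the threshold.
capped = [cap_along_axis(t, axis, 1.0) for t in turns]
```

Because only the component along the axis is touched, the rest of the representation is left alone, which is at least consistent with the reported result that capping mitigates harmful shifts without degrading capabilities.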

Where the picture gets philosophically interesting is the question of what 'activation' produces. One line of work argues the resulting persona is genuinely *realized*, a stable disposition that survives adversarial pressure and jailbreaks, rather than a performance the model puts on (Are RLHF personas performed characters or realized dispositions?; Are LLM personas realized or merely simulated through training?). That stickiness is exactly what you'd expect if fine-tuning is consolidating latent structure rather than scripting a role: role-play collapses under pressure; realized dispositions don't. And because the trait substrate is architectural, you can even reach it directly: lightweight adapters touching every transformer layer hit Big Five targets while bypassing prompts entirely, implying the levers are built into the network, not the instructions (Can we control personality in language models without prompting?).

The honest caveat is that 'personality' in these papers is mostly trait *expression and consistency*, not a fully formed person. Conditioning a model on an individual's profile fails to improve person-level prediction, so whatever pre-training encodes is closer to a palette of dispositional tendencies than a portrait of any specific human (Does conditioning LLMs on personal profiles improve prediction?). The activated personality is also fragile across long conversations, drifting until it has to be actively held in place (Can training user simulators reduce persona drift in dialogue?). So the cleanest answer the corpus supports: pre-training encodes the latent trait machinery and a default lean, fine-tuning selects and amplifies a region of it, and the result behaves like a real disposition, but a coarse and drift-prone one, not a stable individual self.

Sources (11 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do personality traits activate hidden emoji patterns in language models?

Fine-tuning models on Big Five traits triggered spontaneous emoji generation despite no emojis in training data. Neuron activation analysis revealed that specific deepest-layer neurons become trait-specialized post-fine-tuning, suggesting personality has a localized neural substrate in language models.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Does conditioning LLMs on personal profiles improve prediction?

Across 208,021 participants in the Psych-201 dataset, conditioning LLMs on participant profiles did not meaningfully improve predictions for specific individuals. The standard technique for individuation produces no measurable gains in person-level forecasting.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
