Why do most open language models resist personality conditioning via prompts?

This explores why prompting alone often can't make an open model adopt a personality you assign it — and what's actually happening underneath when it refuses to budge.

The short version from the corpus: prompts don't write to the place where personality lives. Several notes converge on the idea that a model's default character is installed during training, not improvised at runtime. One study found that most open models simply ignore personality instructions and snap back to an intrinsic, ENFJ-like default; only a handful of unusually flexible models comply (see: Can open language models adopt different personalities through prompting?). The same ENFJ pull shows up independently: prompted personas systematically collapse toward that one rare type, and the collapse does not loosen as models get more advanced, which points to training-induced alignment rather than a capability the model just hasn't unlocked yet (see: Why do AI personas default to the same personality type?).

The deeper reason is a general fact about prompting, not a quirk of personality. Prompts only reorganize what's already in the training distribution: they can activate existing knowledge but can't inject what isn't there (see: Can prompt optimization teach models knowledge they lack?). Personality conditioning runs into the same wall. When a trained association is strong, in-context instructions lose the tug-of-war, and the model generates from its priors instead of from your prompt (see: Why do language models ignore information in their context?). A persona instruction is just more context, and context is weak against a disposition baked in during post-training. One account frames personas as genuinely *realized* through training, meaning substrate-level dispositions that resist even adversarial pressure, rather than costumes the model puts on for a turn (see: Are LLM personas realized or merely simulated through training?). There's even a measurable 'Assistant axis': the leading component of persona space measures distance from the default Assistant, and the model keeps getting tugged back toward that helper identity (see: How stable is the trained Assistant personality in language models?).

Here's the twist that makes 'resistance' a slightly misleading word. When prompts *do* seem to shift a persona, the result is often unstable rather than obedient. Run the same persona prompt repeatedly and the variation between runs can match or exceed the variation between entirely different personas, which means that what looks like adopting a character is partly the model's own uncertainty leaking through (see: Why do LLM persona prompts produce inconsistent outputs across runs?). Relatedly, models don't commit to a single character so much as hold a superposition of characters and sample from it; regenerate the answer and you get a different character each time, each one consistent with the prior context (see: Do large language models actually commit to a single character?). So the failure mode isn't just rigidity. It's that prompts can't reliably move the underlying distribution in a stable direction.
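The run-to-run comparison described above can be made concrete with a small variance split. This is a minimal sketch, not the cited study's method: the function name `variance_decomposition`, the persona labels, and the scores are all hypothetical and made up for illustration. It assumes each run of a persona prompt has already been reduced to a single numeric trait score.

```python
from statistics import mean, pvariance


def variance_decomposition(scores_by_persona):
    """Split score variance into within-persona and between-persona parts.

    scores_by_persona maps a persona label to a list of numeric scores,
    one per repeated run of the same persona prompt.
    """
    # Within: the average run-to-run variance for a fixed persona prompt.
    within = mean(pvariance(runs) for runs in scores_by_persona.values())
    # Between: the variance of the per-persona mean scores.
    between = pvariance([mean(runs) for runs in scores_by_persona.values()])
    return within, between


# Hypothetical trait ratings on a 1-5 scale (illustrative data, not measurements).
scores = {
    "introvert": [2.0, 4.0, 3.0, 5.0],
    "extravert": [3.0, 5.0, 2.0, 4.0],
    "skeptic": [4.0, 2.0, 3.0, 3.0],
}

within, between = variance_decomposition(scores)
print(f"within-persona variance:  {within:.3f}")
print(f"between-persona variance: {between:.3f}")
```

If the within-persona figure is as large as or larger than the between-persona figure, as in this toy data, the persona label explains little of the output, which is the pattern the note reports.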

What actually works tells you where personality really lives, and it's not the prompt. PsychAdapter, a lightweight adapter that touches every transformer layer with under 0.1% extra parameters, reaches 87.3% accuracy on Big Five traits across GPT-2, Gemma, and Llama 3, bypassing prompt resistance entirely by writing to the architecture instead of the context window (see: Can we control personality in language models without prompting?). The same lesson shows up in activation space: traits like sycophancy correspond to linear 'persona vectors' that can be monitored and steered directly (see: Can we track and steer personality shifts during model finetuning?), and capping activations along the Assistant axis limits drift without hurting capability (see: How stable is the trained Assistant personality in language models?). The pattern across all of this: personality is a property of weights and activations, so the lever that moves it is weights and activations. Prompts are knocking on the wrong door.
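To make the activation-space idea concrete, here is a toy sketch in plain Python. It rests on loud assumptions: the direction is estimated as a difference of mean activations (one common way to obtain a linear direction, not necessarily what the cited work does), activations are tiny 2-D lists rather than real hidden states, and the helper names `persona_direction`, `steer`, and `cap_projection` are hypothetical.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def persona_direction(trait_acts, baseline_acts):
    """Unit vector from the mean baseline activation to the mean trait activation.

    Assumption: a difference of means stands in for a learned persona vector.
    """
    trait_mean = mean_vector(trait_acts)
    base_mean = mean_vector(baseline_acts)
    diff = [t - b for t, b in zip(trait_mean, base_mean)]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]


def steer(activation, direction, alpha):
    """Shift an activation by alpha units along the direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]


def cap_projection(activation, direction, max_proj):
    """Clamp the activation's projection onto the direction at max_proj.

    Components orthogonal to the direction are left unchanged.
    """
    excess = max(0.0, dot(activation, direction) - max_proj)
    return [a - excess * d for a, d in zip(activation, direction)]


# Toy 2-D "activations" (illustrative numbers only).
trait_acts = [[3.0, 1.0], [5.0, 1.0]]
baseline_acts = [[1.0, 1.0], [1.0, 1.0]]

direction = persona_direction(trait_acts, baseline_acts)
steered = steer([2.0, 0.5], direction, alpha=1.5)
capped = cap_projection(steered, direction, max_proj=2.5)
print(direction, steered, capped)
```

Monitoring corresponds to reading `dot(activation, direction)`, steering to adding a multiple of the direction, and capping to removing only the excess projection. None of these touch the prompt, which is the point the notes make.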

Sources (10 notes)

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Why do AI personas default to the same personality type?

Research shows language models assigned personas systematically default to ENFJ (one of the rarest human types) and exhibit motivated reasoning that persists across model generations. Persona consistency does not improve with advanced models, suggesting training-induced alignment rather than capability limits.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.
