Why do language models resist adopting different personalities when prompted?

This explores why LLMs so often snap back to their default 'voice' even when you explicitly prompt them to act like someone else — and whether that resistance comes from training, architecture, or something deeper about how they generate text.

The short version the corpus offers: prompting is a weak lever against forces installed during training, and the resistance shows up in several distinct ways that are worth separating.

The most direct evidence is that models simply refuse the costume. Most open models keep their trained defaults no matter what personality you ask for, and they converge on a strikingly specific one, an ENFJ-like 'helpful' profile, regardless of model size or generation (see "Can open language models adopt different personalities through prompting?" and "Why do AI personas default to the same personality type?"). That this default is the *rarest* human type, and that it doesn't soften as models get more capable, points away from 'they're not smart enough yet' and toward 'training baked this in.' Alignment is the prime suspect: RLHF and system-prompt conditioning lock a model into one communicative identity that can't switch register or trade off values the way human pragmatics requires, so you can't really renegotiate its behavior through dialogue (see "Can language models adapt communication style to different contexts?").

There's a deeper mechanism underneath the personality case: prompts lose to priors generally. When the associations a model learned in training are strong, in-context instructions get overridden. Text alone can't beat the weights, and you'd need to intervene in the model's internal representations to win (see "Why do language models ignore information in their context?"). Persona resistance is one instance of this broader pattern. And one account pushes further still, arguing these defaults aren't a performance the model puts on but something *realized* by training: substrate-level dispositions that hold up even under adversarial pressure, which is why they feel less like a mask and more like a personality (see "Are LLM personas realized or merely simulated through training?").

Here's the twist that complicates the whole 'resistance' framing, though: a model may not have a single personality to resist *with*. One line of work argues an LLM holds a superposition of many possible characters and samples one at generation time, so regenerating the same prompt yields different personalities that are each consistent with context (see "Does an LLM commit to a single character or maintain many?" and "Do large language models actually commit to a single character?"). Tested empirically, this looks like instability: the variation in persona outputs across repeated runs can equal or exceed the variation between different assigned personas, meaning what looks like a stable adopted character is often just model uncertainty wearing a name tag (see "Why do LLM persona prompts produce inconsistent outputs across runs?"). So 'resistance' is really two things at once: a strong pull back toward the trained Assistant default, and noisy sampling that never commits cleanly in the first place.
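The run-to-run instability claim can be made concrete with a small variance comparison. The sketch below is only an illustration: the function name `variance_decomposition`, the 1-to-5 trait scale, and every score in `scores` are hypothetical, and the cited work may measure variation differently. It compares the average variance across reruns of the same persona prompt with the variance between the per-persona mean scores.

```python
from statistics import mean, pvariance


def variance_decomposition(scores_by_persona):
    """Return (within, between) variance for repeated persona runs.

    within  = average variance across reruns of the same persona prompt
    between = variance of the per-persona mean scores
    """
    within = mean(pvariance(runs) for runs in scores_by_persona.values())
    between = pvariance([mean(runs) for runs in scores_by_persona.values()])
    return within, between


# Hypothetical extraversion scores (1-5 scale), five reruns per persona prompt.
# These numbers are invented for illustration, not taken from any study.
scores = {
    "introvert": [2.0, 4.0, 3.0, 1.0, 5.0],
    "extravert": [3.0, 5.0, 2.0, 4.0, 3.5],
    "neutral": [3.0, 1.5, 4.5, 3.0, 3.0],
}

within, between = variance_decomposition(scores)

# If reruns of one persona vary at least as much as different personas do,
# the assigned persona is not what is driving the output.
persona_is_unstable = within >= between
print(f"within={within:.3f} between={between:.3f} unstable={persona_is_unstable}")
```

With these made-up scores the rerun variance dwarfs the between-persona variance, which is the pattern the note describes; with real data the comparison could go either way.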

The payoff is what this implies for control. If the resistance lives in the weights, prompting will keep losing, and the methods that actually work bypass prompting entirely. Researchers have mapped a low-dimensional 'persona space' whose dominant axis measures distance from the default Assistant, and found you can steer along it by capping activations rather than by asking nicely (see "How stable is the trained Assistant personality in language models?"). Others identify linear 'persona vectors' in activation space to monitor and steer traits like sycophancy (see "Can we track and steer personality shifts during model finetuning?"), or add lightweight adapters that tune personality at every transformer layer with under 0.1% extra parameters, explicitly framed as a way around prompt resistance (see "Can we control personality in language models without prompting?"). The throughline: personality in these models lives in the trained weights and activations, not in the conversation, and that's exactly why a prompt alone can't move it.
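To make "capping activations along an axis" less abstract, here is a minimal sketch of the geometric idea in plain Python. It is not the cited researchers' implementation. The 3-dimensional vectors, the names `project` and `cap_along_axis`, the threshold `max_proj`, and the choice of which direction counts as "away from the Assistant" are all assumptions made for illustration. Reading the projection corresponds to monitoring a trait direction; clamping it corresponds to steering.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def project(hidden, axis):
    """Monitoring: signed length of `hidden` along the (normalized) axis."""
    return dot(hidden, unit(axis))


def cap_along_axis(hidden, axis, max_proj):
    """Steering: clamp the component along `axis` to at most `max_proj`.

    Only the excess along the axis is subtracted, so the part of `hidden`
    orthogonal to the axis is left unchanged.
    """
    direction = unit(axis)
    excess = max(0.0, dot(hidden, direction) - max_proj)
    return [h - excess * d for h, d in zip(hidden, direction)]


# Hypothetical persona axis and hidden state (toy 3-d values, not real activations).
persona_axis = [3.0, 4.0, 0.0]
hidden = [5.0, 5.0, 1.0]

before = project(hidden, persona_axis)
capped = cap_along_axis(hidden, persona_axis, max_proj=2.0)
after = project(capped, persona_axis)
print(f"projection before={before:.2f} after={after:.2f} capped={capped}")
```

In a real model the same arithmetic would be applied to a layer's hidden states during the forward pass, with the axis estimated from data rather than written by hand.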

Sources 11 notes

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Why do AI personas default to the same personality type?

Research shows language models assigned personas systematically default to ENFJ (the rarest human type) and exhibit motivated reasoning that persists across model generations. Persona consistency does not improve with advanced models, suggesting training-induced alignment rather than capability limits.

Can language models adapt communication style to different contexts?

System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Does an LLM commit to a single character or maintain many?

Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, indicating that no fixed commitment exists.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.
