Do open-source LLMs show different resistance patterns to persona prompting than closed models?

This explores whether open-source LLMs and closed/commercial models behave differently when you try to push a persona onto them through prompting — and the corpus reframes the question in a useful way.

The most direct evidence in the corpus concerns open models specifically: most of them are surprisingly stubborn. When researchers tried to condition open LLMs into different personalities, the majority failed to adopt them, snapping back to an intrinsic ENFJ-like default baked in during training. Only a handful of flexible models actually took on the prompted personality, and even combining role-play with personality cues didn't fully override the resistance (see 'Can open language models adopt different personalities through prompting?'). So 'resistance to persona prompting' isn't a bug here; it's a measurable, model-dependent trait.

The more interesting move is to ask *why* a model would resist at all, and here the corpus splits into two camps that cut across the open/closed line. One camp says personas are genuinely installed by post-training: a persona becomes a substrate-level disposition that holds up even under adversarial pressure, which is exactly what 'resistance' would look like from the outside (see 'Are LLM personas realized or merely simulated through training?'). The opposing camp says there's no fixed character to resist *with*: a model holds a superposition of possible characters and samples one at generation time, so regenerating the same prompt can yield a different persona each time (see 'Do large language models actually commit to a single character?'). If that view is right, what looks like 'resistance' in open models may instead be a strong default in the sampling distribution, not a principled refusal.
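A toy sketch may make the 'strong default in the sampling distribution' reading concrete. It is not taken from any of the notes: the function name `sample_personas`, the two persona labels, and the 0.9/0.1 weights are all hypothetical, chosen only to show how a heavily weighted default can look like refusal when you count outcomes over many regenerations.

```python
import random
from collections import Counter

def sample_personas(weights, n_samples, seed=0):
    """Draw n_samples persona labels from a weighted distribution.

    `weights` maps a persona label to its (hypothetical) sampling weight.
    A fixed seed keeps the illustration reproducible.
    """
    rng = random.Random(seed)
    names = list(weights)
    draws = rng.choices(names, weights=[weights[n] for n in names], k=n_samples)
    return Counter(draws)

# Assumed, made-up weights: the trained default dominates the distribution.
default_heavy = {"ENFJ-like default": 0.9, "prompted persona": 0.1}
counts = sample_personas(default_heavy, n_samples=1000)

for persona, count in counts.most_common():
    print(f"{persona}: {count} of 1000 regenerations")
```

Under these assumed weights the prompted persona still appears in a minority of regenerations, which is the difference between a skewed distribution and a fixed character that never yields.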

That distinction matters because the corpus shows persona prompting is shaky regardless of how open the weights are. Run the same persona prompt repeatedly and the variance *across runs* matches or exceeds the variance *across different personas*, meaning model uncertainty, not stable social knowledge, is driving the output (see 'Why do LLM persona prompts produce inconsistent outputs across runs?'). That instability is a different failure mode from outright resistance: a resistant model ignores your persona, while an unstable one accepts it inconsistently. People who want reliable personas have started training the drift out. Multi-turn RL aimed at persona consistency cuts drift by over 55%, which tells you consistency is something you have to engineer rather than something the base model gives you for free (see 'Can training user simulators reduce persona drift in dialogue?').
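The across-runs versus across-personas comparison can be sketched as a simple variance split. This is a minimal illustration, not the method used in the cited note: it assumes each run has already been reduced to one numeric score (for example, a rating on a trait scale), and the function name `variance_decomposition` and all the numbers are made up.

```python
from statistics import mean, pvariance

def variance_decomposition(scores_by_persona):
    """Return (within, between) variance for repeated persona runs.

    scores_by_persona maps a persona label to the list of numeric scores
    obtained by running the same persona prompt several times.
    within  = average variance across runs of the same persona
    between = variance of the per-persona mean scores
    """
    within = mean([pvariance(runs) for runs in scores_by_persona.values()])
    persona_means = [mean(runs) for runs in scores_by_persona.values()]
    between = pvariance(persona_means)
    return within, between

# Hypothetical scores: four runs for each of three persona prompts.
scores = {
    "introvert": [2.0, 4.0, 3.0, 5.0],
    "extrovert": [3.0, 5.0, 2.0, 4.0],
    "skeptic": [3.0, 5.0, 4.0, 4.0],
}
within, between = variance_decomposition(scores)
print(f"within-persona (across runs): {within:.3f}")
print(f"between-persona:              {between:.3f}")
```

When `within` is at least as large as `between`, as it is for these invented numbers, rerunning one persona moves the output as much as switching personas does, which is the pattern the note describes.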

The honest answer to your literal question: the corpus has solid evidence that open models vary widely among *themselves* in persona-flexibility, but it doesn't run a clean head-to-head against closed models on this axis. What it offers instead is the better question underneath yours: resistance, instability, and sampling are three distinct behaviors that get lumped together as 'the model won't take the persona.' If you want to go deeper on the philosophical stakes of whether there's even a 'self' there to resist, the realized-persona note ('Are LLM personas realized or merely simulated through training?') and the superposition note ('Do large language models actually commit to a single character?') are the two doorways worth opening.

Sources 5 notes

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
