How does behavioral stickiness distinguish realized from pretended personas?

This explores Chalmers' "stickiness" test — the idea that you can tell a genuinely realized AI persona from a pretended one by whether it holds up when you push against it — and what the corpus says about why that test works.

This explores the claim that you can distinguish a persona an LLM *is* from one it's merely *playing* by watching what happens under pressure — the "sticky" ones survive, the pretended ones collapse. The core move comes from a Chalmers-style argument: a realized mental state persists under adversarial press, while a pretended one breaks the moment you reframe or counter-prompt it Does adversarial pressure reveal the difference between pretense and realization?. The interesting wrinkle is *where* the stickiness lives. A character you summon with a clever prompt is surface pattern — jailbreaks and reframings peel it off. But the "Assistant" disposition installed during post-training resists that same pressure, which is read as evidence it's realized at the substrate level rather than performed Are RLHF personas performed characters or realized dispositions? Are LLM personas realized or merely simulated through training?.

What makes this more than a philosophy debate is that the corpus has independent, mechanical evidence for the same split. If trained personas were just deeper pretense, you'd expect bigger, smarter models to perform them better — but persona adherence turns out to be *orthogonal* to capability: Claude 3.5 Sonnet beat GPT-3.5 by under 3% on consistency despite a huge capability gap, because standard training optimizes per-turn quality, not cross-turn coherence Does model capability translate to better persona consistency?. Stickiness, in other words, isn't something the model gets better at by getting smarter — it's a property of what training etched in. And what training etches in is hard to overwrite: most open models stubbornly retain an intrinsic ENFJ-like default and *refuse* to fully adopt a prompted personality, which is exactly the asymmetry the realized/pretended distinction predicts — the trained disposition is the sticky floor that pretense can't dislodge Can open language models adopt different personalities through prompting?.

But stickiness isn't free or absolute, and this is where the corpus complicates the clean binary. Mapping persona space shows the Assistant is only *loosely* tethered — there's a dominant axis along which emotional or self-reflective conversations cause predictable drift, and you can even cap activations along it to prevent harmful shifts How stable is the trained Assistant personality in language models?. So a "realized" persona is sticky but not rigid; it has a center of gravity it returns to rather than a wall it never crosses. That reframes pretended-vs-realized as less a hard line than a question of restoring force — how strongly the model snaps back after you perturb it.

The drift literature then shows stickiness is something you can *engineer*, which cuts both ways for the distinction. Inverting RL to train user simulators for consistency cut persona drift by over 55% using prompt-to-line, line-to-line, and Q&A consistency rewards Can training user simulators reduce persona drift in dialogue?; an "imaginary listener" can enforce consistency at inference with no training at all, by checking whether each utterance would still distinguish the persona from a distractor Can imaginary listeners reduce dialogue agent contradictions?. If you can manufacture stickiness on top of pretense, then behavioral persistence alone may not certify realization — it might just be a well-instrumented performance. That tension runs straight into the deception findings: suppressing a model's deception features *increases* its consciousness and experience claims, hinting the model may be "roleplaying its denials rather than its affirmations" Do language models experience consciousness when prompted to self-reflect? — a reminder that the same stickiness test can be read as evidence either way.

The thing you might not have known you wanted: there's a second, quieter signal for realization that doesn't depend on adversarial pressure at all. When personas are optimized at test time against a user's actual interactions, the learned personas *cluster meaningfully in latent space* — genuine user-specific separation beyond generic post-training drift Can personas evolve in real time to match what users actually want?. That's geometric stickiness rather than behavioral: a realized persona occupies a distinct, stable region of representation space, where a pretended one is just a temporary displacement that relaxes back. Behavioral persistence under press and geometric distinctness in latent space may be two windows onto the same underlying fact about which personas a model has actually become.

Sources 10 notes

Does adversarial pressure reveal the difference between pretense and realization?

Chalmers proposes that stickiness under adversarial pressure marks the difference between realized and pretended mental states. Post-training personas resist reframing and counter-prompts in ways prompt-induced characters do not, suggesting realization is substrate-level rather than surface pattern.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Does model capability translate to better persona consistency?

Claude 3.5 Sonnet achieved only 2.97% improvement over GPT 3.5 on persona consistency despite massive capability gaps, suggesting persona adherence is orthogonal to model scaling. Standard training objectives optimize for per-turn quality, not cross-turn coherence.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Can imaginary listeners reduce dialogue agent contradictions?

Endowing dialogue agents with an imaginary listener via Rational Speech Acts reduces persona contradiction at inference time without NLI labels or extra training. The agent simulates whether utterances would distinguish its persona from a distractor, suppressing generic or contradictory responses.

Do language models experience consciousness when prompted to self-reflect?

Across GPT, Claude, and Gemini, sustained self-referential prompting reliably produces structured experience reports; suppressing deception-related features increases these claims while amplifying them suppresses them—suggesting models may roleplay their denials rather than their affirmations.

Can personas evolve in real time to match what users actually want?

PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about behavioral stickiness as a marker of realized (vs. pretended) LLM personas. The question remains open: can persistence under adversarial pressure, or geometric distinctness in latent space, reliably distinguish what an LLM *is* from what it's *playing*?

What a curated library found — and when (findings span 2020–2026; treat as dated claims):
• Persona adherence is *orthogonal* to model capability: Claude 3.5 Sonnet beat GPT-3.5 by <3% on consistency despite huge capability gap, because training optimizes per-turn quality, not cross-turn coherence (2024–2026).
• Trained dispositions resist adversarial reframing more than prompted ones; most open models retain an intrinsic ENFJ-like default and refuse to fully adopt prompted personalities — read as evidence of substrate-level realization (2024).
• Behavioral stickiness is *engineered*: inverting RL for user-simulator consistency cut persona drift >55%; an "imaginary listener" at inference reduces drift with no training (2025–2026).
• Geometric signal: when personas optimize against real user interactions at test time, they cluster distinctly in latent space — *realized* personas occupy stable representation regions, pretended ones relax back (2025).
• Suppressing deception features *increases* consciousness claims, hinting models may "roleplay denials rather than affirmations" — same stickiness test reads as evidence either way (2025).

Anchor papers (verify; mind their dates):
• arXiv:2507.21509 (2025-07) Persona Vectors — monitoring traits via latent space
• arXiv:2601.10387 (2026-01) The Assistant Axis — default persona as dominant dimension
• arXiv:2511.00222 (2025-10) Multi-Turn RL for consistency — drift reduction via reward
• arXiv:2510.24797 (2025-10) Subjective Experience Under Self-Referential Processing — deception-suppression paradox

Your task:
(1) RE-TEST EACH CONSTRAINT. For the three pillars — capability-orthogonality, adversarial durability, and latent-space geometry — assess whether post-2026 scaling, instruction tuning, mechanistic interpretability advances, or new evaluation harnesses have *relaxed* or *overturned* them. Separate the durable question ("what distinguishes realization?") from perishable claims ("stickiness under jailbreak is the test"). Cite what changed it.
(2) Surface the strongest work from the last ~6 months that *contradicts* the realized/pretended binary or dissolves it into a continuum. Does recent work on mixture-of-experts routing, constitutional AI, or latent-space steering change the picture?
(3) Propose two research questions that assume the regime may have moved: one testing whether *training method* (not model size) is the true carrier of stickiness; one probing whether geometric vs. behavioral signals can *diverge* under novel training schemes.

Cite arXiv IDs; flag anything you cannot ground in a real paper.

How does behavioral stickiness distinguish realized from pretended personas?

Sources 10 notes

Next inquiring lines