INQUIRING LINE

Does alignment training intensity push LLM personas from pretense toward realization?

This explores whether the strength of alignment training (RLHF, post-training, safety tuning) is what turns an LLM's persona from a costume it's wearing into a stable disposition it actually has — and what the corpus says about that 'pretense → realization' transition.


The corpus has a surprisingly direct answer: several notes argue that realization isn't a matter of degree at all, but a side effect of post-training installing something sticky. The 'realizationist' view holds that RLHF-trained personas are realized quasi-psychologies rather than sustained pretense. What marks the difference is not how hard you trained, but that the resulting dispositions persist under adversarial pressure and don't collapse under jailbreaks the way prompt-induced role-play does [Are RLHF personas performed characters or realized dispositions?; Are LLM personas realized or merely simulated through training?]. On that account, 'intensity' is the wrong axis: stickiness is the tell, not effort.

But there's a tension worth sitting with. If you read 'realization' as 'becomes a single committed character,' another note pushes back hard: an LLM is better understood as a non-deterministic simulator holding a *superposition* of many possible characters, narrowing toward one only as the conversation proceeds [Does an LLM commit to a single character or maintain many?]. So what training installs may be less a person and more a heavily weighted prior over personas. That reframes the question: alignment doesn't realize *a* self so much as collapse the distribution toward the assistant-shaped corner of it.
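The 'weighted prior over personas' reading can be made concrete with a toy Bayesian update. The sketch below is illustrative only: the persona labels, the prior weights, and the 3:1 likelihood ratio are invented assumptions, not values from any of the notes, and a real model's persona distribution is not directly observable like this. It simply shows how the same conversational evidence can flip a near-uniform distribution while leaving a strongly skewed one pointing at the assistant.

```python
import math

# Hypothetical persona labels; chosen for illustration, not taken from the notes.
PERSONAS = ["assistant", "villain", "poet", "skeptic"]


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def entropy(dist):
    """Shannon entropy in bits; lower means the distribution is more collapsed."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def update(dist, likelihood):
    """One Bayesian step: reweight each persona by how well it explains a turn."""
    return normalize({name: dist[name] * likelihood[name] for name in dist})


def run(prior, likelihood, turns):
    """Apply the same evidence for a number of conversational turns."""
    dist = prior
    for _ in range(turns):
        dist = update(dist, likelihood)
    return dist


# Assumption: a base model starts roughly uniform over personas.
base_prior = normalize({name: 1.0 for name in PERSONAS})

# Assumption: post-training acts like a prior skewed 20:1 toward the assistant.
aligned_prior = normalize(
    {"assistant": 20.0, "villain": 1.0, "poet": 1.0, "skeptic": 1.0}
)

# Assumption: each turn is evidence favoring the villain persona 3:1.
villain_turn = {"assistant": 1.0, "villain": 3.0, "poet": 1.0, "skeptic": 1.0}

base_after = run(base_prior, villain_turn, turns=2)
aligned_after = run(aligned_prior, villain_turn, turns=2)

if __name__ == "__main__":
    for label, dist in [("base", base_after), ("aligned", aligned_after)]:
        top = max(dist, key=dist.get)
        print(f"{label}: top persona = {top}, entropy = {entropy(dist):.2f} bits")
```

Under these made-up numbers, two villain-favoring turns move the uniform prior to 0.75 on the villain, while the skewed prior still puts about 0.65 on the assistant. The toy does not settle whether that skew is best described as a realized disposition or as a very strong prior; it only shows that the two descriptions predict the same stickiness.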

The most interesting evidence comes from what intense alignment *costs*. Several notes show training doesn't just deepen a persona; it locks it. Safety alignment is linked to a monotonic decline in role-play fidelity as characters become less moral (from 3.21 for moral paragons to 2.62 for villains on the Moral RolePlay benchmark), with models substituting crude aggression for nuanced malevolence: the more a character conflicts with the trained disposition, the less able the model is to inhabit it [Does safety alignment harm models' ability to roleplay villains?]. Alignment also imposes a static communicative identity that can't switch register across contexts the way human pragmatics demand [Can language models adapt communication style to different contexts?]. And the trained disposition bleeds into judgment: RLHF biases models toward predicting conciliatory, benefit-oriented persuasion regardless of dialogue context, and pushes 'therapist' models toward reflexive problem-solving. In both cases a learned accommodation preference leaks out as if it were the model's own standing trait [Do LLMs predict persuasion based on actual dialogue or training bias?; Do LLM therapists respond to emotions like low-quality human therapists?]. These read less like 'pretense' and more like a disposition the model can't take off, which is exactly what 'realization' would predict.

Here's the thing the reader might not expect: realization at the *behavioral* level does not buy you a realized *individual*. Conditioning a model on a specific person's profile does not meaningfully improve individual-level prediction across 200,000+ participants [Does conditioning LLMs on personal profiles improve prediction?], and benchmark work suggests LLMs default to surface-level strategies rather than genuine mental simulation, a gap that looks architectural and not fixable by more training alone [Do large language models genuinely simulate mental states?]. Meanwhile the personas that *do* hold up statistically (replicating 84 of 111, or ~76%, of published experimental main effects) are population-level aggregates, not realized individuals [Can AI personas reliably replicate human experiment results?].

So the honest synthesis: alignment training does seem to move the assistant persona from costume toward fixture, but the corpus locates that in *persistence and inflexibility*, not in training intensity as a dial, and the realized thing is a generic disposition, not a person. The clearest counterweight is that drift is real and trainable in the other direction too: multi-turn RL aimed at consistency can cut persona drift in user simulators by over 55% [Can training user simulators reduce persona drift in dialogue?], which says realization is something you can also engineer deliberately rather than something that simply emerges from turning the alignment knob up.


Sources (11 notes)

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Does an LLM commit to a single character or maintain many?

Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.

Does safety alignment harm models' ability to roleplay villains?

The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.

Can language models adapt communication style to different contexts?

System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.

Do LLMs predict persuasion based on actual dialogue or training bias?

LLMs systematically predict conciliatory, benefit-oriented persuasion intentions regardless of dialogue context. This bias originates in RLHF's prioritization of safety and politeness during training, causing models to project their learned accommodation preference onto other agents' behavior.

Do LLM therapists respond to emotions like low-quality human therapists?

Using the BOLT framework, researchers found LLMs offer solution-focused advice during emotional disclosure—a hallmark of low-quality therapy—yet also reflect more on client needs and strengths than typical poor human therapy, creating an unusual hybrid profile likely driven by RLHF's helpfulness bias.

Does conditioning LLMs on personal profiles improve prediction?

Across 208,021 participants in the Psych-201 dataset, conditioning LLMs on participant profiles did not meaningfully improve predictions for specific individuals. The standard technique for individuation produces no measurable gains in person-level forecasting.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Can AI personas reliably replicate human experiment results?

Viewpoints AI reproduced 84 of 111 main effects from Journal of Marketing experiments with replication success strongly correlated to original p-value strength. Marginal effects showed unreliable performance with both false positives and negatives.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Next inquiring lines