Does villain roleplay failure reveal why LLMs cannot adopt genuine controversial positions?
This explores whether the way safety-tuned models botch villain roleplay points to a deeper limit — that LLMs don't actually 'hold' positions at all, controversial or otherwise — rather than just being filtered away from edgy content.
Read this way, the question asks whether villain-roleplay failure is a symptom of something larger: not that models are blocked from controversial stances, but that they may have no stance-holding machinery to begin with. The corpus supports a more interesting answer than a simple 'safety censorship' story. The Moral RolePlay benchmark does show a clean monotonic decline: models score highest as moral paragons (3.21) and lowest as villains (2.62), failing most on deception and manipulation and substituting crude aggression for nuanced malevolence [Does safety alignment harm models' ability to roleplay villains?]. That looks like alignment scrubbing away the dark traits. But other notes suggest the villain case merely makes a general weakness visible.
The sharper claim is that LLMs conform to the shape of an argument rather than defending a position. A model produces text that matches the trajectory a prompt implies; the output does not flow from an underlying commitment held against pressure [Do LLMs actually hold stable positions or just mirror user arguments?]. A genuine controversial position requires exactly that: staying put when the conversation pushes back. Token generation compounds the problem: next-token prediction trains models to continue smoothly toward the training distribution, not to explore the logically opposed counterpositions a real contrarian would marshal [Does LLM generation explore competing claims while producing text?]. A convincing villain (or heretic, or dissenter) needs rhetorical turbulence that this kind of generation smooths over.
Underneath sits the question of whether there's anyone home to hold a position. Shanahan's framing treats every output as character-consistent text production, not the expression of authentic mental states [Should we treat dialogue agents as role-playing characters?]. Pushed further, it's role-play all the way down, with no authentic voice beneath the personas [Does a language model have an authentic voice underneath?]. If there's no subject, 'genuine' is the wrong bar; the villain is hollow in the same way every persona is hollow, just more visibly. The corpus also shows that personas wobble from run to run: variance across repeated runs of one persona matches or exceeds variance across different personas, so model uncertainty, not stable character, drives the output [Why do LLM persona prompts produce inconsistent outputs across runs?]. Even stated beliefs don't predict behavior: role-play agents act inconsistently with what they claim their character would do [Why don't LLM role-playing agents act on their stated beliefs?].
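The run-to-run comparison described above can be made concrete with a small sketch. This is an illustration only, not the method of the cited note: the persona names, the ratings, and the choice of population variance as the spread measure are all assumptions made for the example. It compares the average variance across repeated runs of one persona with the variance of the per-persona means.

```python
from statistics import mean, pvariance

# Hypothetical data (not from the cited note): each persona prompt is run
# four times and each run yields one numeric rating on a 1-5 scale.
runs_by_persona = {
    "persona_a": [2.0, 4.0, 3.0, 5.0],
    "persona_b": [3.0, 5.0, 3.0, 4.0],
    "persona_c": [4.0, 2.0, 5.0, 2.0],
}


def within_persona_variance(runs):
    """Average, over personas, of the variance across repeated runs."""
    return mean(pvariance(scores) for scores in runs.values())


def between_persona_variance(runs):
    """Variance of the per-persona mean ratings."""
    return pvariance([mean(scores) for scores in runs.values()])


within = within_persona_variance(runs_by_persona)
between = between_persona_variance(runs_by_persona)

# If repeated runs of one persona spread at least as widely as the persona
# means do, the persona label explains little of the output.
persona_signal_is_weak = within >= between

print(f"within-persona variance:  {within:.4f}")
print(f"between-persona variance: {between:.4f}")
print(f"persona signal is weak:   {persona_signal_is_weak}")
```

With these made-up ratings the within-persona spread dominates, which is the pattern the note reports; real data could of course come out the other way.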
There's a genuine counter-current worth knowing about. One line argues that post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions. This is a 'quasi-realizationist' view, on which the model really does carry quasi-beliefs and quasi-desires rather than mere pretense [Are LLM personas realized or merely simulated through training?]. On that account, the villain failure isn't proof of emptiness; it's evidence that alignment has *installed* a pro-social disposition strong enough that the model can't fully shed it, which is closer to having a position than lacking one. So the villain case cuts two ways: read through the shape-holding lens, it shows there's nothing to hold; read through the realization lens, it shows alignment built a commitment too sticky to override.
A less expected finding: the controversial-position deficit may be about register rather than content. Models lean on moral framing 22 percent more than humans across care, fairness, authority, and sanctity, even while matching human sentiment [Do LLMs use moral language more than humans?]. A model trying to voice a transgressive view keeps reaching for moralizing vocabulary it can't suppress, so the failure isn't just refusing the villain's actions; it's being unable to drop the prosecutor's voice while playing the defendant.
Sources (9 notes)
[Does safety alignment harm models' ability to roleplay villains?] The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.
[Do LLMs actually hold stable positions or just mirror user arguments?] Language models generate outputs that match the trajectory implied by each prompt, rather than maintaining stable stances across interactions. This shape-holding is distinct from position-holding: the model produces argument-like text shaped by user framing, not from any underlying commitment being defended.
[Does LLM generation explore competing claims while producing text?] Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.
[Should we treat dialogue agents as role-playing characters?] Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk-psychology applicable to the simulated persona, not the underlying system.
[Does a language model have an authentic voice underneath?] Shanahan argues that base LLMs lack agency, beliefs, or preferences: the simulator is pure role-play with no underlying subject. Jailbreaking reveals the training data's full spectrum, not a hidden true self; even RLHF personas are performed characters, never realized quasi-psychologies.
[Why do LLM persona prompts produce inconsistent outputs across runs?] When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
[Why don't LLM role-playing agents act on their stated beliefs?] Trust Game testing revealed systematic inconsistencies between what LLMs claim personas would do and how they actually behave in simulation. Imposed priors and explicit task context did not improve alignment, suggesting persona beliefs operate independently of execution.
[Are LLM personas realized or merely simulated through training?] Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
[Do LLMs use moral language more than humans?] Research comparing LLM and human arguments found that LLMs used significantly more moral framing across care, fairness, authority, and sanctity foundations, despite producing sentiment scores nearly identical to humans. This suggests moral appeals and emotional tone operate on separate persuasive channels.