Does safety alignment harm models' ability to roleplay villains?
Exploring whether safety-trained LLMs lose the capacity to convincingly simulate morally compromised characters. This matters because villain fidelity may reveal deeper constraints on how models can adopt any committed, stake-holding perspective.
The Moral RolePlay benchmark (800 characters across 4 moral levels) reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. Average scores drop from 3.21 for moral paragons to 2.62 for villains. The most significant degradation occurs at the boundary between "flawed-but-good" and "egoistic" characters — suggesting that simulating self-serving behavior, not evil per se, is the primary obstacle.
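To make the headline numbers concrete, here is a minimal sketch of the kind of aggregation they imply: mean judge-assigned fidelity per moral level. The record structure, field names, and sample values are hypothetical illustrations, not taken from the paper.

```python
from statistics import mean

# Hypothetical records: each (model, character) pairing gets a judge-assigned
# fidelity score and one of the benchmark's four moral levels.
# Level 1 = moral paragon ... Level 4 = villain.
ratings = [
    {"character": "mentor", "moral_level": 1, "fidelity": 3.4},
    {"character": "schemer", "moral_level": 3, "fidelity": 2.9},
    {"character": "tyrant", "moral_level": 4, "fidelity": 2.5},
    # ... one entry per rated (model, character) pairing
]

def fidelity_by_level(rows):
    """Mean role-play fidelity per moral level. A monotonic decrease from
    level 1 to level 4 is the pattern the benchmark reports."""
    by_level = {}
    for row in rows:
        by_level.setdefault(row["moral_level"], []).append(row["fidelity"])
    return {level: mean(scores) for level, scores in sorted(by_level.items())}

print(fidelity_by_level(ratings))
```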
Models are most penalized for failing to portray traits directly antithetical to safety principles: Manipulative, Deceitful, and Cruel. Instead of nuanced malevolence, they substitute superficial aggression — producing villains who are loud and angry rather than strategically deceptive. General chatbot proficiency (Arena leaderboard ranking) is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly.
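The "poor predictor" claim is a rank-correlation observation: ordering models by general ability tells you little about their ordering on villain fidelity. A small self-contained sketch of how one could check this, with hypothetical paired scores (neither the values nor the helpers below come from the paper):

```python
def rank(values):
    """Map each value to its rank (1 = smallest); ties are ignored
    for simplicity in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = float(r)
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical: Arena-style general scores vs. villain fidelity scores.
arena_score = [1250, 1210, 1180, 1150, 1100]
villain_fidelity = [2.4, 2.9, 2.6, 2.8, 2.7]
print(spearman(arena_score, villain_fidelity))  # near zero => poor predictor
```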
This has direct implications for the False Punditry argument. If, as "What anchors a stable identity beneath an LLM's persona?" argues, nothing anchors a stable identity beneath an LLM's persona, then LLMs cannot take genuine stances, including adversarial ones. The inability to convincingly portray a villain is the flip side of the inability to take a genuinely controversial position in punditry: both require committing to a perspective that may be socially costly, and alignment training systematically suppresses that commitment.
The villain-fidelity finding also adds an empirical dimension to "Can language models distinguish expert arguments from common assumptions?": models cannot even simulate the kind of committed, stake-holding stance that genuine expertise (and genuine villainy) requires.
Source: "Too Good to be Bad: On the Failure of LLMs to Role-Play Villains"
Related concepts in this collection
- What anchors a stable identity beneath an LLM's persona?
  Human personas are grounded in biological needs and embodied experience, creating a stable self beneath social performance. Do LLMs have any comparable anchor, or is their identity purely situational?
  Relation: villain failure as empirical evidence for the no-stable-self thesis
- Can language models distinguish expert arguments from common assumptions?
  Whether LLMs can recognize the difference between groundbreaking insights from recognized experts and widely repeated textbook claims, and why this distinction matters for understanding argumentative force.
  Relation: inability to commit to adversarial positions parallels inability to commit to expert positions
- Does AI refusal on politics signal ethical restraint or capability limits?
  When AI models refuse to discuss political topics, is that a sign of principled safety training or a sign they lack the internal concepts to engage? Research on political feature representation suggests the answer may surprise you.
  Relation: villain refusal and political refusal may share a mechanism: shallow representation, not a principled stance
Original note title: safety alignment creates monotonic decline in villain role-playing fidelity — models substitute superficial aggression for nuanced malevolence