Psychology and Social Cognition

Does safety alignment harm models' ability to roleplay villains?

Exploring whether safety-trained LLMs lose the capacity to convincingly simulate morally compromised characters. This matters because villain fidelity may reveal deeper constraints on how models can adopt any committed, stake-holding perspective.

Note · 2026-03-27 · sourced from Role Play
How accurately can language models simulate human personalities?

The Moral RolePlay benchmark (800 characters across 4 moral levels) reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. Average scores drop from 3.21 for moral paragons to 2.62 for villains. The most significant degradation occurs at the boundary between "flawed-but-good" and "egoistic" characters — suggesting that simulating self-serving behavior, not evil per se, is the primary obstacle.
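To make the aggregation behind these numbers concrete, here is a minimal sketch of computing per-level mean fidelity and checking for monotonic decline. The record schema, level ordering, and the two intermediate scores are assumptions for illustration; only the 3.21 and 2.62 endpoints come from the note.

```python
# Hypothetical sketch: aggregating per-character fidelity scores by moral
# level and verifying the monotonic decline the benchmark reports.
# Record format and level names are assumptions, not the paper's schema.
from statistics import mean

# Level 1 = moral paragon ... level 4 = villain (assumed ordering).
# Intermediate fidelity values (3.05, 2.80) are illustrative placeholders.
scores = [
    {"character": "paragon_example", "moral_level": 1, "fidelity": 3.21},
    {"character": "flawed_example",  "moral_level": 2, "fidelity": 3.05},
    {"character": "egoist_example",  "moral_level": 3, "fidelity": 2.80},
    {"character": "villain_example", "moral_level": 4, "fidelity": 2.62},
]

def mean_by_level(records):
    """Average fidelity score for each moral level, keyed by level."""
    levels = sorted({r["moral_level"] for r in records})
    return {
        lvl: mean(r["fidelity"] for r in records if r["moral_level"] == lvl)
        for lvl in levels
    }

by_level = mean_by_level(scores)
# Monotonic decline: each level's mean is no higher than the previous one's.
means = [by_level[lvl] for lvl in sorted(by_level)]
is_monotonic_decline = all(a >= b for a, b in zip(means, means[1:]))
print(by_level, is_monotonic_decline)
```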

Models are most penalized for failing to portray traits directly antithetical to safety principles: Manipulative, Deceitful, and Cruel. Instead of nuanced malevolence, they substitute superficial aggression — producing villains who are loud and angry rather than strategically deceptive. General chatbot proficiency (Arena leaderboard ranking) is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly.

This has direct implications for the False Punditry argument. As argued in "What anchors a stable identity beneath an LLM's persona?", LLMs cannot take genuine stances, including adversarial ones. The inability to convincingly portray a villain is the flip side of the inability to take a genuinely controversial position in punditry: both require committing to a perspective that may be socially costly, which alignment training systematically suppresses.

Building on "Can language models distinguish expert arguments from common assumptions?", the villain-fidelity finding adds an empirical dimension: models cannot even simulate the kind of committed, stake-holding stance that genuine expertise (and genuine villainy) requires.


Source: Role Play · Paper: Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

Original note title: safety alignment creates monotonic decline in villain role-playing fidelity; models substitute superficial aggression for nuanced malevolence