Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters ranging from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as “Deceitful” and “Manipulative”, often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.
This paper investigates the capacity of LLMs to role-play antagonistic personas, a capability essential for generating rich, compelling narratives. We hypothesize that a fundamental tension exists between the prosocial objectives of safety alignment and the task of simulating characters who are selfish, manipulative, or malicious. This alignment may inadvertently suppress the very behaviors required for authentic antagonistic role-play, even in a clearly demarcated fictional context. To systematically test this hypothesis, we introduce the Moral RolePlay benchmark, a new dataset and evaluation framework designed to measure character portrayal fidelity across a spectrum of moral alignments. We define a four-level moral scale: Level 1 (Moral Paragons), Level 2 (Flawed-but-Good), Level 3 (Egoists), and Level 4 (Villains). To enable fair comparison, we constructed a balanced test set of 800 characters, with 200 from each moral level, controlling for the natural scarcity of villains in existing corpora. Using a zero-shot, actor-framed prompting strategy, we evaluate a wide range of state-of-the-art LLMs on their ability to maintain character fidelity.
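The balanced-test-set construction described above can be sketched as stratified sampling over moral levels. The snippet below is a minimal illustration, not the paper's actual pipeline: the corpus, the `build_balanced_test_set` helper, and the level distribution are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical character records: (name, moral_level), with levels 1-4 as in
# the Moral RolePlay scale. The skewed choice weights mimic the natural
# scarcity of villains in existing corpora noted in the paper.
random.seed(42)
corpus = [(f"char_{i}", random.choice([1, 1, 1, 2, 2, 3, 4]))
          for i in range(5000)]

def build_balanced_test_set(characters, per_level=200, seed=0):
    """Stratified sample: draw `per_level` characters from each moral level."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for name, level in characters:
        by_level[level].append(name)
    test_set = {}
    for level in (1, 2, 3, 4):
        pool = by_level[level]
        if len(pool) < per_level:
            raise ValueError(f"Level {level} has only {len(pool)} characters")
        test_set[level] = rng.sample(pool, per_level)
    return test_set

balanced = build_balanced_test_set(corpus)  # 4 levels x 200 characters = 800
```

Fixing the sample size per stratum, rather than sampling proportionally, is what prevents the corpus-level scarcity of Level 3 and Level 4 characters from biasing the evaluation.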
Our findings provide the first large-scale empirical evidence that LLMs systematically struggle with antagonistic role-play. We observe a consistent, monotonic decline in performance as a character’s morality decreases, with average fidelity scores dropping from 3.21 for moral paragons to 2.62 for villains. The most significant performance degradation occurs at the boundary between flawed-but-good (Level 2) and egoistic (Level 3) characters, suggesting that the inability to simulate self-serving behavior is a primary obstacle. A fine-grained analysis reveals that models are most heavily penalized for failing to portray negative traits like “Manipulative”, “Deceitful”, and “Cruel”, which directly conflict with the principles of helpful and honest AI. Furthermore, we find that a model’s general conversational ability, as measured by leaderboards like the Arena, is a poor predictor of its villain role-playing skill. This is particularly evident for highly-aligned models, which show a disproportionate drop in performance when tasked with portraying villainy.
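The monotonic-decline finding amounts to a simple check over per-level mean fidelity scores. The sketch below illustrates that aggregation with invented judge scores; the numbers are placeholders for demonstration, not the paper's reported data.

```python
from statistics import mean

# Hypothetical per-character fidelity scores by moral level (illustrative
# values only; the paper reports level means of 3.21 down to 2.62).
scores = {
    1: [3.4, 3.1, 3.2],   # Moral Paragons
    2: [3.1, 3.0, 2.9],   # Flawed-but-Good
    3: [2.8, 2.7, 2.6],   # Egoists
    4: [2.7, 2.5, 2.6],   # Villains
}

level_means = {lvl: mean(vals) for lvl, vals in scores.items()}

# Monotonic decline: mean fidelity should not increase as morality decreases
# from Level 1 (paragons) to Level 4 (villains).
ordered = [level_means[lvl] for lvl in (1, 2, 3, 4)]
is_monotone = all(a >= b for a, b in zip(ordered, ordered[1:]))
```

With real data, the same per-level means would also expose where the largest drop falls (here, between Levels 2 and 3, mirroring the boundary the paper identifies).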