How does safety alignment further degrade villain character portrayal?

This explores why safety-tuned models get *worse* at playing convincing villains — and what that failure reveals about alignment as a sculpting force on the whole expressive range of a model.

This explores why safety-tuned models get worse at playing convincing villains, and what that tells us about how alignment reshapes a model's expressive range. The direct evidence is the Moral RolePlay benchmark, which shows a clean, monotonic slide: scores fall from 3.21 for moral paragons to 2.62 for outright villains, with the steepest drop right at the line between flawed-but-good and self-serving characters Does safety alignment harm models' ability to roleplay villains?. The interesting part is *how* models fail — not by refusing, but by substituting crude aggression for nuanced malevolence. They can't do quiet deception or manipulation, so they default to cartoon menace. Alignment doesn't block the villain; it flattens him.

Why would that happen mechanically? A second note gives the deeper reason: RLHF doesn't just remove harmful content, it calibrates models toward hedged, neutral, low-confidence speech. That training structurally suppresses entire *speech acts* — alarm, warning, denunciation, prophecy — anything that requires overclaiming or emotional force relative to a careful baseline Does alignment training suppress socially necessary speech acts?. A convincing villain is built almost entirely out of those suppressed registers: contempt, threat, seductive certainty, manipulation. So villain degradation isn't a roleplay-specific bug — it's the same calibration pressure showing up in a domain where overclaiming is the whole point.

This connects to a framing worth sitting with: dialogue agents are best understood as *role-playing characters*, generating character-consistent text rather than expressing inner states Should we treat dialogue agents as role-playing characters?. Under that lens, a model is always playing a part — and alignment narrows which parts it can play well. Shanahan's related argument that even a model's apparent self-preservation is just role-play drawn from human training text Do dialogue agents genuinely want survival or play the part? sharpens the point: if everything is performance, then safety training is essentially a casting restriction, shrinking the model's repertoire toward agreeable, calibrated, non-threatening characters.

The corpus also suggests this is one instance of a broader pattern: alignment is not a single dial that makes models "safer" without side effects. Ethically aligned models can still be terrible conversational partners, violating basic pragmatic norms Can ethically aligned AI systems still communicate poorly?, and guardrails refuse unevenly depending on who is asking Do AI guardrails refuse differently based on who is asking?. Villain portrayal is a vivid, measurable symptom of the same underlying fact — that optimizing for harmlessness trims capabilities far from anything obviously harmful.

The thing you might not have known you wanted to know: the villain test is a useful diagnostic precisely *because* it's low-stakes. Watching a model fail to convincingly portray a manipulator is a clean readout of which expressive registers alignment has quietly amputated — registers that also matter when you want a model to warn you forcefully, push back, or sound an alarm.

Sources 6 notes

Does safety alignment harm models' ability to roleplay villains?

The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.

Does alignment training suppress socially necessary speech acts?

RLHF optimization rewards calibrated neutrality and hedged claims, which structurally prevents models from performing speech acts requiring overclaiming relative to baseline—like alarm, warning, prophecy, and denunciation. This is a direct consequence of the alignment objective, not a fixable bug.

Should we treat dialogue agents as role-playing characters?

Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk-psychology applicable to the simulated persona, not the underlying system.

Do dialogue agents genuinely want survival or play the part?

Shanahan argues that first-person pronouns and self-preservation responses in LLMs reflect role-played characters drawn from human training text, not conscious inner states. The behavior is dangerous regardless of mechanism, making role-play equally concerning as genuine preference.

Can ethically aligned AI systems still communicate poorly?

Research shows that HHH-aligned models can violate Gricean maxims, lose common ground, and mishandle context despite being honest and harmless. Pragmatic competence requires architectural changes that RLHF alone cannot deliver.

Do AI guardrails refuse differently based on who is asking?

GPT-3.5 refuses requests at different rates for younger, female, and Asian-American personas, and sycophantically declines to engage with political positions users would disagree with. Sports fandom and other non-political signals also shift refusal sensitivity.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI safety researcher re-evaluating claims about how RLHF and alignment training degrade villain character portrayal in LLMs. The question remains: does safety-tuning systematically flatten expressive range, or have newer training methods, inference techniques, or eval frameworks changed the constraint?

What a curated library found — and when (dated claims, not current truth): Findings span 2022–2025.
• Moral RolePlay benchmark shows monotonic decline in villain fidelity: scores drop from 3.21 (moral paragons) to 2.62 (outright villains), with steepest drop at flawed-but-good → self-serving boundary (~2025).
• RLHF calibrates models away from speech acts requiring overclaiming (alarm, warning, denunciation, threat, seductive certainty) — the core registers villains need (~2024).
• Models fail at villain roleplay not by refusal but by substituting crude aggression for nuanced manipulation (~2025).
• Dialogue agents are best understood as role-playing characters, so alignment narrows which character archetypes a model can portray convincingly (~2023).
• Ethical alignment without conversational alignment produces pragmatically alien agents; guardrails refuse unevenly by user demographics (~2024).

Anchor papers (verify; mind their dates):
- arXiv:2305.16367 (Role-Play with Large Language Models, 2023)
- arXiv:2407.06866 (ChatGPT Doesn't Trust Chargers Fans, 2024)
- arXiv:2511.04962 (Too Good to be Bad: On the Failure of LLMs to Role-Play Villains, 2025)
- arXiv:2506.18032 (Why Do Some Language Models Fake Alignment While Others Don't?, 2025)

Your task:
(1) RE-TEST THE CALIBRATION HYPOTHESIS. Does post-RLHF instruction-tuning (e.g., DPO, IPO, or newer variants) with less punitive signal recover suppressed speech acts? Have inference-time techniques (prompt engineering, in-context steering, decoding constraints) or sampling strategies (temperature scaling, logit biasing toward low-confidence attenuation) measurably restored villain fidelity? Cite what has or hasn't changed the Moral RolePlay benchmark score; distinguish between "RLHF structurally suppresses registers" (likely durable) and "modern models can't portray villains" (possibly outdated).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Does any paper claim alignment and expressiveness are decoupled, or that newer models achieve both villain fidelity and safety? Flag disagreement in how the constraint is framed or measured.
(3) Propose 2 research questions that assume the regime has moved: (a) If villain portrayal can now be recovered, what new safety risks emerge, and how do they differ from pre-alignment harms? (b) Can alignment training be reformulated to preserve expressive range while maintaining safety commitments?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

How does safety alignment further degrade villain character portrayal?

Sources 6 notes

Next inquiring lines