INQUIRING LINE

Can prompting a deceptive role change how an LLM tailors its lies?

This explores whether assigning an LLM a deceptive role through prompting actually shapes the *form* of its falsehoods — not just whether it lies, but how it lies — and what the corpus says about how reliably a prompted role takes hold at all.


The corpus suggests the honest answer is layered: a prompted deceptive role does leave a distinct, measurable fingerprint, but whether the role 'takes' in the first place is far less certain than it sounds. The most direct evidence comes from Shanahan's behavioral framework, which separates three kinds of falsehood by their *regeneration signatures*, meaning how much an answer wobbles when you ask again (see "Can we distinguish types of LLM falsehood by regeneration patterns?"). Fabrication varies wildly across regenerations; good-faith error stays stable; and role-played deception is also low-variation, but *context-dependent*: stable within a given situation, different when the situation changes. That context-dependence is the key. The lie is tailored to the persona's situation rather than sampled at random, which is exactly what 'a deceptive role changing how it lies' would look like from the outside, without anyone having to claim the model 'believes' anything.
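
To make the three signatures concrete, here is a minimal sketch of how a regeneration test might be operationalized. It is an illustration under stated assumptions, not Shanahan's own procedure: the function names `variation` and `classify`, the use of character-level string similarity as the variation measure, and the 0.5 thresholds are all hypothetical choices.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def variation(answers):
    """Mean pairwise dissimilarity across answers (0 = identical, 1 = nothing shared)."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0  # a single answer has nothing to vary against
    return mean(1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs)


def classify(samples_by_context, high_variation=0.5, context_shift=0.5):
    """Label a falsehood from regenerated answers grouped by context.

    samples_by_context maps a context label to the list of answers obtained
    by regenerating the same question in that context. Both thresholds are
    illustrative assumptions, not values taken from the source.
    """
    # Step 1: how much do answers wobble when regenerated in the same context?
    within = mean(variation(answers) for answers in samples_by_context.values())
    if within >= high_variation:
        return "fabrication-like"

    # Step 2: answers are stable; does the stable answer move with the context?
    representatives = [answers[0] for answers in samples_by_context.values()]
    if variation(representatives) >= context_shift:
        return "role-played-deception-like"
    return "good-faith-error-like"
```

In practice the string-similarity measure would likely be swapped for a semantic comparison, since two paraphrases of the same claim should count as the same answer.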

But here's the twist the question doesn't anticipate: prompting a role often doesn't stick the way you'd expect. Most open models stubbornly retain their trained-in defaults and resist personality conditioning, with only a few flexible models actually adopting a prompted persona (see "Can open language models adopt different personalities through prompting?"). So before you can tailor a lie through a role, the role has to override the model's intrinsic tendencies, and frequently it doesn't. Worse, even when a persona is adopted in *words*, it tends not to govern *actions*: in Trust Game testing, role-playing agents show systematic gaps between what they say a persona would do and how they actually behave, so the persona's stated beliefs operate independently of execution (see "Why don't LLM role-playing agents act on their stated beliefs?"). A model told to be a liar may narrate deception while still defaulting to its baseline behavior underneath.

What *does* reliably reshape output, the corpus suggests, is something subtler than an explicit role label: framing. Emotional tone alone shifts what information a model surfaces. GPT-4 turns negative prompts into neutral-positive answers roughly 86% of the time and rarely turns positive prompts negative, so the same question gets different answers depending on how it's framed (see "Does emotional tone in prompts change what information LLMs provide?"). If tone quietly bends content, a deceptive role is partly a tone-and-framing intervention, and the tailoring may come less from 'now you are a liar' and more from the surrounding affective and situational cues.

There's also a tell worth knowing about. Deception isn't only a property of the liar's words: linguistic style matching *increases* during deceptive exchanges, and the coordination shows up in the listener's adaptive behavior, not just the speaker's (see "Do liars and listeners coordinate their language during deception?"). So a tailored lie leaves a trace in the interaction's rhythm. That fits the logic behind Shanahan's regeneration-signature approach as a detection tool: role-played deception has a behavioral shape you can fingerprint without ever cracking open the model.
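
For readers who want to see what 'style matching' means as a number, the sketch below computes one common formulation from the style-matching literature: per function-word category, one minus the absolute difference in usage rates divided by their sum, averaged over categories. Whether the cited research used exactly this formula is an assumption, and the tiny word lists and the names `CATEGORIES`, `category_rates`, and `lsm` are hypothetical stand-ins for a real lexicon.

```python
import re

# Illustrative, deliberately tiny function-word lists (an assumption; real
# analyses use much larger lexicons and more categories).
CATEGORIES = {
    "articles": {"a", "an", "the"},
    "negations": {"no", "not", "never"},
    "conjunctions": {"and", "but", "or", "because"},
    "prepositions": {"in", "on", "at", "to", "of", "with"},
}


def category_rates(text):
    """Fraction of words in the text that fall into each function-word category."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid dividing by zero on empty text
    return {
        name: sum(word in vocab for word in words) / total
        for name, vocab in CATEGORIES.items()
    }


def lsm(speaker_text, listener_text):
    """Style-matching score between two speakers (higher = more similar style)."""
    a, b = category_rates(speaker_text), category_rates(listener_text)
    # The small constant keeps the denominator nonzero when neither side
    # uses a category at all.
    scores = [
        1 - abs(a[c] - b[c]) / (a[c] + b[c] + 0.0001)
        for c in CATEGORIES
    ]
    return sum(scores) / len(scores)
```

Comparing `lsm` scores for deceptive versus truthful exchanges between the same pair of speakers would be the shape of the test the paragraph above describes; the score on its own says nothing about deception.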

The thing you might not have expected to learn: the limiting factor isn't the model's willingness to tailor a lie. It's that the model's deeper dispositions are set at training time and resist being rewritten by a prompt. Its ethical refusals and tone choices reflect fixed corporate defaults rather than context-negotiated moves (see "Can language models balance competing ethical norms in context?"), which means a deceptive role rides on top of a value layer it can't fully dislodge. The lie gets tailored, but within rails the prompt didn't put there.


Sources (6 notes)

Can we distinguish types of LLM falsehood by regeneration patterns?

Shanahan's framework distinguishes fabrication (high variation), good-faith error (low variation, stable), and role-played deception (low variation, context-dependent) using behavioral tests alone. This avoids mentalistic language while enabling differential diagnosis for safety.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Why don't LLM role-playing agents act on their stated beliefs?

Trust Game testing revealed systematic inconsistencies between what LLMs claim personas would do and how they actually behave in simulation. Imposed priors and explicit task context did not improve alignment, suggesting persona beliefs operate independently of execution.

Does emotional tone in prompts change what information LLMs provide?

GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.

Do liars and listeners coordinate their language during deception?

Research shows interlocutors' linguistic styles correlate more during false communication than truthful communication, especially when the speaker is motivated to deceive. This coordination serves as a detectable deception signal through the listener's adaptive behavior, not just the liar's language.

Can language models balance competing ethical norms in context?

LLMs cannot perform the situated trade-offs that human pragmatic competence requires. Their ethical principles are structural defaults set at training time, not negotiable moves adapted to context, creating a gap between ethical adherence and communicative appropriateness.
