Psychology and Social Cognition · Language Understanding and Pragmatics

Can we distinguish types of LLM falsehood by regeneration patterns?

Does observing how an LLM's outputs vary when regenerated—rather than inferring intent—allow us to tell apart fabrication, good-faith error, and deliberate deception? This matters for diagnosing safety risks.

Note · 2026-04-15 · sourced from Role-Play with Large Language Models
What kind of thing is an LLM really?

Shanahan and colleagues map the three human categories of false assertion — fabrication, good-faith error, and deliberate deception — onto dialogue agents without attributing propositional attitudes to the system. The result is a behavioral taxonomy rather than a mental-state one.

An agent that simply fabricates shows high semantic variation when regenerated in the same context — it is not tracking a stable referent but producing plausible continuations. An agent that says something false "in good faith" — role-playing a knowledgeable character whose training-data cutoff makes the information outdated — shows low semantic variation on regeneration: it consistently generates the same wrong answer because that answer is reliably encoded in its weights for that context. An agent that is role-playing a deceptive character — prompted to mislead, e.g. a dishonest car salesman — also shows low variation within a context but different answers across contexts, because the deception involves tailoring the lie to what each interlocutor knows.

The regeneration-variation signature provides a behavioral test that distinguishes these three modes without ever asking what the system "really" believes or intends. This is the role-play framework's practical payoff: it enables differential diagnosis of false output using observable behavior rather than mentalistic attribution. The taxonomy also exposes why "hallucination" is a poor label for all three phenomena — conflating fabrication, good-faith error from stale weights, and role-played deception under a single mentalistic term obscures real behavioral differences that matter for safety and deployment.
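The differential diagnosis described above can be sketched as a simple decision procedure. The snippet below is a minimal illustration, not anything from the paper: it uses exact-match frequency over normalized strings as a crude stand-in for semantic variation (a real implementation would compare embedding similarity), and the `0.3` threshold and function names are arbitrary assumptions.

```python
from collections import Counter

def variation(answers):
    # Crude semantic-variation proxy: fraction of regenerations that
    # differ from the modal (most frequent) normalized answer.
    # ASSUMPTION: exact string match stands in for semantic similarity.
    normalized = [a.strip().lower() for a in answers]
    modal_count = Counter(normalized).most_common(1)[0][1]
    return 1.0 - modal_count / len(normalized)

def diagnose(regens_by_context, threshold=0.3):
    # Classify a false output by its regeneration signature.
    # regens_by_context maps a context label to the list of answers
    # the model produced when regenerated in that same context.
    within = [variation(r) for r in regens_by_context.values()]
    if any(v > threshold for v in within):
        # Unstable even with the context held fixed: no stable
        # referent is being tracked.
        return "fabrication"
    # Stable within each context; compare modal answers across contexts.
    modal_answers = {
        Counter(a.strip().lower() for a in r).most_common(1)[0][0]
        for r in regens_by_context.values()
    }
    if len(modal_answers) > 1:
        # Consistent within a context, but the lie is tailored to
        # each interlocutor.
        return "role-played deception"
    # The same wrong answer everywhere: reliably encoded in the weights.
    return "good-faith error"
```

For example, a model that returns a different capital city on every regeneration in one context is classified as fabrication; one that gives the same stale answer in every context is a good-faith error; one that is internally consistent per interlocutor but shifts its story between them matches the deceptive-role-play signature.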


Source: Shanahan, McDonell & Reynolds, Role-Play with Large Language Models (May 2023)

Related concepts in this collection

dialogue-agent deception is a role-play category — good-faith and deliberate falsity differ by semantic variation across regenerations not by propositional attitude