What does the 20-questions test reveal about LLM character consistency?
This explores Shanahan's '20-questions regeneration test' — a thought experiment about whether an LLM is really 'being' one character — and what it tells us about how (in)consistent LLM personas actually are.
Shanahan's '20-questions regeneration test' asks whether an LLM ever truly commits to a single character. The short version: it doesn't. In the classic setup, you play 20 questions with the model. Because the model never wrote down an answer, regenerating its reply yields a *different* secret object each time, every one consistent with the questions asked so far. The test falsifies the intuition that there's a fixed 'someone' behind the responses (see 'Do large language models actually commit to a single character?'). Instead, the model holds a *superposition* of many consistent characters and samples one at generation time (see 'Does an LLM commit to a single character or maintain many?'). Consistency, when you see it, comes from what's been said so far narrowing the distribution, not from an underlying commitment.
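The regeneration idea can be sketched as a toy simulation. This is only an illustration under stated assumptions, not how an LLM works internally: the candidate objects, their attributes, and the function names `consistent_candidates` and `regenerate_reveal` are all hypothetical. Each "regeneration" samples any object that agrees with the answers given so far, so no single object was ever committed to.

```python
import random

# Hypothetical toy world: each candidate object has yes/no attributes.
CANDIDATES = {
    "cat": {"alive": True, "bigger_than_breadbox": False, "pet": True},
    "goldfish": {"alive": True, "bigger_than_breadbox": False, "pet": True},
    "oak tree": {"alive": True, "bigger_than_breadbox": True, "pet": False},
    "toaster": {"alive": False, "bigger_than_breadbox": False, "pet": False},
    "piano": {"alive": False, "bigger_than_breadbox": True, "pet": False},
}


def consistent_candidates(history):
    """Return every object that agrees with all (attribute, answer) pairs so far."""
    return [
        name
        for name, attrs in CANDIDATES.items()
        if all(attrs[attr] == answer for attr, answer in history)
    ]


def regenerate_reveal(history, rng):
    """Simulate one regeneration of the reveal: sample any consistent object."""
    return rng.choice(consistent_candidates(history))


# Two questions have been answered; several objects still fit.
history = [("alive", True), ("bigger_than_breadbox", False)]
rng = random.Random(0)

# Regenerate the reveal many times and collect the distinct answers.
reveals = {regenerate_reveal(history, rng) for _ in range(50)}
print(sorted(reveals))
```

Every reveal is consistent with the history, yet the reveals need not agree with each other. Appending more question-and-answer pairs to `history` shrinks the consistent set, which is the toy analogue of context narrowing the distribution.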
The striking part is that this isn't just a quirk of playing games. The same instability shows up when researchers try to use personas for real work. Run the same persona prompt many times and the variance *between runs* can match or exceed the variance *between different personas* (see 'Why do LLM persona prompts produce inconsistent outputs across runs?'), meaning the noise from resampling can drown out the signal of the persona itself. And the obvious fix, setting temperature to zero, doesn't rescue you: it just freezes one draw from the distribution, which looks reliable but is still a single arbitrary sample (see 'Does setting temperature to zero actually make LLM outputs reliable?'). The 20-questions test names a structural fact that these empirical results then confirm.
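One way to make the between-runs versus between-personas comparison concrete is a simple variance decomposition. The sketch below is an assumption about how such a check could be set up, not the method used in the cited work; the function name `variance_decomposition`, the persona labels, and the scores are all made up for illustration.

```python
from statistics import mean, pvariance


def variance_decomposition(scores_by_persona):
    """Split score variance into a within-persona and a between-persona part.

    scores_by_persona maps a persona label to a list of numeric scores,
    one per rerun of the same persona prompt.
    Returns (within, between):
      within  = average variance across reruns of the same persona
      between = variance of the per-persona mean scores
    """
    within = mean([pvariance(runs) for runs in scores_by_persona.values()])
    between = pvariance([mean(runs) for runs in scores_by_persona.values()])
    return within, between


# Hypothetical ratings (1-5) from four reruns of each persona prompt.
scores = {
    "optimist": [4, 2, 5, 3],
    "skeptic": [3, 5, 1, 3],
    "neutral": [2, 4, 4, 2],
}

within, between = variance_decomposition(scores)
print(f"within-persona variance:  {within:.3f}")
print(f"between-persona variance: {between:.3f}")
```

When `within` is as large as or larger than `between`, as in this invented data, rerunning the same persona changes the output more than switching personas does. A single zero-temperature run would hide that spread rather than remove it, since it reports only one of the possible draws.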
What you make of this depends on how far you push it. Shanahan's strong reading is that it's role-play all the way down: there's no authentic voice underneath, no hidden true self that jailbreaking reveals, only the training data's full spectrum (see 'Does a language model have an authentic voice underneath?'). But there's a live counter-position worth knowing about: a 'quasi-realizationist' view argues that post-training actually *installs* robust personas that resist adversarial pressure and behave like substrate-level dispositions, so the persona is realized rather than merely performed (see 'Are LLM personas realized or merely simulated through training?'). The 20-questions test cuts hardest against the naive 'fixed character' view; it leaves this more sophisticated debate open.
Where it gets practically interesting is that character *can* be made stickier, at a cost. Persona consistency tends to trade off against staying on-topic: high persona-adherence scores often come from the model parroting its character description while ignoring the actual conversation (see 'Do persona consistency metrics actually measure dialogue quality?'). And consistency isn't morally neutral either: safety alignment erodes a model's ability to inhabit villains, with role-play quality falling as characters move from moral paragons to villains and crude aggression substituting for nuanced malevolence, so the 'character' you get is partly an artifact of training pressures (see 'Does safety alignment harm models' ability to roleplay villains?'). Meanwhile, narrative grounding pulls the other way: give a model a character's persona profile and retrieved memories and it predicts that character's choices better (see 'Can LLMs predict character choices from narrative context?'). So consistency turns out to be something you *engineer through context*, not something the model possesses.
The thing you didn't know you wanted to know: the 20-questions test shows that 'is the model being consistent?' is the wrong question. The model is always sampling. What looks like a stable character is the conversation history having quietly collapsed a cloud of possible characters into a narrow band. That is also why a model can ace structured persona tasks yet default to surface-level mimicry the moment things go open-ended (see 'Do large language models genuinely simulate mental states?').
Sources (10 notes)
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.
Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Shanahan argues that base LLMs lack agency, beliefs, or preferences—the simulator is pure role-play with no underlying subject. Jailbreaking reveals the training data's full spectrum, not a hidden true self; even RLHF personas are performed characters, never realized quasi-psychologies.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
High persona adherence scores often come from copying character descriptions while ignoring query relevance. MUDI jointly optimizes both by using discourse relations and graph-based coherence modeling alongside persona fidelity, showing that persona and context must be optimized together, not separately.
The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.
The LIFECHOICE benchmark (1,462 decisions across 388 novels) shows LLMs predict character choices better when given expert-written persona profiles paired with retrieved memories relevant to the character's psychology. This persona-based approach outperforms automated summarization by 5%.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.