Why do LLM persona prompts produce inconsistent outputs across runs?
Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
A persistent challenge in NLI annotation is that human annotators genuinely disagree — not from error, but because the same sentence carries different readings for people with different social positions, ideological backgrounds, or domain expertise. The proposed solution: instruct LLMs to simulate different annotator personas and generate a distribution of labels that reflects human disagreement.
The approach fails for a specific reason: LLM outputs under persona prompting are not stable enough across runs to be meaningful as persona simulations. When the same persona prompt ("respond as a conservative rural voter", "respond as a medical professional") is run multiple times on the same input, the variance in the output distribution across runs is comparable to or larger than the variance across different personas. This means model uncertainty is dominating persona-specific knowledge — the spread in outputs reflects what the model doesn't confidently know, not what different social groups actually think differently.
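The comparison above can be made concrete. A minimal sketch, assuming hypothetical per-run label distributions over the three NLI classes (entailment, neutral, contradiction); in a real experiment each row would come from repeatedly sampling the model under one persona prompt and normalizing the label counts:

```python
import numpy as np

# Hypothetical per-run label distributions (entailment, neutral,
# contradiction) for one NLI item, one row per run. Illustrative
# numbers only, not measured results.
runs_per_persona = {
    "conservative rural voter": np.array([
        [0.6, 0.3, 0.1],
        [0.2, 0.5, 0.3],
        [0.5, 0.2, 0.3],
    ]),
    "medical professional": np.array([
        [0.4, 0.4, 0.2],
        [0.6, 0.1, 0.3],
        [0.3, 0.5, 0.2],
    ]),
}

def mean_pairwise_tv(dists):
    """Mean total-variation distance between all pairs of distributions."""
    n = len(dists)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return np.mean([0.5 * np.abs(dists[i] - dists[j]).sum() for i, j in pairs])

# Within-persona spread: run-to-run distance, averaged over personas.
within = np.mean([mean_pairwise_tv(d) for d in runs_per_persona.values()])

# Between-persona spread: distance between each persona's mean distribution.
means = np.stack([d.mean(axis=0) for d in runs_per_persona.values()])
between = mean_pairwise_tv(means)

print(f"within-persona: {within:.3f}, between-persona: {between:.3f}")
# The failure mode described above is within >= between: the persona
# signal is drowned out by run-to-run sampling noise.
```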
This is a different diagnosis from simply "LLMs don't know what different groups believe." The more precise claim is: even if the model has relevant group-specific information, it is not stably retrievable under the persona prompt. The persona acts more like a temperature modifier (loosening the output distribution) than a grounding anchor (fixing the output to a specific knowledge domain).
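The temperature-modifier versus grounding-anchor distinction makes a testable prediction: an anchoring persona should reduce the entropy of the output distribution (committing to a group-specific reading), while a temperature-like persona should leave it flat or raise it. A toy check with fabricated distributions standing in for aggregated model outputs on a single item:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

# Illustrative label distributions for one NLI item, not measured data.
base = [0.50, 0.35, 0.15]          # no persona prompt
with_persona = [0.40, 0.35, 0.25]  # persona prompt added

# A grounding anchor would lower entropy; the pattern described
# above is the opposite: entropy stays flat or rises, as if the
# persona merely raised the sampling temperature.
print(f"base: {entropy(base):.3f}, persona: {entropy(with_persona):.3f}")
```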
The implication for NLI research methodology is significant: persona-based annotation simulation cannot substitute for actual diverse human annotation panels. The goal was to cheaply approximate human annotation disagreement distributions; the actual output approximates model uncertainty distributions, which have a different shape and origin.
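One way to quantify "a different shape" is to measure the divergence between the persona-simulated label distribution and the one produced by an actual human panel. A sketch with fabricated numbers, using SciPy's Jensen-Shannon distance:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Illustrative label distributions for one NLI item.
human_panel = np.array([0.70, 0.10, 0.20])  # from a diverse annotator pool
persona_sim = np.array([0.45, 0.35, 0.20])  # aggregated persona-prompted runs

# jensenshannon returns the JS distance (square root of the divergence).
print(f"JS distance: {jensenshannon(human_panel, persona_sim):.3f}")
# A systematically nonzero distance across many items would support the
# claim that the simulated distribution tracks model uncertainty rather
# than the human disagreement distribution it was meant to approximate.
```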
This connects to "Why do language models fail confidently in specialized domains?": both findings point to the same underlying gap, namely that LLMs produce confidently framed outputs even when their underlying representations are uncertain or thin. In the overconfidence case, the model is wrong and certain; in the persona-instability case, the model is uncertain and emits that uncertainty as if it were persona variance.
The broader implication for "Why do readers interpret the same sentence so differently?" is that the multiplicity of human interpretations is grounded in actual social diversity, not just distributional uncertainty. LLMs can approximate the form of disagreement (varied outputs) but not its substance (stable, group-grounded positions). When this instability carries over into evaluation, "Why do LLM judges fail at predicting sparse user preferences?" identifies persona sparsity as the specific mechanism: run-to-run variance overwhelms persona variance because sparse persona profiles cannot constrain model predictions. The uncertainty documented here is thus the root cause of personalized judge failure.
Enrichment (2026-02-22, from Arxiv/Personas Personality): Instability is one of three persona failure modes. The "Open Models, Closed Minds" study identifies a complementary failure, resistance: most open LLMs retain their intrinsic ENFJ-like personality despite persona conditioning, failing to shift to the target personality at all. See "Can open language models adopt different personalities through prompting?". The third failure mode is cognitive distortion: when persona assignment does take hold, it induces motivated reasoning; political personas are up to 90% more likely to validate identity-congruent evidence. See "Do personas make language models reason like biased humans?". Together these form a three-way persona failure taxonomy: instability (this note), resistance (closed-minded), and distortion (motivated reasoning).
Source: Natural Language Inference
Related concepts in this collection
- "Why do language models fail confidently in specialized domains?" LLMs perform poorly on clinical and biomedical inference tasks while remaining overconfident in their wrong answers. Do standard benchmarks hide this fragility, and can prompting techniques fix it? Relation: both findings show LLM outputs don't reliably track underlying epistemic state.
- "Why do readers interpret the same sentence so differently?" How much of annotation disagreement in NLP reflects genuine interpretive multiplicity rather than error? This explores whether social position and moral framing systematically generate competing but equally valid readings. Relation: human disagreement is socially grounded; persona simulation cannot replicate that grounding.
- "Do classical knowledge definitions apply to AI systems?" Classical definitions of knowledge assume truth-correspondence and a human knower. Do these assumptions hold for LLMs and distributed neural knowledge systems, or do they need fundamental revision? Relation: unstable persona outputs are another manifestation of LLMs lacking the social situatedness that grounds stable perspective-taking.
- "Can open language models adopt different personalities through prompting?" Explores whether open LLMs can be conditioned to mimic target personalities via prompting, or whether they resist and retain their default traits regardless of instructions. Relation: a complementary failure mode, resistance versus instability.
- "Do personas make language models reason like biased humans?" When LLMs are assigned personas, do they develop the same identity-driven reasoning biases that humans exhibit? And can standard debiasing techniques counteract these effects? Relation: the third failure mode; when personas take hold, they introduce cognitive biases.
Original note title: llm persona-simulated annotations are unstable across runs indicating model uncertainty dominates persona-specific knowledge