Why does model uncertainty dominate persona-specific knowledge in annotation tasks?
This explores why, when you ask an LLM to role-play a person and label data, the model's own instability seems to drown out whatever 'knowledge' the persona was supposed to bring. The cleanest evidence comes from a simple test: run the same persona prompt many times and measure how much the output wobbles. It turns out that the variation across repeated runs of one persona matches or exceeds the variation between *different* personas (see: Why do LLM persona prompts produce inconsistent outputs across runs?). In other words, if you can't tell whether two answers came from two different people or from the same prompt run twice, the persona isn't doing the work; the model's randomness is. That makes these simulations unreliable for the very thing researchers wanted them for: reproducing how real human annotators disagree.
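The repeated-run test can be sketched in a few lines. This is a minimal illustration, not the method used in the underlying note: it assumes labels are numeric (for example a 1 to 5 rating), and the function names, the `runs_by_persona` structure, and the sample data are all hypothetical.

```python
from statistics import mean, pvariance


def variance_decomposition(runs_by_persona):
    """Return (within, between) variance for numeric persona labels.

    runs_by_persona maps a persona name to the labels produced by
    repeated runs of that persona's prompt (hypothetical data layout).
    """
    # Within-persona: average run-to-run variance of a single persona prompt.
    within = mean(pvariance(labels) for labels in runs_by_persona.values())
    # Between-persona: variance of the per-persona mean labels.
    between = pvariance([mean(labels) for labels in runs_by_persona.values()])
    return within, between


def persona_signal_ratio(runs_by_persona):
    """Rough share of variation attributable to the persona (0.0 to 1.0)."""
    within, between = variance_decomposition(runs_by_persona)
    total = within + between
    return between / total if total > 0 else 0.0


# Made-up labels: every persona wobbles a lot but averages the same value.
noisy = {
    "nurse": [1, 5, 2, 4],
    "lawyer": [2, 4, 1, 5],
    "teacher": [3, 3, 1, 5],
}
print(persona_signal_ratio(noisy))
```

A ratio near zero means swapping the persona changes the labels no more than rerunning the same prompt does, which is the failure pattern described above. A ratio near one would mean the persona, not sampling noise, explains most of the spread.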
The deeper reason is that prompting can only stir what's already in the model, not pour in something new. Prompt optimization retrieves and reorganizes a model's existing training distribution but cannot supply knowledge it never learned (see: Can prompt optimization teach models knowledge they lack?). A persona prompt is a bet that somewhere in the weights lives a stable, retrievable 'social self' for this character. When that bet fails, the model falls back on its default behavior, and on uncertain tasks default behavior is noisy. There's a neat predictor of when this happens: prompt sensitivity tracks model confidence. Confident models resist rephrasing and stay put; low-confidence ones swing wildly with tiny prompt changes (see: Does model confidence predict robustness to prompt changes?). Persona-conditioned annotation lands squarely in the low-confidence, high-swing zone.
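One way to check the confidence-versus-sensitivity relationship on your own data is to compare how often labels flip under paraphrased prompts for low-confidence and high-confidence items. The sketch below is an assumption-laden illustration, not ProSA's metric: the `(confidence, labels)` input format, the 0.7 threshold, and both function names are hypothetical.

```python
from collections import Counter


def flip_rate(labels):
    """Fraction of paraphrase runs whose label differs from the most common one."""
    modal_count = Counter(labels).most_common(1)[0][1]
    return 1 - modal_count / len(labels)


def sensitivity_by_confidence(items, threshold=0.7):
    """Average flip rate for low- and high-confidence items.

    items is a list of (confidence, labels_across_paraphrases) pairs;
    both the format and the threshold are illustrative assumptions.
    """
    buckets = {"low": [], "high": []}
    for confidence, labels in items:
        key = "high" if confidence >= threshold else "low"
        buckets[key].append(flip_rate(labels))
    # None marks a bucket with no items rather than a zero flip rate.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


# Made-up example: one confident item, one uncertain item.
items = [
    (0.9, ["A", "A", "A", "A"]),
    (0.4, ["A", "B", "C", "A"]),
]
print(sensitivity_by_confidence(items))
```

If the pattern described above holds, the low-confidence bucket should show a clearly higher average flip rate than the high-confidence one.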
What makes this especially treacherous is that the model doesn't *know* it's guessing. LLMs lack reliable self-knowledge: their confident outputs don't track their actual accuracy, and users over-trust them anyway (see: How well do language models understand their own knowledge?). So a persona simulation can sound assured while being driven entirely by sampling noise. The annotation literature compounds the problem from the other side: human annotations themselves aren't one thing. They split into genuine preferences, non-attitudes, and on-the-spot constructed preferences, which behave very differently across measurement conditions (see: Do all annotation responses measure the same underlying thing?). A lot of what looks like 'persona-specific' human disagreement is actually people constructing answers in the moment, and an LLM has nothing stable to imitate there, so it improvises, which reads as uncertainty.
Here's the turn that makes this worth knowing: the fix isn't a cleverer persona; it's grounding and calibration. When personas are extracted from real domain documents rather than invented, multi-agent evaluations become reproducible and transfer across tasks (see: Can personas extracted from documents generalize across evaluation tasks?). When user simulators are explicitly trained for consistency with reward signals, persona drift drops by over 55% (see: Can training user simulators reduce persona drift in dialogue?). And small models trained to *abstain when unsure* match models ten times larger at forecasting under uncertainty: the calibration ability exists, it's just undertrained in standard LLMs (see: Can models learn to abstain when uncertain about predictions?). Uncertainty dominates persona knowledge not because the trait is unreachable, but because nothing in standard prompting anchors the model or teaches it to flag its own wobble. Give it a real anchor or a reason to abstain, and the persona signal re-emerges from under the noise.
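The abstention work referenced above trains models to abstain. A much simpler inference-time stand-in, shown here only as a sketch, is to run the persona prompt several times and keep a label only when the runs agree. It uses run-to-run agreement rather than the model's stated confidence, since stated confidence is unreliable. The function name, the 0.8 threshold, and the sample labels are hypothetical.

```python
from collections import Counter


def label_or_abstain(run_labels, min_agreement=0.8):
    """Return (label, agreement), with label set to None when runs disagree.

    run_labels holds the labels from repeated runs of one persona prompt
    on one item; min_agreement is an illustrative threshold, not a
    recommended value.
    """
    if not run_labels:
        raise ValueError("need at least one run")
    label, count = Counter(run_labels).most_common(1)[0]
    agreement = count / len(run_labels)
    if agreement >= min_agreement:
        return label, agreement
    # Too much wobble across runs: flag the item instead of guessing.
    return None, agreement


print(label_or_abstain(["toxic"] * 9 + ["ok"]))
print(label_or_abstain(["toxic", "ok", "toxic", "ok", "ok"]))
```

Items that come back as `None` are the ones where sampling noise, not the persona, is most likely driving the answer, so they are natural candidates for human review or for exclusion from a simulated-annotator study.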
Sources (8 notes)
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.