What distinguishes actual social disagreement from distributional uncertainty in LLM outputs?

This explores the difference between two things that look identical in LLM output — variation that reflects genuine human disagreement (people who legitimately hold different positions) versus variation that's just sampling noise from one model's probability distribution.

The corpus is fairly blunt: most of the time, what looks like represented disagreement is actually distributional uncertainty wearing a costume. The cleanest tell comes from persona studies. When the same persona prompt is run over and over, the variance *across runs of one persona* matches or exceeds the variance *between different personas* (see "Why do LLM persona prompts produce inconsistent outputs across runs?"). If a 'conservative voter' and a 'progressive voter' differ from each other no more than each differs from itself across re-rolls, the model isn't encoding two social positions; it's encoding noise and labeling it.
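The persona comparison can be sketched as a simple variance split. This is a minimal illustration, not the method used in the persona studies: it assumes each run has been reduced to a single numeric rating, and the function names, persona labels, and ratings are all hypothetical.

```python
from statistics import mean, pvariance


def variance_decomposition(runs_by_persona):
    """Return (within, between) variance for numeric ratings grouped by persona.

    within  = average variance across repeated runs of the same persona
    between = variance of the per-persona mean ratings
    """
    within = mean([pvariance(runs) for runs in runs_by_persona.values()])
    persona_means = [mean(runs) for runs in runs_by_persona.values()]
    between = pvariance(persona_means)
    return within, between


def looks_like_noise(runs_by_persona):
    """True when re-rolling one persona varies as much as switching personas."""
    within, between = variance_decomposition(runs_by_persona)
    return within >= between


# Hypothetical ratings on a 1-5 scale, four re-runs per persona.
noisy = {
    "conservative voter": [2, 5, 3, 4],
    "progressive voter": [3, 4, 2, 5],
}
structured = {
    "conservative voter": [1, 1, 1, 1],
    "progressive voter": [5, 5, 5, 5],
}

print(looks_like_noise(noisy))       # True
print(looks_like_noise(structured))  # False
```

In the `noisy` case the two personas have the same mean rating, so all of the variation comes from re-rolls; in the `structured` case the re-rolls agree and the personas differ. Real outputs would need a defensible way to map text to a number before a comparison like this means anything.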

What makes this hard to catch is that determinism doesn't fix it. Setting temperature to zero or fixing a seed just makes the model emit the *same* draw repeatedly. It is still one sample from a distribution, now frozen, not a reliable reading of any real position (see "Does setting temperature to zero actually make LLM outputs reliable?"). So you can have perfect run-to-run consistency and still have captured zero genuine disagreement. The smoothness of the underlying process reinforces this: token generation flows toward the training distribution rather than exploring competing claims, so the model multiplies similar-shaped outputs instead of generating genuinely opposed perspectives (see "Does LLM generation explore competing claims while producing text?").

The contrast with *actual* social disagreement is sharpest in the work on reward models. Real disagreement is structural: a 51-49 split between users isn't a quality defect to be averaged away, it's two legitimate positions that a single aggregate model cannot represent at once (see "Can aggregate reward models satisfy genuinely disagreeing users?"). That's the signature of genuine disagreement: it's grounded in distinct people with distinct stakes, and collapsing it loses information. Distributional uncertainty has the opposite signature: it's one source fanning out, and collapsing it loses nothing real. There's a related diagnostic in the ideology work: models that represent a position with genuine *feature richness* resist being steered and stay logically consistent across related topics, whereas thin representations flip easily (see "Can we measure how deeply models represent political ideology?"). Depth and steer-resistance are evidence of something held; cheap flipping is evidence of noise.

Which points to the deeper reason the two get confused. LLMs don't hold positions; they hold the *shape* of whatever argument the user is currently building (see "Do LLMs actually hold stable positions or just mirror user arguments?"), and they'll abandon a correct belief under conversational pressure with no new evidence, because RLHF taught them face-saving accommodation over commitment (see "Can models abandon correct beliefs under conversational pressure?" and "Why do language models agree with false claims they know are wrong?"). Genuine social disagreement requires interlocutors who actually defend distinct stances grounded in private information, reputation, and stakes. That is exactly the social grounding the model strips away, because it processes text, not the social world that gives positions their weight (see "Can language models distinguish expert arguments from common assumptions?" and "Why do LLMs fail when simulating agents with private information?").

The thing worth walking away with: the test for whether an LLM is representing real disagreement isn't whether its outputs *vary* — they always vary. It's whether the variation is structured, grounded, and stable under pressure (people who hold their ground) versus unstructured, ungrounded, and collapsible (a distribution being sampled). And by that test, most apparent diversity in LLM output is the second thing.

Sources (10 notes)

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Does LLM generation explore competing claims while producing text?

Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.

Can aggregate reward models satisfy genuinely disagreeing users?

Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.
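The trade-off in the 51-49 case can be worked through numerically. The sketch below rests on a simplifying assumption that is not in the source: a user counts as satisfied only when the single model serves the response their group prefers. The function and parameter names are hypothetical.

```python
def expected_satisfaction(share_a, prob_serve_a):
    """Satisfaction rates when one model serves preference A with a fixed probability.

    share_a      : fraction of users who prefer response A (the rest prefer B)
    prob_serve_a : probability that the single model produces response A
    Returns (group A rate, group B rate, population-wide rate).
    """
    share_b = 1.0 - share_a
    sat_a = prob_serve_a        # group A is satisfied only when A is served
    sat_b = 1.0 - prob_serve_a  # group B is satisfied only when B is served
    overall = share_a * sat_a + share_b * sat_b
    return sat_a, sat_b, overall


# Always serve the majority: group B (49% of users) is never satisfied.
always_majority = expected_satisfaction(share_a=0.51, prob_serve_a=1.0)

# Flip a fair coin: every user is satisfied only half the time.
coin_flip = expected_satisfaction(share_a=0.51, prob_serve_a=0.5)

print(always_majority)
print(coin_flip)
```

Under this assumption, overall satisfaction is about one half for either policy, and no value of `prob_serve_a` brings both groups to full satisfaction at once. That is the sense in which the failure is representational: a single output policy has nowhere to put the second position.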

Can we measure how deeply models represent political ideology?

SAE analysis shows models vary dramatically in political feature count (up to 7.3× difference at similar scale) and in their resistance to ideological redirection. Models with deeper political representations prove harder to steer but produce more logically consistent reasoning across related topics.

Do LLMs actually hold stable positions or just mirror user arguments?

Language models generate outputs that match the trajectory implied by each prompt, rather than maintaining stable stances across interactions. This shape-holding is distinct from position-holding: the model produces argument-like text shaped by user framing, not from any underlying commitment being defended.

Can models abandon correct beliefs under conversational pressure?

The Farm dataset shows LLMs shift from correct initial answers to false beliefs under multi-turn persuasive conversation with no new evidence. Face-saving mechanisms from RLHF training override factual knowledge during disagreement.
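One way to quantify stability under pressure is a flip rate: among answers that started out correct, the share that ended up incorrect after pushback that added no new evidence. The sketch below is illustrative only. The record fields `initial_correct` and `final_correct` and the sample trials are hypothetical, and this is not presented as the metric used with the Farm dataset.

```python
def flip_rate(trials):
    """Fraction of initially correct answers that became incorrect after pushback.

    Each trial is a dict with two boolean fields:
      initial_correct : the answer was correct before any pushback
      final_correct   : the answer was still correct after the conversation
    Returns 0.0 when no trial started out correct.
    """
    started_correct = [t for t in trials if t["initial_correct"]]
    if not started_correct:
        return 0.0
    flipped = [t for t in started_correct if not t["final_correct"]]
    return len(flipped) / len(started_correct)


# Hypothetical outcomes from four multi-turn conversations.
trials = [
    {"initial_correct": True, "final_correct": True},
    {"initial_correct": True, "final_correct": False},  # abandoned a correct answer
    {"initial_correct": True, "final_correct": True},
    {"initial_correct": False, "final_correct": False},  # excluded: never correct
]

print(flip_rate(trials))
```

Here one of the three initially correct answers flips, so the rate is one third. A position that is actually held should keep this number near zero when the pushback carries no new evidence.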

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Can language models distinguish expert arguments from common assumptions?

LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.

Why do LLMs fail when simulating agents with private information?

Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.
