Why do models confabulate inconsistently across different samples?
This explores why a model's made-up answers vary from one run to the next — why the same prompt yields a confident fabrication one time and a different (or correct) answer the next — rather than treating confabulation as a fixed flaw.
The corpus points to a simple root: every answer a model gives is one *draw* from a probability distribution, not a lookup of a stored fact. Pinning temperature to zero does not change this. The decoder simply returns the same high-probability path every time, so the output is still one point from that distribution that happens to repeat, which is why deterministic settings produce consistency without producing reliability (see: Does setting temperature to zero actually make LLM outputs reliable?). When the model actually *knows* something, the distribution is sharply peaked and nearly every sample lands in the same place. When it doesn't, the distribution is diffuse, and each sample wanders somewhere different. Confabulation is what diffuse sampling looks like from the outside.
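To make the peaked-versus-diffuse picture concrete, here is a minimal sketch in Python. The logits, the four candidate answers, and the helper names (`softmax`, `agreement_rate`, `greedy_pick`) are illustrative assumptions, not values or functions from any real model.

```python
import math
import random
from collections import Counter

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def agreement_rate(probs, answers, n_samples=1000, seed=0):
    """Fraction of sampled answers that match the most frequently sampled one."""
    rng = random.Random(seed)
    draws = rng.choices(answers, weights=probs, k=n_samples)
    top_count = Counter(draws).most_common(1)[0][1]
    return top_count / n_samples

def greedy_pick(probs, answers):
    """What temperature zero amounts to: always return the single most likely answer."""
    return answers[probs.index(max(probs))]

answers = ["A", "B", "C", "D"]            # hypothetical candidate answers
known = softmax([10.0, 0.0, 0.0, 0.0])    # sharply peaked: the model "knows"
unknown = softmax([1.0, 0.9, 1.1, 1.0])   # diffuse: the model is guessing

known_agreement = agreement_rate(known, answers)      # close to 1.0
unknown_agreement = agreement_rate(unknown, answers)  # far below 1.0
repeatable_guess = greedy_pick(unknown, answers)      # same answer every run, still a guess
```

Under these assumptions, `greedy_pick` on the diffuse distribution returns the same answer on every call even though that answer holds only a sliver more probability than its rivals: consistent, but not reliable.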
A vivid version of this comes from the 'simulator' framing: an LLM doesn't commit to one character or one belief but holds a superposition of plausible continuations, sampling a fresh one each time you regenerate (see: Does an LLM commit to a single character or maintain many?). That's exactly why persona prompts buckle. When you run the same persona repeatedly, the spread across runs matches or exceeds the spread across *different* personas, revealing that raw model uncertainty, not any stable knowledge, is steering the output (see: Why do LLM persona prompts produce inconsistent outputs across runs?). The inconsistency isn't noise on top of a real answer; it *is* the signal that there was no settled answer underneath.
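The persona comparison above boils down to a within-versus-between variance check, sketched here in Python. The persona names and the ratings are made-up illustrative numbers, and `variance_split` is a hypothetical helper, not the procedure used in the cited note.

```python
from statistics import mean, pvariance

def variance_split(runs_by_persona):
    """Return (within, between): the average variance across repeated runs of
    one persona, and the variance of the per-persona mean scores."""
    within = mean([pvariance(runs) for runs in runs_by_persona.values()])
    between = pvariance([mean(runs) for runs in runs_by_persona.values()])
    return within, between

# Hypothetical 1-5 ratings from four repeated runs of each persona prompt.
ratings = {
    "optimist": [4, 2, 5, 3],
    "skeptic": [3, 5, 2, 2],
    "expert": [2, 4, 3, 5],
}

within, between = variance_split(ratings)
persona_is_unstable = within >= between
```

When `within` matches or exceeds `between`, rerunning one persona moves the output at least as much as switching personas does, which is the pattern the note describes.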
This is precisely why inconsistency turns out to be *useful* rather than merely annoying. Semantic entropy detects confabulations by sampling several answers, clustering them by whether they mean the same thing, and measuring how scattered the meanings are. High scatter flags a fabrication, with no task-specific training required (see: Can we detect when language models confabulate?). The cross-sample variance you're asking about is the detector. The deeper cause of *where* the distribution goes diffuse shows up in work on reasoning failure: models break not at some complexity threshold but at instance-level *unfamiliarity*. They pattern-match to training instances rather than running a general algorithm, so an unfamiliar input drops them into the high-uncertainty regime where samples diverge (see: Do language models fail at reasoning due to complexity or novelty?). Relatedly, the real risk lives in *combinations* of entities that never co-occur in the pretraining data: those are exactly where the model improvises, and improvisation samples differently each time (see: Can pretraining data statistics detect hallucinations better than model confidence?).
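The sample, cluster, and measure loop for semantic entropy can be sketched in a few lines of Python. This is a simplified count-based estimate, not the published implementation: `same_meaning` is a hypothetical stand-in that compares normalized strings, whereas the source notes describe clustering by bidirectional entailment, which would need an entailment model. The sampled answers are invented for illustration.

```python
import math

def same_meaning(a, b):
    """Hypothetical stand-in for a bidirectional-entailment check:
    here two answers 'mean the same' if they match after normalization."""
    def normalize(text):
        return " ".join(text.lower().replace(".", "").split())
    return normalize(a) == normalize(b)

def cluster_by_meaning(answers, equivalent=same_meaning):
    """Greedily group answers; each answer joins the first cluster whose
    representative it is equivalent to, or starts a new cluster."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            if equivalent(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters

def semantic_entropy(answers, equivalent=same_meaning):
    """Entropy (in nats) over the share of samples falling in each meaning cluster."""
    clusters = cluster_by_meaning(answers, equivalent)
    shares = [len(cluster) / len(answers) for cluster in clusters]
    return sum(-p * math.log(p) for p in shares)

consistent = ["Paris", "paris.", "Paris"]      # one meaning, three samples
scattered = ["1912", "1915", "1921", "1912"]   # three meanings, four samples

low = semantic_entropy(consistent)
high = semantic_entropy(scattered)
```

A threshold on the resulting score would be a tuning choice for a given task; none is implied by the source.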
Worth knowing: this isn't a bug a better model will sand away. Formal results prove that any computable LLM must hallucinate on infinitely many inputs, and that internal self-correction can't eliminate it. The variability is structural, which is why external safeguards (retrieval triggers, entropy checks) are necessary rather than optional (see: Can any computable LLM truly avoid hallucinating?). The reframe the corpus offers is this: stop treating cross-sample inconsistency as a failure to be suppressed, and start treating it as the most honest confidence signal the model gives you. A consistent answer might still be wrong, but a *scattered* one is the model telling you, structurally, that it's guessing.
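One of the external safeguards named above, a retrieval trigger keyed on unseen entity combinations, might look roughly like the Python sketch below. It is not QuCo-RAG's actual implementation: the co-occurrence table, its counts, the `min_count` threshold, and the function name `needs_retrieval` are all assumptions made for illustration.

```python
from itertools import combinations

def needs_retrieval(entities, cooccurrence_counts, min_count=1):
    """Return True if any pair of entities in the query co-occurs fewer than
    min_count times in the (hypothetical) pretraining statistics."""
    for pair in combinations(sorted(set(entities)), 2):
        if cooccurrence_counts.get(pair, 0) < min_count:
            return True
    return False

# Hypothetical co-occurrence counts, keyed by alphabetically sorted entity pairs.
counts = {
    ("Marie Curie", "Nobel Prize"): 120,
}

seen_pair = needs_retrieval(["Marie Curie", "Nobel Prize"], counts)     # pair is well attested
unseen_pair = needs_retrieval(["Marie Curie", "Fields Medal"], counts)  # pair never co-occurs
```

The design point is that the check reads the data side rather than the model's stated confidence, so it can still fire when the model sounds certain.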
Sources (7 notes)
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.
QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).
Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.