What happens when we treat LLM outputs as sampled rather than stored?

This explores a reframing: that every LLM output is a *draw* from a probability distribution shaped by training, not a fact *looked up* in a store — and what that one shift explains about reliability, hallucination, and why phrasing changes answers.

Stop picturing an LLM as a database you query and start picturing it as a sampler: every answer is one draw from a probability distribution, not a record fetched from storage. That single move reorganizes a surprising amount of the corpus's findings.

Start with the most counterintuitive consequence: even pinning the model down doesn't make it reliable. Setting temperature to zero with a fixed seed gives you the *same* output every time, but that output is still just one draw, the one assembled from the most probable token at each step, and 'most probable' is not the same as 'correct' (see: Does setting temperature to zero actually make LLM outputs reliable?). Repeated identical answers feel like certainty and are really just a frozen sample. The sampling frame tells you to ask 'how is the whole distribution shaped?' rather than 'what did it say?'
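A minimal sketch of that distinction, using a toy three-token distribution. The token list, the scores, and the helper names `softmax` and `pick` are all invented for illustration and come from no real model or library. Greedy decoding at temperature zero returns the same token on every run, and here that token happens to be the wrong one.

```python
import math
import random

# Hypothetical next-token scores for "The capital of Australia is ...".
# The numbers are invented so that a wrong answer narrowly leads.
TOKENS = ["Sydney", "Canberra", "Melbourne"]
LOGITS = [2.0, 1.6, 0.5]

def softmax(logits, temperature):
    """Turn raw scores into probabilities at a given temperature (> 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick(temperature, rng):
    """Greedy argmax at temperature 0, otherwise one weighted random draw."""
    if temperature == 0:
        best = max(range(len(LOGITS)), key=lambda i: LOGITS[i])
        return TOKENS[best]
    probs = softmax(LOGITS, temperature)
    return rng.choices(TOKENS, weights=probs, k=1)[0]

# 100 runs each; the seed only matters when temperature is above zero.
greedy_runs = {pick(0, random.Random(seed)) for seed in range(100)}
sampled_runs = [pick(1.0, random.Random(seed)) for seed in range(100)]

print(greedy_runs)                # a single, repeated answer
print(sorted(set(sampled_runs)))  # more than one answer once sampling is on
```

The repeated greedy answer says nothing about whether the answer is right. It only reports which token carries the most mass in this invented distribution.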

And the distribution is shaped by frequency, not meaning. Two prompts that mean exactly the same thing produce answers of systematically different quality because the model registers the *statistical mass* a phrasing carried in pre-training, so higher-frequency wordings win (see: Why do semantically identical prompts produce different LLM outputs?). A storage model can't explain that (a database returns the same record regardless of how you phrase the lookup); a sampling model predicts it exactly. The same lens predicts *where* models fail: framing them as autoregressive probability machines let researchers correctly forecast that logically trivial tasks with low-probability target strings, such as counting letters or reciting the alphabet backwards, would be systematically hard (see: Can we predict where language models will fail?). The difficulty isn't logical, it's distributional.
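To see why a low-probability target string is hard for an autoregressive model, recall that the probability of a whole sequence is the product of its per-token conditional probabilities. The sketch below uses made-up per-token probabilities (0.9 for a familiar string, 0.3 for an unfamiliar one); these numbers are assumptions, not measurements, and `sequence_log_prob` is a hypothetical helper.

```python
import math

def sequence_log_prob(step_probs):
    """Log-probability of a sequence: the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical per-token probabilities for two five-token targets.
familiar = [0.9] * 5    # e.g. a string seen often in training
unfamiliar = [0.3] * 5  # e.g. the same content in a rarely seen order

# How many times more likely the familiar string is, as a whole.
ratio = math.exp(sequence_log_prob(familiar) - sequence_log_prob(unfamiliar))
print(round(ratio))  # 3 ** 5 = 243
```

A modest per-token gap multiplies into a large gap over the whole string, which is one way to read 'the difficulty is distributional'.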

This is also why hallucination won't go away. If output is sampling, there is always nonzero probability mass on wrong continuations, and indeed three formal theorems show every computable LLM must hallucinate on infinitely many inputs, no architecture exempt (see: Can any computable LLM truly avoid hallucinating?). A retrieval system can in principle return 'not found'; a sampler always returns *something*. The sampling frame explains the iterative-method failures too: asked to optimize, models don't *execute* a procedure, they recognize a template and emit plausible-looking sampled values that are often wrong (see: Do large language models actually perform iterative optimization?). And it reframes the human-language comparison: people use language to address one another, while the model produces strings by drawing from a distribution; same surface, different operation underneath (see: Are language models and human speakers doing the same thing?).
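The retrieval-versus-sampler contrast can be sketched in a few lines. Everything here is a toy: the `STORE` dictionary, the `lookup` and `sample_answer` helpers, and the invented weights are assumptions for illustration, not a model of any real system.

```python
import random

# Hypothetical one-record store.
STORE = {"capital of France": "Paris"}

def lookup(query):
    """Retrieval: returns the stored record, or None when nothing matches."""
    return STORE.get(query)

def sample_answer(weights, rng):
    """Sampling: always returns one of the candidates, drawn by weight."""
    candidates = list(weights)
    return rng.choices(candidates, weights=[weights[c] for c in candidates], k=1)[0]

# A query with no true answer.
missing = lookup("capital of Atlantis")

# Invented continuation weights; even an abstention is just one more
# candidate string with some share of the mass.
guess_weights = {"Poseidonis": 0.40, "Atlantis City": 0.35, "unknown": 0.25}
drawn = sample_answer(guess_weights, random.Random(0))

print(missing)  # None: the store can report absence
print(drawn)    # some candidate string: the sampler cannot
```

In this toy, the only way for the sampler to abstain is for an abstaining string to win the draw, which is why checks applied from outside the draw matter.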

The quietly alarming part is what sampling does over time. Each step's draw becomes the next step's input, so errors don't average out; they compound. Frontier models silently corrupt about a quarter of document content across long delegated relays, never plateauing through 50 round-trips (see: Do frontier LLMs silently corrupt documents in long workflows?), and in multi-turn conversation a single early bad draw (a premature assumption) locks in and can't be recovered (see: Why do language models fail in gradually revealed conversations?). If outputs were stored facts, a wrong one would just sit there inertly. Because they're sampled and fed forward, a wrong draw becomes the seed of the next one. The takeaway you didn't know you wanted: most things people call 'reliability problems' are really *sampling* problems, and the fixes that work are the ones that constrain or verify the draw from outside the model rather than hoping the next sample lands right.
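A back-of-the-envelope sketch of compounding, under a deliberately simple assumption: each round-trip independently preserves a constant fraction of the content. That assumption is for illustration only and is not a claim about how the measured degradation actually unfolds; the 25% loss and 50 round-trips are the figures quoted above, and both function names are hypothetical.

```python
def surviving_fraction(per_step_fidelity, steps):
    """Fraction of content left after `steps` round-trips, assuming each
    step independently preserves `per_step_fidelity` of what it received."""
    return per_step_fidelity ** steps

def fidelity_for_target(target_fraction, steps):
    """Per-step fidelity that would leave `target_fraction` after `steps`."""
    return target_fraction ** (1 / steps)

# What constant per-step fidelity would produce a 25% loss over 50 steps?
per_step = fidelity_for_target(0.75, 50)
print(f"per-step fidelity: {per_step:.4f}")

# The same small per-step loss, viewed at different relay lengths.
for steps in (1, 10, 25, 50):
    print(steps, f"{surviving_fraction(per_step, steps):.3f}")
```

Under this toy model a loss of well under one percent per round-trip is enough to reach a quarter of the document by step 50, which is why a per-step error rate that looks negligible is not reassuring once draws are fed forward.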

Sources (8 notes)

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Why do semantically identical prompts produce different LLM outputs?

Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Can any computable LLM truly avoid hallucinating?

Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize an optimization problem as resembling a familiar template and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Are language models and human speakers doing the same thing?

LLMs produce strings via probability distributions; humans use language to address and relate to others. They share surface form but differ in what produces output, what it does socially, and what receivers should do with it.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings because they lock into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
