Does setting temperature to zero actually make LLM outputs reliable?
Explores whether deterministic LLM settings that produce consistent outputs also guarantee reliable judgments, and how to measure true reliability beyond surface consistency.
"Can You Trust LLM Judgments?" (2024) introduces a rigorous framework for evaluating LLM-as-a-Judge reliability using McDonald's omega, revealing that the common practice of using fixed seeds and deterministic settings provides false confidence.
The core argument: even with deterministic settings, a single LLM output is one sample from the model's probability distribution. Setting temperature to zero and fixing the seed produces "fixed randomness": the same output every time, but that output may still be a misleading draw from the distribution. Consistent replication does not guarantee reliability. A perfectly calibrated LLM that says it is 90% confident should be correct 9 out of 10 times, yet it can still be unreliable if its judgment distribution has high variance.
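A minimal numeric sketch of this point, using an invented toy judge whose verdict for a fixed input follows a 60/40 distribution (all numbers here are assumptions for illustration, not from the paper): greedy decoding replicates perfectly while hiding the 40% of probability mass sitting on the opposite verdict.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical judge: for one fixed input, P(pass) = 0.6, P(fail) = 0.4.
# These probabilities are invented for illustration.
judge_distribution = {"pass": 0.6, "fail": 0.4}

def greedy_verdict(dist):
    # Temperature 0 / fixed seed: always returns the modal verdict.
    return max(dist, key=dist.get)

def sampled_verdict(dist, rng):
    # Temperature > 0: one draw from the verdict distribution.
    return rng.choice(list(dist), p=list(dist.values()))

# Greedy decoding replicates perfectly: 100/100 identical outputs...
greedy = [greedy_verdict(judge_distribution) for _ in range(100)]
print(len(set(greedy)))  # 1 -- "fixed randomness", not evidence of reliability

# ...but sampling reveals the variance the greedy decode was hiding.
sampled = [sampled_verdict(judge_distribution, rng) for _ in range(100)]
print(sum(v == "fail" for v in sampled) / 100)  # ~0.4
```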
The framework: prompt the judgment LLM 100 times on each input, varying only the replication (the random draw) while holding the prompt, model, and all other factors constant. Apply McDonald's omega to assess internal consistency across these replications. This reveals whether the model's judgments are stable properties of the input or artifacts of the sampling process.
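A sketch of how such an omega could be computed, assuming a score matrix with judged inputs as rows and the 100 replications as columns, and using the `factor_analyzer` package for the one-factor model (the paper's exact pipeline may differ). Omega total is (Σλ)² / ((Σλ)² + Σψ), where λ are the standardized factor loadings and ψ the uniquenesses.

```python
import numpy as np
from factor_analyzer import FactorAnalyzer  # pip install factor_analyzer

def mcdonalds_omega(scores: np.ndarray) -> float:
    """Omega total from a one-factor model.

    scores: (n_items, n_replications) matrix of judge scores --
    rows are the inputs being judged, columns are repeated draws.
    """
    fa = FactorAnalyzer(n_factors=1, rotation=None)
    fa.fit(scores)                     # factors the correlation matrix
    loadings = fa.loadings_[:, 0]      # one standardized loading per replication
    uniquenesses = fa.get_uniquenesses()
    total = loadings.sum() ** 2
    return total / (total + uniquenesses.sum())

# Simulated example: 200 judged inputs, 100 replications each.
rng = np.random.default_rng(1)
true_quality = rng.normal(size=(200, 1))        # stable property of each input
noise = rng.normal(scale=0.8, size=(200, 100))  # sampling variation per draw
scores = true_quality + noise
print(mcdonalds_omega(scores))  # high omega: judgments track the input, not the draw
```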
The distinction between reliability, confidence, and calibration is critical:
- Reliability: consistency of judgments across multiple draws
- Confidence: the model's self-assessed certainty
- Calibration: alignment between stated confidence and actual correctness
These three are intertwined but distinct. A model can be well-calibrated (confident when right) but unreliable (different answers on different draws). A model can be reliable (always gives the same answer) but poorly calibrated (consistently and confidently wrong).
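A toy simulation of this decoupling, with invented numbers: judge A is calibrated but unreliable, judge B reliable but miscalibrated.

```python
import numpy as np

rng = np.random.default_rng(2)
n_inputs = 10_000

# Judge A: states 70% confidence; each draw independently matches the
# true label with probability 0.70 (invented for illustration).
draw_1 = rng.random(n_inputs) < 0.70  # True = verdict matches the label
draw_2 = rng.random(n_inputs) < 0.70
print("A: stated 0.70, observed accuracy", draw_1.mean())           # ~0.70 -> calibrated
print("A: agreement between two draws  ", (draw_1 == draw_2).mean())  # ~0.58 -> unreliable

# Judge B: always emits the same verdict for a given input, so
# draw-to-draw agreement is 1.0 (perfectly reliable), but it states
# 95% confidence while being right only 60% of the time.
print("B: agreement between two draws   1.0")
print("B: stated 0.95, observed accuracy 0.60 -> miscalibrated")
```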
This connects to "Does model confidence predict robustness to prompt changes?": ProSA measures sensitivity to prompt variation, while this framework measures sensitivity to sampling variation. Both reveal that single evaluations are insufficient. The practical implication: any LLM-as-a-Judge deployment that relies on single-shot evaluation with deterministic settings provides the illusion of precision without evidence of reliability.
Source: Evaluations
Related concepts in this collection
- Does model confidence predict robustness to prompt changes?
  Explores whether a model's certainty about its answer determines how much it resists prompt rephrasing and semantic variation. This matters because it could explain why some tasks are harder to evaluate reliably.
  Relation: prompt sensitivity and sampling sensitivity are complementary reliability concerns
- Can LLM judges be fooled by fake credentials and formatting?
  Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.
  Relation: judge unreliability compounds with exploitable biases
- Why do preference models favor surface features over substance?
  Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness, features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
  Relation: calibration failure at the preference model level adds to the reliability problem
Original note title: deterministic LLM settings create fixed randomness, not reliability — a single output remains one draw from the model's probability distribution