Reinforcement Learning for LLMs · Language Understanding and Pragmatics · LLM Reasoning and Architecture

Does setting temperature to zero actually make LLM outputs reliable?

Explores whether deterministic LLM settings that produce consistent outputs also guarantee reliable judgments, and how to measure true reliability beyond surface consistency.

Note · 2026-03-28 · sourced from Evaluations

"Can You Trust LLM Judgments?" (2024) introduces a rigorous framework for evaluating LLM-as-a-Judge reliability using McDonald's omega, revealing that the common practice of using fixed seeds and deterministic settings provides false confidence.

The core argument: even with deterministic settings, a single LLM output is one sample from the model's probability distribution. Setting temperature to zero and fixing the seed produces "fixed randomness" — the same output every time, but that output may still be a misleading draw from the distribution. Consistent replication does not guarantee reliability. A perfectly calibrated LLM that says it's 90% confident should be correct 9 out of 10 times — but even a perfectly calibrated LLM can be unreliable if its distribution has high variance.
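The "fixed randomness" point can be made concrete with a toy distribution (the verdict labels and probabilities below are invented for illustration): greedy, temperature-zero decoding returns the same argmax verdict on every run, while sampling from the very same distribution disagrees with that verdict almost half the time.

```python
import random

# Hypothetical judge: a toy probability distribution over verdicts,
# standing in for an LLM's output distribution on a judgment prompt.
VERDICT_PROBS = {"pass": 0.55, "fail": 0.45}

def greedy_verdict(probs):
    """Temperature-0 decoding: always return the argmax verdict."""
    return max(probs, key=probs.get)

def sampled_verdict(probs, rng):
    """Temperature-1 decoding: one draw from the distribution."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

rng = random.Random(0)

# Greedy is perfectly repeatable across 100 runs ...
greedy_runs = {greedy_verdict(VERDICT_PROBS) for _ in range(100)}

# ... but the distribution it hides disagrees almost half the time.
samples = [sampled_verdict(VERDICT_PROBS, rng) for _ in range(1000)]
fail_rate = samples.count("fail") / len(samples)
print(f"greedy always says: {greedy_runs}; sampled 'fail' rate: {fail_rate:.2f}")
```

The consistent greedy output is not evidence that "pass" is a stable judgment; it only shows that the same draw was replayed.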

The framework: prompt the judgment LLM 100 times with the identical request, so that only the replication varies and every other factor (prompt, model, settings) is held constant. Apply McDonald's omega to assess internal consistency across these replications. This reveals whether the model's judgments are stable properties of the input or artifacts of the sampling process.
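The procedure above can be sketched as follows, with scores arranged as an inputs-by-replications matrix. The omega estimate here uses a principal-component approximation to the one-factor loadings rather than a fitted SEM model (which the paper's analysis would properly use), and the simulated judges are invented for illustration.

```python
import numpy as np

def mcdonalds_omega(data):
    """Approximate McDonald's omega (total) for an inputs x replications
    score matrix. The first principal component of the correlation matrix
    stands in for the one-factor loadings; real analyses fit the factor
    model with an SEM package, so treat this as a rough sketch.
    """
    R = np.corrcoef(data, rowvar=False)                   # replication x replication
    eigvals, eigvecs = np.linalg.eigh(R)                  # ascending order
    lam = np.abs(eigvecs[:, -1]) * np.sqrt(eigvals[-1])   # approximate loadings
    uniq = np.clip(1.0 - lam**2, 0.0, None)               # uniquenesses
    return lam.sum() ** 2 / (lam.sum() ** 2 + uniq.sum())

rng = np.random.default_rng(0)
n_inputs, n_reps = 200, 100

# Simulated reliable judge: scores driven by a stable property of each input.
signal = rng.normal(size=(n_inputs, 1))
reliable = signal + 0.3 * rng.normal(size=(n_inputs, n_reps))

# Simulated unreliable judge: scores are mostly sampling noise.
unreliable = 0.2 * signal + rng.normal(size=(n_inputs, n_reps))

print(f"omega (reliable judge):   {mcdonalds_omega(reliable):.2f}")
print(f"omega (unreliable judge): {mcdonalds_omega(unreliable):.2f}")
```

High omega indicates that the replications largely measure the same underlying property of each input; low omega indicates that much of each score is draw-to-draw noise.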

The distinction between reliability, confidence, and calibration is critical. The three are intertwined but distinct: confidence is the model's self-reported certainty, calibration is whether that confidence matches actual accuracy, and reliability is whether judgments are stable across draws. A model can be well-calibrated (confident when right) but unreliable (different answers on different draws). A model can be reliable (always gives the same answer) but poorly calibrated (that consistent answer is wrong, yet stated with high confidence).
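The two failure modes can be seen side by side in a small simulation with two hypothetical judges (both invented here): one that answers at random on every draw but reports 90% confidence, and one that always gives the same wrong answer with the same stated confidence.

```python
import random

def evaluate_judge(draw_fn, truth, n_draws=50):
    """Measure two different things about a stochastic judge:
    - reliability: how often repeated draws on the same input agree
      with that input's majority answer;
    - calibration gap: |mean stated confidence - actual accuracy|.
    """
    rng = random.Random(0)
    agreements, correct, confidences = [], 0, []
    for x, y in truth:
        draws = [draw_fn(x, rng) for _ in range(n_draws)]
        answers = [a for a, _ in draws]
        majority = max(set(answers), key=answers.count)
        agreements.append(answers.count(majority) / n_draws)
        correct += int(majority == y)
        confidences.extend(c for _, c in draws)
    reliability = sum(agreements) / len(agreements)
    accuracy = correct / len(truth)
    calibration_gap = abs(sum(confidences) / len(confidences) - accuracy)
    return reliability, calibration_gap

# Hypothetical judge A: flips a coin on each draw, always claims 90%.
judge_a = lambda x, rng: (rng.choice([0, 1]), 0.9)
# Hypothetical judge B: always answers 0, always claims 90%.
judge_b = lambda x, rng: (0, 0.9)

truth = [(i, 1) for i in range(20)]  # the true label is always 1
rel_a, gap_a = evaluate_judge(judge_a, truth)  # low reliability
rel_b, gap_b = evaluate_judge(judge_b, truth)  # perfect reliability, large gap
print(f"judge A: reliability={rel_a:.2f}, calibration gap={gap_a:.2f}")
print(f"judge B: reliability={rel_b:.2f}, calibration gap={gap_b:.2f}")
```

Judge B looks ideal under single-shot deterministic evaluation (perfectly repeatable), yet its calibration gap is the larger of the two, which is exactly the failure mode that replication-based reliability analysis is meant to separate out.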

This connects to Does model confidence predict robustness to prompt changes? — ProSA measures sensitivity to prompt variation, while this measures sensitivity to sampling variation. Both reveal that single evaluations are insufficient. The practical implication: any LLM-as-a-Judge deployment that relies on single-shot evaluation with deterministic settings is providing the illusion of precision without evidence of reliability.




Deterministic LLM settings create fixed randomness, not reliability — a single output remains one draw from the model's probability distribution.