Why do semantically identical prompts produce different LLM outputs?
Explores why paraphrases with the same meaning yield different model outputs. This matters because it reveals what LLMs actually respond to during inference—and whether prompt engineering is optimizing meaning or something else.
Cao et al. (2024) showed that prompts with the same meaning can produce very different output quality. Adam's Law isolates frequency as a primary variable in that variance: when paraphrase pairs are matched on meaning but differ in sentence-level corpus frequency, the higher-frequency variant systematically wins. This converts a known phenomenon (prompt sensitivity) from a vague reliability concern into a specific architectural claim about what the model is actually responding to.
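The paper's frequency estimator is not reproduced here, but the comparison is easy to sketch. A minimal version below assumes a causal LM's sequence log-probability is an acceptable proxy for sentence-level pre-training frequency; gpt2 is a stand-in model and the two prompts are illustrative, not taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(text: str) -> float:
    """Total log-probability the LM assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns mean cross-entropy over the
        # n-1 predicted tokens; undo the mean to recover the sum.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

# Two paraphrases matched on meaning but (plausibly) not on frequency.
high_freq = "How do I sort a list in Python?"
low_freq = "By what procedure might one arrange a Python list in order?"

for prompt in (high_freq, low_freq):
    print(f"{sequence_logprob(prompt):9.2f}  {prompt}")
```

Adam's Law predicts that the variant with more statistical mass, here the higher log-probability, systematically yields better downstream outputs.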
The implication for "Does model confidence predict robustness to prompt changes?" is direct but complicating. Confidence-based accounts read prompt sensitivity as model uncertainty fluctuating across surface variations. Adam's Law inserts a deeper variable: even at fixed model confidence, frequency mass differs across paraphrases because pre-training exposure differs, and that exposure asymmetry shapes the prediction independently of how confident the model "feels." Confidence and frequency are entangled, but frequency is the more upstream cause.
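The entanglement can be made concrete by measuring the two quantities on disjoint spans: confidence on the answer tokens, frequency on the prompt tokens. A sketch below reuses the model, tokenizer, sequence_logprob, and prompts from the previous block; answer_confidence and the toy QA pair are illustrative, not the paper's protocol.

```python
def answer_confidence(prompt: str, answer: str) -> float:
    """Mean log-probability of the answer tokens given the prompt.

    Assumes `answer` starts with a space so the prompt/answer token
    boundary is clean under GPT-2's BPE.
    """
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.size(1)
    full = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full).logits
    # Position i predicts token i+1, so answer tokens (indices n_prompt
    # onward in `full`) are scored by logits from position n_prompt-1 on.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full[0, 1:]
    idx = torch.arange(n_prompt - 1, targets.size(0))
    return logprobs[idx, targets[idx]].mean().item()

# Same answer under two paraphrases: confidence can sit close together
# while the prompts' frequency mass differs.
for prompt in (high_freq, low_freq):
    conf = answer_confidence(prompt, " Use the sorted() built-in.")
    freq = sequence_logprob(prompt)
    print(f"confidence={conf:6.2f}  frequency_proxy={freq:9.2f}  {prompt}")
```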
For a Language-as-Event frame, this is load-bearing. A prompt is not a transparent vessel that hands meaning to the model. It is a token sequence whose statistical mass relative to pre-training shapes how the model parses the request before any semantic interpretation occurs. Two synonymous sentences are not the same event. They are two different statistical encounters that happen to share a meaning a human would assign them. The model registers the encounter; meaning is what we read into the registration. This connects to "Can models pass tests while missing the actual grammar?": when surface and meaning compete, surface wins by construction.
A practical corollary: prompt-engineering as a discipline is partly a folk practice of frequency optimization. "Phrase it like a textbook" or "rewrite the prompt the way StackOverflow would phrase it" are intuitive moves toward higher-frequency surface forms. Adam's Law gives that folk practice a name and a mechanism — and a warning, because frequency-tuning a prompt does not improve the model's reasoning; it just moves the request into the model's denser distributional region.
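That folk practice can be written down as a procedure. A sketch, reusing sequence_logprob from above: rank candidate phrasings by mean per-token log-probability and keep the one sitting in the densest distributional region. The candidate strings are illustrative.

```python
def frequency_tune(candidates: list[str]) -> str:
    """Return the candidate with the highest mean per-token log-prob."""
    def per_token_logprob(text: str) -> float:
        n = tokenizer(text, return_tensors="pt").input_ids.size(1)
        return sequence_logprob(text) / max(n - 1, 1)
    return max(candidates, key=per_token_logprob)

print(frequency_tune([
    "Explain recursion.",
    "Explain recursion the way a textbook would.",
    "Kindly elucidate the concept of recursive invocation.",
]))
```

Consistent with the warning above, this relocates the request into denser mass; it does not make the model reason any better.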
Source: "Adam's Law: Textual Frequency Law on Large Language Models" (Natural Language Inference paper)
Related concepts in this collection
- "Does model confidence predict robustness to prompt changes?": Explores whether a model's certainty about its answer determines how much it resists prompt rephrasing and semantic variation. This matters because it could explain why some tasks are harder to evaluate reliably. (Relation: confidence framing complicated by frequency as the deeper variable.)
- "Can models pass tests while missing the actual grammar?": Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures. (Relation: surface dominates when surface and meaning compete.)
Original note title: paraphrase equivalence is a fiction — same-meaning prompts produce different LLM outputs because frequency, not semantics, drives the prediction