Language Understanding and Pragmatics

Why do semantically identical prompts produce different LLM outputs?

Explores why paraphrases with the same meaning yield different model outputs. This matters because it reveals what LLMs actually respond to during inference—and whether prompt engineering is optimizing meaning or something else.

Note · 2026-05-02 · sourced from Natural Language Inference
Why do LLMs fail at understanding what remains unsaid? What grounds language understanding in systems without embodiment?

Cao et al. (2024) showed that prompts with the same meaning can produce very different output quality. Adam's Law isolates frequency as a primary variable in that variance: when paraphrase pairs are matched on meaning but differ in sentence-level corpus frequency, the higher-frequency variant systematically wins. This converts a known phenomenon, prompt sensitivity, from a vague reliability concern into a specific architectural claim about what the model is actually responding to.
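The claim is testable with a simple paired design: for each meaning-matched pair, check whether the higher-frequency variant also gets the higher task-quality score. A minimal sketch, assuming per-variant frequency estimates and quality scores are available (all numbers below are invented for illustration; Cao et al.'s actual protocol may differ):

```python
# Paired test of the frequency claim. Each pair holds two meaning-matched
# paraphrases as (corpus_frequency, task_quality_score) tuples.
# All measurements below are hypothetical.

def frequency_wins(pairs):
    """Fraction of pairs where the higher-frequency paraphrase scores better.

    Pairs tied on frequency or on score are skipped.
    """
    wins = total = 0
    for (freq_a, score_a), (freq_b, score_b) in pairs:
        if freq_a == freq_b or score_a == score_b:
            continue
        total += 1
        high_freq_score = score_a if freq_a > freq_b else score_b
        low_freq_score = score_b if freq_a > freq_b else score_a
        wins += high_freq_score > low_freq_score
    return wins / total if total else 0.0

# Hypothetical (frequency, quality) measurements for three paraphrase pairs.
pairs = [
    ((1200, 0.91), (40, 0.74)),   # frequent phrasing scores higher
    ((15, 0.61), (900, 0.88)),    # frequent phrasing scores higher
    ((300, 0.70), (500, 0.65)),   # a counterexample
]
print(frequency_wins(pairs))  # → 0.6666666666666666
```

Adam's Law predicts this win rate sits well above chance once pairs are matched on meaning.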

The implication for Does model confidence predict robustness to prompt changes? is direct but complicates the picture. Confidence-based accounts read prompt sensitivity as model uncertainty fluctuating across surface variations. Adam's Law inserts a deeper variable: even at fixed model confidence, frequency mass differs across paraphrases because pre-training exposure differs, and that exposure asymmetry shapes the prediction independently of how confident the model "feels." Confidence and frequency are entangled, but frequency is the more upstream cause.
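The distinction can be made concrete with a toy language model: the prompt's log-probability under the model is a proxy for frequency mass, while the entropy of the next-token distribution is a proxy for confidence, and the two are computed from different things. A minimal sketch over an invented bigram corpus (not any real pre-training data):

```python
import math
from collections import Counter, defaultdict

# Toy bigram model over an invented corpus. Prompt log-probability stands in
# for pre-training frequency mass; next-token entropy stands in for model
# confidence. The point: they are separate quantities.
corpus = ("please explain the code . please explain the error . "
          "kindly elucidate the code .").split()

bigrams = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    bigrams[a][b] += 1
VOCAB = sorted(set(corpus))

def logprob(tokens):
    """Sum of log P(next | prev) under the bigram counts, add-one smoothed."""
    total = 0.0
    for a, b in zip(tokens, tokens[1:]):
        c = bigrams[a]
        total += math.log((c[b] + 1) / (sum(c.values()) + len(VOCAB)))
    return total

def next_token_entropy(prev):
    """Entropy (nats) of the smoothed next-token distribution after `prev`."""
    c = bigrams[prev]
    denom = sum(c.values()) + len(VOCAB)
    return -sum((c[w] + 1) / denom * math.log((c[w] + 1) / denom)
                for w in VOCAB)

frequent = "please explain the code".split()
rare = "kindly elucidate the code".split()
print(logprob(frequent) > logprob(rare))  # → True: frequency mass differs
print(next_token_entropy("the"))          # confidence is a separate axis
```

The two paraphrases end at the same context ("the code"), so the model's confidence about what comes next is identical, yet the frequency mass accumulated along the way is not.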

For a Language-as-Event frame, this is load-bearing. A prompt is not a transparent vessel that hands meaning to the model. It is a token sequence whose statistical mass relative to pre-training shapes how the model parses the request before any semantic interpretation occurs. Two synonymous sentences are not the same event. They are two different statistical encounters that happen to share a meaning a human would assign them. The model registers the encounter; meaning is what we read into the registration. This connects to Can models pass tests while missing the actual grammar? — when surface and meaning compete, surface wins by construction.

A practical corollary: prompt-engineering as a discipline is partly a folk practice of frequency optimization. "Phrase it like a textbook" or "rewrite the prompt the way StackOverflow would phrase it" are intuitive moves toward higher-frequency surface forms. Adam's Law gives that folk practice a name and a mechanism — and a warning, because frequency-tuning a prompt does not improve the model's reasoning; it just moves the request into the model's denser distributional region.
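That folk practice can be written down as a reranker. A minimal sketch, assuming some corpus-frequency lookup is available; the count table here is invented for illustration, and a real version might query an n-gram dataset or use LM perplexity as the proxy:

```python
import math

# "Frequency-tuning" as an explicit reranker: among meaning-preserving
# candidate phrasings, pick the one with the densest distributional mass.
word_freq = {  # hypothetical per-million-token counts
    "explain": 950, "this": 8000, "code": 1200,
    "elucidate": 4, "the": 20000, "following": 600, "snippet": 90,
}

def frequency_score(prompt):
    """Mean log-frequency of the prompt's words (unseen words count as 1)."""
    words = prompt.lower().split()
    return sum(math.log(word_freq.get(w, 1)) for w in words) / len(words)

def frequency_tune(candidates):
    """Return the candidate phrasing with the highest frequency proxy."""
    return max(candidates, key=frequency_score)

candidates = ["elucidate the following snippet", "explain this code"]
print(frequency_tune(candidates))  # → explain this code
```

Note what the reranker does not do: it never consults the task, only the surface form, which is exactly the warning above in miniature.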


Source: Natural Language Inference · Paper: Adam's Law: Textual Frequency Law on Large Language Models

paraphrase equivalence is a fiction — same-meaning prompts produce different LLM outputs because frequency, not semantics, drives the prediction