Is paraphrase invariance a reliable assumption when deploying language models in production?
This explores whether you can safely assume an LLM will treat two ways of saying the same thing as equivalent — and the corpus says no, with surprising consistency.
This reads the question as a practical one: if you swap a prompt for a reworded-but-identical-in-meaning version, can you count on the model behaving the same way? The collection's answer is a fairly emphatic no, and the reason is more mechanical than you might guess. The core finding is that LLMs don't respond to meaning so much as to statistical mass from pretraining: of two semantically identical prompts, the one whose phrasing showed up more often in training data systematically wins on output quality (see "Why do semantically identical prompts produce different LLM outputs?"). That effect isn't confined to one task type. The same high-frequency preference appears across math, machine translation, commonsense reasoning, and tool calling, which suggests it's a property of how the model works rather than a quirk of any one domain (see "Do language models really understand meaning or just surface frequency?").
What makes this worth knowing is that the failure is *predictable*, not random. If you frame the model as an autoregressive probability machine, you can forecast in advance which phrasings will do worse: low-probability target responses are harder even when the task is logically trivial (see "Can we predict where language models will fail?"). So 'paraphrase invariance' isn't a property the model has and occasionally loses. It's a property it never really had, and you can often anticipate where it'll break.
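To make the "autoregressive probability machine" framing concrete, here is a minimal sketch that scores a sequence as the sum of its per-token log-probabilities. It assumes a toy bigram model trained on a four-line corpus as a stand-in for pretraining; `train_bigram`, `sequence_logprob`, and `TOY_CORPUS` are hypothetical names for illustration, and this is not how the cited research measured probability.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace tokens (toy stand-in for pretraining)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sequence_logprob(text, unigrams, bigrams, alpha=1.0):
    """Sum of log P(token | previous token), with add-alpha smoothing."""
    vocab_size = len(unigrams) + 1  # +1 reserves some mass for unseen tokens
    tokens = ["<s>"] + text.lower().split()
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, cur)] + alpha
        denominator = unigrams[prev] + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total

# Assumption: a tiny made-up corpus in which the forward order is common
# and the reverse order never appears.
TOY_CORPUS = [
    "a b c d e",
    "a b c d e",
    "a b c",
    "c d e",
]

unigrams, bigrams = train_bigram(TOY_CORPUS)
forward = sequence_logprob("a b c d e", unigrams, bigrams)
backward = sequence_logprob("e d c b a", unigrams, bigrams)
print(f"forward={forward:.2f} backward={backward:.2f}")
```

The reversed sequence contains the same tokens and is logically no harder to state, yet it scores lower because none of its transitions were seen. With a real model, the analogous quantity would come from that model's own token log-probabilities, if the serving stack exposes them.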
There are two more wrinkles that matter for production specifically. First, even holding the prompt fixed, the same input can produce different outputs on regeneration: the model maintains a kind of superposition and samples from it rather than committing (see "Do large language models actually commit to a single character?"). So variance isn't only across paraphrases; it's across runs of the identical prompt. Second, when a paraphrase happens to nudge the prompt toward strong pretraining associations, those parametric priors can override the actual instruction you gave in-context, and plain textual rewording won't fix it (see "Why do language models ignore information in their context?").
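Run-to-run variance on a fixed prompt can be quantified before it surprises you in production. The sketch below assumes a `generate(prompt) -> str` callable wrapping whatever model you deploy; `regeneration_agreement` and `make_stub_model` are hypothetical helpers, and the stub sampler only imitates a model that samples from a distribution over answers.

```python
import random
from collections import Counter

def regeneration_agreement(generate, prompt, n_runs=20):
    """Call the model n_runs times on one prompt and report how often
    the most common output appears (1.0 means fully repeatable)."""
    outputs = [generate(prompt) for _ in range(n_runs)]
    counts = Counter(outputs)
    modal_output, modal_count = counts.most_common(1)[0]
    return modal_output, modal_count / n_runs, counts

def make_stub_model(seed=0):
    """Hypothetical stand-in for a sampling model: ignores the prompt and
    draws one of three answers with fixed weights."""
    rng = random.Random(seed)

    def generate(prompt):
        return rng.choices(["cat", "dog", "fox"], weights=[0.6, 0.3, 0.1])[0]

    return generate

stub = make_stub_model(seed=0)
modal, agreement, counts = regeneration_agreement(stub, "Think of an animal.", n_runs=50)
print(modal, round(agreement, 2), dict(counts))
```

Exact string matching is a deliberately strict choice here. For free-form outputs you would likely compare a normalized or extracted answer instead, which is a design decision this sketch leaves open.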
The deeper reason all of this holds: the model is tracking surface patterns rather than deep structure. It misses syntactic complexity that humans handle easily (see "Why do large language models fail at complex linguistic tasks?"), and it largely cannot recognize when a phrasing is genuinely ambiguous, disambiguating only about a third of cases where humans hit ninety percent (see "Can language models recognize when text is deliberately ambiguous?"). If the model can't reliably tell two readings apart, it certainly can't guarantee two phrasings map to one behavior.
The practical upshot for deployment: don't treat 'users will phrase it differently but mean the same thing' as a safe assumption. Pin prompt templates, test against frequency-varied paraphrases rather than a single canonical wording, and budget for run-to-run variance even on fixed inputs. What most teams overlook is that wording sensitivity is a *lever* as much as a liability: high-frequency phrasings measurably outperform, so prompt phrasing is a tunable quality knob, not just a fairness hazard.
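Testing against paraphrases rather than one canonical wording can start as a small consistency check in a test suite. This is a minimal sketch, assuming a `generate(prompt) -> str` callable; `paraphrase_consistency`, `stub_model`, and the example prompts are hypothetical, and the stub is written to be wording-sensitive on purpose so the check has something to catch.

```python
from itertools import combinations

def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(text.lower().split())

def paraphrase_consistency(generate, paraphrases):
    """Fraction of paraphrase pairs whose normalized outputs match exactly."""
    outputs = {p: normalize(generate(p)) for p in paraphrases}
    pairs = list(combinations(paraphrases, 2))
    if not pairs:
        raise ValueError("need at least two paraphrases")
    agreeing = sum(outputs[a] == outputs[b] for a, b in pairs)
    return agreeing / len(pairs), outputs

def stub_model(prompt):
    # Hypothetical wording-sensitive model: it answers only when the
    # prompt uses the common surface form "2 + 2".
    return "4" if "2 + 2" in prompt else "unsure"

PARAPHRASES = [
    "What is 2 + 2?",
    "Compute 2 + 2.",
    "What do you get when two is added to two?",
]

score, outputs = paraphrase_consistency(stub_model, PARAPHRASES)
print(f"pairwise consistency: {score:.2f}")
for prompt, answer in outputs.items():
    print(f"  {answer!r} <- {prompt}")
```

A score well below 1.0 flags a template that depends on exact wording. Because regeneration adds its own variance, it may be worth running each paraphrase several times before concluding that the wording is the cause.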
Sources (7 notes)
Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.
LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that the model makes no fixed commitment.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.