Do language models really understand meaning or just surface frequency?
Explores whether LLMs comprehend semantic meaning independently of textual frequency, or whether high-frequency paraphrases systematically outperform rare ones even when meaning is identical, across math, translation, and reasoning tasks.
Adam's Law (the Textual Frequency Law, TFL) generalizes a previously local finding into a global property of LLM computation. The earlier NLI work showed that predicates in entailment hypotheses skew higher-frequency than those in premises, and that fine-tuning amplifies rather than dilutes this bias; see Does fine-tuning on NLI teach inference or amplify shortcuts?. Adam's Law extends this across four task families: math reasoning, machine translation across hundreds of language pairs, commonsense reasoning, and agentic tool calling. The constant finding: when meaning is held fixed and only surface form varies, the higher-frequency paraphrase outperforms the lower-frequency one.
The mechanism is straightforward but uncomfortable. Higher-frequency text occurred more often during pre-training, so it sits in a denser, better-modeled region of the distribution. The model's "comprehension" is therefore not meaning-recognition first and surface-decoding second — it is statistical-mass recognition first, with meaning emerging downstream of that recognition. This converges with Can models pass tests while missing the actual grammar?: correct outputs do not certify that meaning is what the model is tracking.
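The density story can be made concrete with a toy proxy. The sketch below scores two meaning-identical prompts by the average log-frequency of their tokens under a made-up count table; the `freq` counts and the `mean_log_freq` helper are illustrative assumptions, not measurements from any real pre-training corpus.

```python
import math

# Hypothetical corpus counts standing in for pre-training statistics;
# a real audit would estimate these from large n-gram corpora.
freq = {
    "what": 900, "is": 950, "the": 1000, "sum": 120, "of": 800,
    "aggregate": 5, "and": 700, "3": 300, "5": 280,
}

def mean_log_freq(prompt, table=freq, floor=1):
    # Average log-frequency of the prompt's tokens: a crude proxy for
    # how much statistical mass the surface form received in pre-training.
    toks = prompt.lower().split()
    return sum(math.log(table.get(t, floor)) for t in toks) / len(toks)

common = "what is the sum of 3 and 5"
rare = "what is the aggregate of 3 and 5"

# Same meaning, different surface frequency: swapping "sum" for
# "aggregate" moves the prompt into a sparser region of the toy table.
print(mean_log_freq(common) > mean_log_freq(rare))  # True
```

The point is not the numbers but the shape of the claim: accuracy tracks which region of the distribution the surface form lands in, before any question of meaning arises.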
The pattern matters because paraphrase invariance is a load-bearing assumption almost everywhere LLMs are deployed. We assume the same prompt, said two ways, will yield the same answer. Adam's Law says no: it will yield the frequency-weighted answer, and the surface form is a covariate of accuracy, not a transparent vehicle for the request. The same pull shadows the output side. Why do different LLMs generate nearly identical outputs? documents convergence in what models say; Adam's Law documents the same convergence in how models comprehend what is said to them. Both endpoints of the prompt-response loop are drawn toward the corpus mean. Frequency is not noise around meaning; it is a substantial fraction of what comprehension means inside a transformer.
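Paraphrase invariance is also cheap to audit directly. The sketch below is a minimal harness, assuming the model is exposed as a plain callable `model(prompt) -> answer`; both `invariance_rate` and `stub_model` are hypothetical names for illustration, not part of any real API.

```python
def invariance_rate(model, paraphrase_pairs):
    """Fraction of meaning-identical prompt pairs that get the same answer."""
    same = sum(model(a) == model(b) for a, b in paraphrase_pairs)
    return same / len(paraphrase_pairs)

# Hypothetical stub standing in for a real LLM: it answers the common
# phrasing and balks at the rare synonym, mimicking the frequency effect.
def stub_model(prompt):
    return "8" if "sum" in prompt else "unsure"

pairs = [
    # same meaning, different surface frequency -> answers diverge
    ("what is the sum of 3 and 5", "what is the aggregate of 3 and 5"),
    # same meaning, same common phrasing -> answers agree
    ("the sum of 3 and 5", "sum of 3 and 5"),
]

print(invariance_rate(stub_model, pairs))  # 0.5 under this stub
```

A rate below 1.0 on pairs a human would call identical is exactly the failure of paraphrase invariance the law predicts.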
Source (natural language inference paper): Adam's Law: Textual Frequency Law on Large Language Models
Related concepts in this collection
- Does fine-tuning on NLI teach inference or amplify shortcuts? When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities. (Relation: the local finding that Adam's Law generalizes.)
- Can models pass tests while missing the actual grammar? Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures. (Relation: same mechanism, surface rather than semantics.)
- Why do different LLMs generate nearly identical outputs? Explores whether diversity in model architectures and training actually produces diverse ideas, or whether shared alignment procedures and training data cause convergence on similar responses. (Relation: output-side counterpart of the same dynamic.)
Original note title: high-frequency phrasing wins — LLMs systematically prefer textually frequent paraphrases over rare ones with the same meaning