Language Understanding and Pragmatics · LLM Reasoning and Architecture

Does high-frequency text homogenize user input before generation?

Does Adam's Law reveal how LLMs flatten distinctive user voices at the parsing stage, not just in output? This matters because it could explain why model accuracy and generic responses emerge from the same mechanism.

Note · 2026-05-02 · sourced from Natural Language Inference
Where exactly does language competence break down in LLMs? Why do LLMs fail at understanding what remains unsaid?

Adam's Law surfaces a tension that earlier homogenization research could not localize. "Why do different LLMs generate nearly identical outputs?" documents output convergence; "How much of the internet is AI-generated now?" tracks that convergence at internet scale; "Do LLMs compress concepts more aggressively than humans do?" describes the representational mechanism. What was missing was an input-side account: how distinct user voices get flattened before the model starts generating.

Adam's Law supplies it. The model prefers high-frequency surface forms at the comprehension stage. Users iteratively rephrase their prompts toward higher quality, which empirically means toward higher frequency, which means toward median register. Distinct prompts — a domain expert's specialized phrasing, a regional dialect, a technical idiolect — get pre-processed by the user's own paraphrasing toward whatever phrasing the model handles best, which is whatever phrasing the corpus contained most. Homogenization happens in the parsing of the request, not just in the generation of the response.
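
A minimal sketch of that loop, under stated assumptions: the frequency table, the scoring proxy, and the paraphrases below are all invented for illustration, not taken from the paper. "Whatever phrasing the model handles best" is approximated here by mean log token frequency, so the user's selection loop mechanically picks the most median phrasing.

```python
# Toy sketch of input-side homogenization. Assumptions (not from the
# paper): a hypothetical unigram frequency table stands in for corpus
# statistics, and "handled best" is proxied by mean log token frequency.
import math

# Hypothetical corpus frequencies (tokens per million), highest for
# the most generic words.
CORPUS_FREQ = {
    "please": 900, "summarize": 300, "this": 5000, "text": 800,
    "render": 40, "a": 6000, "precis": 2, "of": 7000, "the": 9000,
    "passage": 60, "boil": 30, "down": 400, "gist": 15,
}

def frequency_score(prompt: str) -> float:
    """Mean log frequency of a prompt's tokens: a proxy for how close
    the phrasing sits to the corpus's dense region."""
    tokens = prompt.lower().split()
    return sum(math.log(CORPUS_FREQ.get(t, 1)) for t in tokens) / len(tokens)

# Three paraphrases of the same request, from idiolect to median register.
paraphrases = [
    "render a precis of the passage",   # specialist idiolect
    "boil down the gist of this text",  # colloquial voice
    "please summarize this text",       # median register
]

# The rephrasing loop: the user keeps whichever variant the model
# "handles best", i.e. the one scoring closest to the corpus mean.
for p in paraphrases:
    print(f"{frequency_score(p):5.2f}  {p}")
print("selected:", max(paraphrases, key=frequency_score))
```

Running it selects "please summarize this text": the distinctive phrasings score lower on the frequency proxy, so iterated rephrasing drifts toward the median register before the model generates a single token.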

The tension is sharp: the same property that gives LLMs their accuracy on common tasks — strong modeling of dense distributional regions — is the property that filters out distinctiveness on the input side. There is no cheap fix because the mechanism is constitutive of how the model works, not a bug in a post-processing layer. Tokenization-of-intelligence, in this frame, is tokenization toward the corpus mean; the input channel and the output channel both narrow toward the high-frequency centroid. A user with a distinctive voice trying to use the model effectively is in an asymmetric trade: speak distinctively and lose accuracy, or speak in the model's preferred register and lose voice. There is no third option that the architecture provides.
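
The asymmetry can be made concrete with a toy table. The shape of the numbers is assumed, not measured: accuracy is modeled as rising with proximity to the corpus's dense region (a sigmoid of the frequency score from the sketch above), and voice as the distance from that region.

```python
# Toy illustration of the asymmetric trade. Assumption (not from the
# paper): expected accuracy rises toward the high-frequency centroid,
# while "voice" is distance from it. All numbers are invented.
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# (phrasing, frequency score): same proxy as the sketch above.
prompts = [
    ("specialist idiolect", 5.85),
    ("colloquial voice",    6.46),
    ("median register",     6.93),
]

centroid = max(s for _, s in prompts)  # the high-frequency centroid
print(f"{'phrasing':<20}{'accuracy':>10}{'voice':>8}")
for name, score in prompts:
    accuracy = sigmoid(score - 6.0)    # rises toward the centroid
    voice = centroid - score           # falls toward the centroid
    print(f"{name:<20}{accuracy:>10.2f}{voice:>8.2f}")
# No row maximizes both columns: distinctive phrasing costs accuracy,
# median phrasing costs voice.
```

No row wins on both columns, which is the point: under this toy model there is no prompt that preserves voice and accuracy at once, because both quantities are functions of the same distance to the centroid.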


Source: Natural Language Inference · Paper: Adam's Law: Textual Frequency Law on Large Language Models
