What hidden assumptions drive how we build language models?
Large language models rest on two unstated assumptions about language and data. Making those assumptions explicit, and seeing why enactive linguistics rejects them, clarifies what LLMs actually can and cannot do.
"Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency" (Cuffari et al. 2024, arXiv:2407.08790) identifies two implicit assumptions that underpin LLM engineering:
Language completeness — that there exists a "thing" called language that is complete, stable, quantifiable, and available for extraction from traces in the environment. The engineering problem then becomes how to reproduce it artificially.
Data completeness — that all essential characteristics of language can be represented in the datasets used to train the model: everything essential about language use is assumed to be present in the relationships between tokens (the sketch below makes this concrete).
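To see what the data completeness assumption buys the engineer, consider the standard next-token pretraining objective (a textbook formulation, not taken from the paper): a model with parameters $\theta$ is trained to minimize

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \dots, x_{t-1})$$

Everything the model can learn must be recoverable from these conditional distributions over tokens; whatever aspects of language use are not encoded in token-to-token relationships lie outside the objective by construction.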
The enactive view rejects both. Language is not a "thing" to be captured by text data but a practice in which to participate. Every linguistic act is radically incomplete, partial in two senses: (a) it is always produced in response to, or in anticipation of, another's response within a shared ongoing activity; and (b) while it manages tension at one level, it introduces new tensions at others that drive the interaction forward. Language is "more a flowing river" than "a large and growing heap": no matter how large a sample of water you remove from the river, the sample is no longer the river.
This pairs with the paper's distinction between hallucination (a perceptual failure, inapplicable to LLMs, which do not perceive), confabulation (narrative production bearing no relation to reality, also inapplicable), and fabrication (generation of sensible-seeming text from corpus statistics, with no internal mechanism distinguishing true from false outputs). The paper prefers the last term because the same process produces both accurate and inaccurate text, as the sketch below illustrates.
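A minimal decoding sketch (illustrative Python, assumed rather than drawn from the paper) makes the fabrication point concrete: the step that emits each token operates on corpus-derived scores alone, and nothing in it distinguishes a true continuation from a false one.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Sample one token from a language model's output logits.

    Illustrative sketch: `logits` stand in for the scores a trained
    model assigns to each vocabulary item given the context so far.
    Note what is absent: there is no truth predicate and no comparison
    against the world. The identical softmax-and-sample step yields
    accurate and inaccurate continuations alike.
    """
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# The same call serves a factually correct and a fabricated continuation;
# only the corpus statistics behind the logits differ, not the mechanism.
```

This is why "fabrication" fits better than "hallucination": the inaccuracy is not a malfunction of a perceiving system but the ordinary output of the only mechanism the system has.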
The language completeness assumption is the deeper critique because it precedes any debate about model capability: if language itself is not the kind of thing that can be comprehensively modeled, then no amount of scaling closes the gap. This connects to "What makes linguistic agency impossible for language models?" and extends it by naming the specific architectural assumptions that make the gap invisible to engineers.
Original note title: LLM engineering assumes language completeness and data completeness — both rejected by enactive linguistics