Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data
We argue that the language modeling task, because it only uses form as training data, cannot in principle lead to learning of meaning. We take the term language model to refer to any system trained only on the task of string prediction, whether it operates over characters, words or sentences, and sequentially or not. We take (linguistic) meaning to be the relation between a linguistic form and communicative intent.
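For concreteness, one standard instantiation of such a system (by no means the only one covered by this definition) is the autoregressive language model, whose training objective is stated entirely over strings of tokens w_1, ..., w_T:

\[
p_\theta(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} p_\theta(w_t \mid w_{<t}),
\qquad
\hat{\theta} \;=\; \arg\max_{\theta} \sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t}).
\]

Nothing on the right-hand side refers to anything outside the training text itself: the only supervision signal is form.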
….
We start by defining two key terms: We take form to be any observable realization of language: marks on a page, pixels or bytes in a digital representation of text, or movements of the articulators. We take meaning to be the relation between the form and something external to language, in a sense that we will make precise below.
….
When humans use language, we do so for a purpose: We do not talk for the joy of moving our articulators, but in order to achieve some communicative intent. There are many types of communicative intents: they may be to convey some information to the other person; or to ask them to do something; or simply to socialize. We take meaning to be the relation M ⊆ E × I which contains pairs (e, i) of natural language expressions e and the communicative intents i they can be used to evoke. Given this definition of meaning, we can now use understand to refer to the process of retrieving i given e.
Communicative intents are about something that is outside of language.
Linguists distinguish communicative intent from conventional (or standing) meaning (Quine, 1960; Grice, 1968). The conventional meaning of an expression (word, phrase, sentence) is what is constant across all of its possible contexts of use. Conventional meaning is an abstract object that represents the communicative potential of a form, given the linguistic system it is drawn from.
The speaker has a certain communicative intent i, and chooses an expression e with a standing meaning s which is fit to express i in the current communicative situation. Upon hearing e, the listener then reconstructs s and uses their own knowledge of the communicative situation and their hypotheses about the speaker’s state of mind and intention in an attempt to deduce i.
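This two-step picture can be sketched schematically. Writing C ⊆ E × S for the conventional (standing) meaning relation that pairs expressions with their standing meanings (notation introduced here purely for illustration), and M ⊆ E × I for meaning as defined above, production and comprehension are, roughly:

\[
\text{production:}\quad i \longmapsto e \ \text{such that} \ (e, s) \in C \ \text{and} \ s \ \text{is apt for} \ i \ \text{in the current situation},
\]
\[
\text{comprehension:}\quad e \longmapsto s \ \text{with} \ (e, s) \in C, \ \text{then} \ s \longmapsto \hat{\imath} \ \text{with} \ (e, \hat{\imath}) \in M.
\]

The schema makes the asymmetry explicit: recovering s from e draws only on the linguistic system, whereas arriving at the intent additionally requires knowledge of the communicative situation and of the speaker that is not contained in the form e itself.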
The reasoning behind producing meaningful responses must connect the meanings of perceived inputs to information about the world. This in turn means that for a human or a machine to learn a language, they must solve what Harnad (1990) calls the symbol grounding problem. Harnad encapsulates this by pointing to the impossibility of a non-speaker of Chinese learning the meanings of Chinese words from Chinese dictionary definitions alone.
But that is precisely the point we are trying to make: a system that has learned the meaning (semantics) of a programming language knows how to execute code in that language. And a system that has learned the meaning of a human language can do things like answer questions posed in the language about things in the world (or in this case, in pictures).
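To make the programming-language analogy concrete, consider a minimal, purely illustrative sketch (not drawn from any particular system): an evaluator for a toy arithmetic language, written here in Python. A corpus of such expression strings is pure form; the evaluator encodes the semantics that maps each string to something outside the text, namely a value.

    # Illustrative sketch: form vs. meaning for a toy arithmetic language.
    # The strings are form; this evaluator supplies the semantics that maps
    # each form to something outside the text (a number).
    import ast
    import operator

    # Denotations of the operator symbols. This table is precisely the kind of
    # grounding that cannot be read off from distributions over strings alone.
    OPS = {
        ast.Add: operator.add,
        ast.Sub: operator.sub,
        ast.Mult: operator.mul,
        ast.Div: operator.truediv,
    }

    def evaluate(expression):
        """Map a form (an expression string) to its meaning (the value it denotes)."""
        def walk(node):
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if isinstance(node, ast.Constant):   # a numeral denotes itself
                return node.value
            if isinstance(node, ast.BinOp):      # an operator symbol denotes a function
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            raise ValueError(f"unsupported construct: {ast.dump(node)}")
        return walk(ast.parse(expression, mode="eval"))

    if __name__ == "__main__":
        # Two distinct forms with the same meaning; statistics over the strings
        # alone offer no way to establish that they denote the same value.
        print(evaluate("2 * (3 + 4)"))   # 14
        print(evaluate("7 + 7"))         # 14

A system that has only ever seen the expression strings has, by this argument, no route to the denotations recorded in OPS; executing the code is exactly the ability that access to meaning confers.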
Baldwin (1995) and others argue that what is critical for language learning is not just interaction but actually joint attention, i.e. situations where the child and a caregiver are both attending to the same thing and both aware of this fact.
In summary, the process of acquiring a linguistic system, like human communication generally, relies on joint attention and intersubjectivity: the ability to be aware of what another human is attending to and guess what they are intending to communicate. Human children do not learn meaning from form alone and we should not expect machines to do so either.
Our arguments do not apply to such scenarios: reading comprehension datasets include information that goes beyond mere form, in that they specify semantic relations between pieces of text, and thus a sufficiently sophisticated neural model might learn some aspects of meaning when trained on such datasets. It is also conceivable that whatever information a pretrained LM captures might help the downstream task learn meaning, without being meaning itself.
How do we know that incremental progress on today’s tasks will take us to our end goal, whether that is “General Linguistic Intelligence” (Yogatama et al., 2019) or a system that passes the Turing test?
I have suggested a twofold approach: on the one hand, a close scrutiny of GPT-like successor models as actors and speakers on the world stage, i.e. as new subjectivities submitting to and enacting their own transformations of the power logic of the connected world; on the other hand, a scrutiny of the same models as technical artefacts whose parts are made according to certain prerogatives of knowledge and power, i.e. subject to certain theories, strategies, norms and material arrangements. I have also insisted on a dialogue to address the “apparent gulf” between the technical and philosophical approaches, an issue pointed out by Conradie et al. (2022).
This article does not concern itself with these questions. Rather than humanity, this formulation concerns itself with subjectivity; rather than authorship, responsibility; rather than an AI alignment problem, a mutual negotiation; rather than explicit programming, discipline.
Still, one could insist: what guarantee do we have that a subjective AI would align with humans on key non-negotiables (such as matters of life and death)? And there are no proofs on paper. However, there is compelling evidence: firstly, seeding with human expression, as we already do with LLMs, ensures that AI subjectivities will mimic at least some of our behaviours and practices; secondly, we are capable of shaping the disciplinary apparatus and can retain it for as long as we need to; and thirdly, by the time we are through with discipline, we will have negotiated mutually beneficial relations, as well as material checks and balances, which should be a good starting point for future change.