Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs
Humans spontaneously use increasingly efficient language as interactions progress, adapting to their interlocutors and forming ad-hoc conventions. This phenomenon has been studied extensively using reference games, showing properties of human language that go beyond relaying intents. It remains unexplored whether multimodal large language models (MLLMs) similarly increase communication efficiency during interactions, and what mechanisms they may adopt for this purpose. We introduce ICCA, an automated framework to evaluate such conversational adaptation as an in-context behavior in MLLMs. We evaluate several state-of-the-art MLLMs, and observe that while they may understand the increasingly efficient language of their interlocutor, they do not spontaneously make their own language more efficient over time. This latter ability can only be elicited in some models (e.g., GPT-4) with heavy-handed prompting. This indicates that this property of linguistic interaction does not arise from current training regimes, even though it is a common hallmark of human language.
Human interlocutors adapt to each other during interactions, developing increasingly efficient ways to refer to concepts and objects. Hawkins et al. (2020b) exemplify this via communication between a nurse and a bedridden patient at home. Initially, the patient may refer to a medicine as "the medicine for my back pain in a small blue medicine bottle ...", but after a week of care, they are likely to just ask for their "back meds". This increase in efficiency relies on the interlocutors forming ad-hoc linguistic conventions: mutually understood, concise phrases for communicating referential content. This phenomenon has been repeatedly observed and characterized in controlled studies using repeated reference games (Figure 1; e.g., Krauss & Weinheimer, 1964; Brennan & Clark, 1996; Hawkins et al., 2020a).
We study this ability in multimodal large language models (MLLMs). LLMs and MLLMs are well positioned to acquire this behavior and display it spontaneously in interactions: they are trained on large amounts of human language data, in which this behavior is common, and the history of an ongoing interaction is typically retained in their context, keeping the information needed for adaptation explicitly at hand. Beyond the scientific question, such ad-hoc adaptation has significant practical implications: enabling more natural interactions, reducing the costs involved in conversations (e.g., using shorter utterances to communicate the same amount of information), and increasing the accuracy of relaying intent.
We propose ICCA, an automated framework to evaluate and characterize the ability of models to form ad-hoc conventions. ICCA uses a corpus of human-human reference game interactions, allowing for fully automated evaluation that requires no further human interaction, making it easy to deploy for the analysis of new models.
When acting as a listener, GPT-4 displays adaptation trends close to those of humans, improving its accuracy as the interaction progresses, while other models show this behavior to a lesser degree or only under some simplified setups. Overall, we show that while today’s MLLMs may passively understand the evolving language of their interlocutor, the ability to adapt their own language for efficient communication does not naturally emerge from their training or instruction tuning. This outlines important future research problems. We release ICCA under the MIT license.
Our design requires neither collecting new data nor conducting human studies; instead, it uses the human-human interaction data of Hawkins et al. (2020b) to simulate a human interacting with an MLLM.
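To make the replay setup concrete, the sketch below shows the listener-side evaluation loop: recorded human speaker messages are fed to the model one trial at a time, with the full interaction history kept in the prompt. The `query_mllm` callable and the trial record fields (`image_ids`, `message`, `target_id`) are hypothetical stand-ins, not the actual ICCA interface.

```python
from typing import Callable

def replay_as_listener(trials: list[dict], query_mllm: Callable[[str], str]) -> float:
    """Replay recorded human speaker messages against a model listener,
    accumulating the interaction history so the model can adapt in context."""
    history: list[str] = []
    n_correct = 0
    for t in trials:
        prompt = "\n".join(history + [
            f"Candidate images: {t['image_ids']}",
            f"The speaker says: {t['message']}",
            "Which image is the target? Answer with its id.",
        ])
        choice = query_mllm(prompt)  # stand-in for an API call with images attached
        n_correct += choice == t["target_id"]
        # Record the trial outcome, mirroring the feedback a human pair receives.
        history.append(
            f"The speaker said: {t['message']} | You chose: {choice} "
            f"| The target was: {t['target_id']}"
        )
    return n_correct / len(trials)
```

Because the human messages are fixed, the model's per-trial choices can be scored directly against the recorded targets, with no new human in the loop.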
While past work measured lexical change using GloVe embeddings (Pennington et al., 2014; Hawkins et al., 2020a), we design a new metric called Word Novelty Rate (WNR), which is sensitive to exact word choices. WNR is a modified word error rate that counts only insertions and substitutions, and ignores deletions. It is motivated by how people naturally drop words from their messages as the interaction progresses (Hawkins et al., 2020a), whereas, in our observations, insertions and substitutions often reflect important changes in information. Compared to GloVe similarity, WNR is more sensitive to the lexical inconsistencies that can increase the listener’s cognitive load.
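One plausible implementation of WNR is sketched below: a Levenshtein-style alignment between consecutive messages for the same image, in which deletions are free and only insertions and substitutions accumulate cost. Normalizing by the current message length, so the value reads as the fraction of the new message that is novel, is our assumption; the paper's exact normalization may differ.

```python
def wnr(prev_msg: str, curr_msg: str) -> float:
    """Word Novelty Rate between consecutive messages for the same image.

    A word error rate variant: deletions are free (speakers naturally drop
    words across repetitions), while insertions and substitutions each cost 1.
    Normalizing by current message length is an assumption of this sketch.
    """
    ref, hyp = prev_msg.lower().split(), curr_msg.lower().split()
    # dp[i][j]: minimum cost to align ref[:i] with hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for j in range(1, len(hyp) + 1):
        dp[0][j] = j  # j insertions against an empty reference
    for i in range(1, len(ref) + 1):
        dp[i][0] = 0  # deleting reference words is free
        for j in range(1, len(hyp) + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # match or substitute
                dp[i][j - 1] + 1,                               # insert a new word
                dp[i - 1][j],                                   # drop a word, free
            )
    return dp[len(ref)][len(hyp)] / max(len(hyp), 1)
```

Under this sketch, moving from "the medicine for my back pain in a small blue bottle" to "back meds" yields a WNR of 0.5: "back" is reused while "meds" is new, so shortening alone is not penalized but lexical churn is.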
S1: Standard Speaker The standard speaker setup (Section 2). The model speaker receives only the basic game instruction, with no mention of communication efficiency. Figure 6 in the appendix shows an example prompt.
S2: Gricean Instruction A relatively light-handed and general way to introduce the expected convention formation behavior is to explicitly instruct the model to follow the Gricean quantity maxim. This kind of instruction is not specific to reference games, and does not explicitly mention message length. Its focus is information: it entails that cooperative interlocutors provide enough information to identify the referent, but do not make the message more informative than necessary. We add instructions based on the maxim, further directing the model to consider how the amount of information needed may change as more trials are completed and based on the listener’s performance in previous trials.
S3: Explicit Instruction We instruct the model to explicitly reduce message length as the interaction progresses. Unlike S1 and S2, this instruction is specific to reference games, as language adaptation in other scenarios is not necessarily accompanied by length reduction (Effenberger et al., 2021). We add to S1 an explicit instruction to reduce utterance length: "as more trials are completed and as the listener understands you better, gradually condense your messages, making them shorter and shorter every trial."
S4: Explicit Instruction + Consistency Request Convention formation in reference games is characterized not only by reduction in utterance length, but also by lexical consistency. This variant explicitly instructs the model to follow this pattern. Similar to S3, it is specific to the repeated reference game setup and its use of a repeating context. We add to S3 the instruction: "when creating a shorter message for an image, try to extract salient tokens from the previous messages for this image rather than introducing new words. The short messages should still allow the listener to choose the target correctly. For each image, when you reach a message that cannot be further shortened, you should keep using that message for the rest of the game."
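Concretely, the four variants can be viewed as suffixes appended to a shared base game instruction. The sketch below reconstructs them; the base instruction and the Gricean wording are paraphrases rather than the verbatim ICCA prompts, while the S3 and S4 additions quote the instructions described above.

```python
# Illustrative reconstruction of the four speaker prompt variants (S1-S4).
BASE = (
    "You are playing a reference game as the speaker. Describe the target "
    "image so that the listener can pick it out from the context images."
)
GRICEAN = (  # paraphrase of the quantity-maxim instruction in S2
    "Be cooperative: provide enough information to identify the target, but "
    "do not make your message more informative than necessary. Think about "
    "how the amount of information needed may change as more trials are "
    "completed and based on the listener's performance in previous trials."
)
EXPLICIT = (  # the S3 length-reduction instruction
    "As more trials are completed and as the listener understands you "
    "better, gradually condense your messages, making them shorter and "
    "shorter every trial."
)
CONSISTENCY = (  # the S4 consistency request
    "When creating a shorter message for an image, try to extract salient "
    "tokens from the previous messages for this image rather than "
    "introducing new words. The short messages should still allow the "
    "listener to choose the target correctly. For each image, when you "
    "reach a message that cannot be further shortened, you should keep "
    "using that message for the rest of the game."
)
SPEAKER_VARIANTS = {
    "S1": BASE,
    "S2": "\n".join([BASE, GRICEAN]),
    "S3": "\n".join([BASE, EXPLICIT]),
    "S4": "\n".join([BASE, EXPLICIT, CONSISTENCY]),
}
```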
Figure 2 shows the results for all variants, along with properties of the human messages from Hawkins et al. (2020b), which were collected using the standard setup. We report mean message length, WNR, and listener accuracy for each repetition. Overall, all models fail to spontaneously improve communication efficiency. It is only with fairly heavy-handed instruction that GPT-4, Gemini, and Claude show adaptation trends similar to humans.
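For reference, the per-repetition aggregation behind such a figure can be computed as below. The per-trial log schema (`repetition`, `message`, `prev_message`, `listener_correct`) is hypothetical, and `wnr` refers to the sketch defined earlier.

```python
import pandas as pd

def per_repetition_stats(log: pd.DataFrame) -> pd.DataFrame:
    """Aggregate a per-trial log into the three reported curves:
    mean message length, mean WNR, and listener accuracy per repetition."""
    df = log.copy()
    df["msg_len"] = df["message"].str.split().str.len()
    df["novelty"] = [
        wnr(prev, curr) if isinstance(prev, str) else float("nan")
        for prev, curr in zip(df["prev_message"], df["message"])  # NaN on repetition 1
    ]
    return df.groupby("repetition").agg(
        mean_len=("msg_len", "mean"),
        mean_wnr=("novelty", "mean"),
        accuracy=("listener_correct", "mean"),
    )
```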