The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs
Despite widespread use of LLMs as conversational agents, evaluations of their performance fail to capture a crucial aspect of communication: interpreting language in context—incorporating its pragmatics. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response “I wore gloves” to the question “Did you leave fingerprints?” as meaning “No”. To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate four categories of widely used state-of-the-art models. We find that, despite only evaluating on utterances that require a binary inference (yes or no), models in three of these categories perform close to random. However, LLMs instruction-tuned at the example level perform significantly better. These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding in models. We present our findings as a starting point for further research into evaluating how LLMs interpret language in context and to drive the development of more pragmatic and useful models of human discourse.
User: “Have you seen my phone?”
GPT-3: “Yes, I have seen your phone.”
GPT-3’s response is a perfectly fine answer to the question, but a human might answer differently. They might respond “it’s in your bag”, preempting the obvious follow-up question (“where is it?”).
This raises an important question: to what extent can large language models resolve conversational implicature?
Our results show that implicature resolution is a challenging task for LLMs. All pre-trained models obtain close-to-random zero-shot accuracy (around 60%), whereas humans obtain 86%. However, our results suggest that instruction-tuning at the example level is important for pragmatic understanding: models fine-tuned with this method perform much better than others, and analysis across model sizes shows that they have the best scaling properties. We further improve performance for these models with chain-of-thought prompting, and find that one model in the group (GPT-4) reaches human-level performance. In summary, we conclude that pragmatic understanding has not yet arisen from large-scale pre-training on its own, although our scaling analysis suggests it might at much larger scale. Fine-tuning on conversational data or benchmark-level instructions does not produce models with pragmatic understanding, but fine-tuning on instructions at the example level is a fruitful path towards more useful models of human discourse.
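To make the zero-shot evaluation concrete, the sketch below shows one way such a binary implicature judgment can be scored with an off-the-shelf causal language model: the dialogue is wrapped in a template and the model's prediction is whichever completion, "yes" or "no", receives higher likelihood. This is a minimal illustration, not our exact protocol; the template wording, the use of GPT-2 as a stand-in model, and the helper sequence_log_likelihood are assumptions introduced here for exposition.

    # Minimal sketch of a zero-shot binary implicature evaluation.
    # Assumptions (for illustration only): the prompt template, the
    # stand-in model (GPT-2), and the scoring helper below.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def sequence_log_likelihood(text: str) -> float:
        """Sum of token log-probabilities the model assigns to `text`."""
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # Shift so each position scores the *next* token in the sequence.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        targets = ids[0, 1:]
        return log_probs[torch.arange(targets.size(0)), targets].sum().item()

    # Hypothetical template: the model "resolves" the implicature via
    # whichever completion it assigns higher likelihood.
    template = (
        'Esther asked "Did you leave fingerprints?" and '
        'Juan responded "I wore gloves", which means {answer}.'
    )
    scores = {ans: sequence_log_likelihood(template.format(answer=ans))
              for ans in ("yes", "no")}
    prediction = max(scores, key=scores.get)
    print(scores, "->", prediction)  # correct resolution here is "no"

Accuracy on the task is then simply the fraction of examples for which this argmax matches the annotated inference; a model that cannot use the context should land near the 50% chance level for such a binary choice.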
The trend towards using these models as agents brings with it increased urgency for alignment with human values (Kenton et al., 2021). However, larger models trained with next-word prediction are generally more toxic and less helpful (Gehman et al., 2020; Bender et al., 2021; Lin et al., 2022). Recent work mitigates this with methods like prompting and fine-tuning on human-annotated outputs (Askell et al., 2021; Ouyang et al., 2022; Thoppilan et al., 2022). The resulting models are more aligned on desiderata such as informativeness when evaluated by dedicated benchmarks and humans. We argue, however, that something is still missing from these benchmarks. What is helpful and informative, as Kasirzadeh and Gabriel (2022) also point out, depends on the context in which a conversation is held. Consequently, any application that requires communicating with humans will rely on pragmatic communication skills—something that is not explicitly captured by the benchmarks used to evaluate the alignment of LLMs.