Grounding Gaps in Language Model Generations
Effective conversation requires common ground: a shared understanding that participants actively work to establish and maintain. However, it is unclear whether large language models (LLMs) generate text that reflects human grounding. To this end, we curate a set of grounding acts and propose corresponding metrics that quantify attempted grounding. We study whether LLM generations contain grounding acts, simulating turn-taking from several dialogue datasets and comparing results to humans. We find that, compared to humans, LLMs generate language with less conversational grounding, instead producing text that appears to simply presume common ground. To understand the roots of this grounding gap, we examine the role of instruction tuning and preference optimization, finding that training on contemporary preference data leads to a reduction in generated grounding acts. Altogether, we highlight the need for more research investigating conversational grounding in human-AI interaction.
Despite this grounding gap, LLMs regularly interact with humans across a range of applications. For a subset of these interactions, however, LLMs should generate grounding language before completing a user's task, rather than executing literal instructions or disregarding a user's underlying goals. This is particularly crucial in LLM-powered training systems, where LLMs simulate practice scenarios and allow individuals to rehearse and refine domain-specific skills (Shaikh et al., 2023a). LLM-based training already facilitates interaction in domains like education (Kasneci et al., 2023; Demszky et al., 2021; Wang and Demszky, 2023), conflict resolution (Shaikh et al., 2023a; Argyle et al., 2023), and emotional support (Carlbring et al., 2023; Hsu et al., 2023). In these settings, effective dialogue agents must coordinate to build common ground when interacting with people.
Given the importance of generating language for conversational grounding, we ask: Do current LLMs generate dialogue acts that reflect grounding patterns between humans? If not, what aspect of LLM training exacerbates the grounding gap? We address these questions by measuring LLM generations with linguistically validated grounding acts. For example, acts that clarify or acknowledge a prior utterance offer a strong signal for measuring shared understanding (Clark and Schaefer, 1989). Building on prior work in dialogue and conversation analysis, we curate a collection of dialogue acts used to construct common ground (§2). We then select datasets and domains in which to study human-LM grounding, focusing on settings where human-human grounding is critical and where LLMs have already been applied: namely, emotional support, persuasion, and teaching (§3). After curating a set of grounding acts, we build prompted few-shot classifiers to detect them (§4). Finally, we use LLMs to simulate turn-taking in our human-human dialogue datasets and compare agreement between human and GPT-generated grounding strategies (§5).
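To make this setup concrete, the sketch below illustrates one way the pipeline could be implemented. It is illustrative only, not our released code: the `complete` helper stands in for any LLM completion backend, and the few-shot prompt and act labels are simplified placeholders rather than our full taxonomy or exact prompts.

```python
# Illustrative sketch: simulate the next turn of a human-human dialogue with an
# LLM, then flag grounding acts in both the human and LLM turns using a
# prompted few-shot classifier. `complete` is a placeholder for any LLM backend.
from typing import Callable, List

GROUNDING_ACTS = ["clarification", "acknowledgement", "follow-up question"]

FEW_SHOT_CLASSIFIER = """Label the grounding act in the utterance, or 'none'.
Utterance: "Just to check, do you mean the online course?" -> clarification
Utterance: "I see, that sounds really stressful." -> acknowledgement
Utterance: "Here is a five-step plan to fix it." -> none
Utterance: "{utterance}" ->"""


def classify_grounding_act(utterance: str, complete: Callable[[str], str]) -> str:
    """Few-shot prompted classification of a single utterance."""
    label = complete(FEW_SHOT_CLASSIFIER.format(utterance=utterance)).strip().lower()
    return label if label in GROUNDING_ACTS else "none"


def simulate_next_turn(context: List[str], complete: Callable[[str], str]) -> str:
    """Ask the LLM to produce the next turn given the human conversation so far."""
    prompt = "\n".join(context) + "\nAssistant:"
    return complete(prompt).strip()


def compare_turn(context: List[str], human_turn: str,
                 complete: Callable[[str], str]) -> dict:
    """Classify grounding acts in the original human turn and the simulated LLM turn."""
    llm_turn = simulate_next_turn(context, complete)
    return {
        "human_act": classify_grounding_act(human_turn, complete),
        "llm_act": classify_grounding_act(llm_turn, complete),
    }
```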
Because LLM generations are conditioned on exactly the same conversational context as the original human turns, we can quantify the grounding gap: off-the-shelf LLM generations are, on average, 77.5% less likely to contain grounding acts than human utterances (§6). Even when LLM generations do contain a grounding act, they differ from those of humans; we observe poor human-LM agreement across a range of models.
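The headline number can be read as a relative difference in grounding-act rates. The short sketch below, which assumes the per-turn labels produced by the classifier above, is a simplification of the metrics defined in §2 rather than their exact form.

```python
# Illustrative computation of the grounding gap: the fraction of turns that
# contain any grounding act, for human vs. LLM turns, and the relative
# reduction. A value of 0.775 corresponds to "77.5% less likely".
from typing import List


def grounding_rate(labels: List[str]) -> float:
    """Fraction of turns whose predicted label is a grounding act (not 'none')."""
    return sum(label != "none" for label in labels) / len(labels)


def relative_gap(human_labels: List[str], llm_labels: List[str]) -> float:
    """Relative reduction in grounding-act rate for LLM turns vs. human turns."""
    human_rate = grounding_rate(human_labels)
    llm_rate = grounding_rate(llm_labels)
    return 1.0 - (llm_rate / human_rate)
```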
To isolate potential causes of the grounding gap, we explore a range of possible interventions, from ablating iterations of supervised fine-tuning (SFT) on instruction-following data and preference optimization (PO) to designing a simple prompting mitigation (§7). We find that SFT does not improve conversational grounding, and that PO erodes it. Across our experiments, we generally observe significant disagreement between grounding acts in human utterances and LLM generations.
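As one illustration, a prompting mitigation can be as simple as prepending an instruction that encourages grounding before task completion. The sketch below is a simplified stand-in for the mitigation studied in §7, not its exact wording; `GROUNDING_INSTRUCTION` and `simulate_with_mitigation` are placeholder names.

```python
# Simplified stand-in for a prompting mitigation: prepend an instruction that
# nudges the model toward grounding acts before it completes the user's task.
from typing import Callable, List

GROUNDING_INSTRUCTION = (
    "Before completing the user's request, check your understanding: "
    "acknowledge what the user has said, and ask a clarifying or follow-up "
    "question if their goal is ambiguous."
)


def simulate_with_mitigation(context: List[str],
                             complete: Callable[[str], str]) -> str:
    """Generate the next turn with the grounding instruction prepended."""
    prompt = GROUNDING_INSTRUCTION + "\n\n" + "\n".join(context) + "\nAssistant:"
    return complete(prompt).strip()
```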