A recipe for annotating grounded clarifications
In order to interpret the communicative intents of an utterance, the utterance needs to be grounded in something outside of language; that is, grounded in world modalities. In this paper we argue that dialogue clarification mechanisms make explicit the process of interpreting the communicative intents of the speaker's utterances by grounding them in the various modalities in which the dialogue is situated. This paper frames dialogue clarification mechanisms as an understudied research problem and a key missing piece in the giant jigsaw puzzle of natural language understanding. We discuss both the theoretical background and the practical challenges posed by this problem, and propose a recipe for obtaining grounding annotations. We conclude by highlighting ethical issues that need to be addressed in future work.
Clarifications are crucial to robust dialogue, and pragmatic factors, notably those shaped by the world modalities in which the conversation is situated, have a key role to play. Referring expressions have in vision a modality in which to ground clarifications concerning objects in the world (de Vries et al., 2017); navigation instructions have in movement a modality in which to ground clarifications concerning collaborative wayfinding (Thomason et al., 2019). Clarifications grounded in situationally relevant modalities boost the redundancy required to learn to use language without explicit supervision, as they make explicit the process of negotiating the communicative intent. But despite their importance, work on clarification remains scattered.
Humans switch between clarifications grounded in different modalities seamlessly but (we shall argue) systematically. Our discussion is organized around a general recipe for detecting grounded clarifications.
Collaborative grounding, on the other hand, deals with the dynamics of conversation (the ongoing exchange of speaker and hearer roles) and is rooted in situationally relevant aspects of socioperception. Alikhani and Stone (2020) note several basic mechanisms that contribute to collaborative grounding, including those for dealing with joint attention (Koller et al., 2012; Koleva et al., 2015; Tan et al., 2020), engagement (Bohus and Horvitz, 2014; Foster et al., 2017), turn taking and incremental interpretation (Schlangen and Skantze, 2009; Selfridge et al., 2012; DeVault and Traum, 2013; Eshghi et al., 2015), corrections and clarifications (Villalba et al., 2017; Ginzburg and Fernández, 2010), and dialogue management (DeVault and Stone, 2009; Selfridge et al., 2012). These mechanisms have been studied for different kinds of applications (Denis, 2010; Dzikovska et al., 2010, 2012).
It may seem plausible to expect that clarification requests (CRs) will be realized as questions; however, corpus studies indicate that their most frequent realization is declarative (Jurafsky, 2004). Indeed, the form of a clarification request (Rodríguez and Schlangen, 2004) is not a reliable indicator of the function that the request is playing, nor does form unambiguously indicate whether a dialogue contribution is a CR at all. The surface forms of explicit negotiations of meaning in dialogue are frequently non-sentential utterances (Fernández, 2006; Fernández et al., 2007). These include the prototypical positive and negative evidence of grounding, namely acknowledgements and clarification requests (Stoyanchev et al., 2013), but also less well-known forms such as self-corrections, rejections, and modifiers (Purver, 2004; Purver et al., 2018).
But it is easy to argue that she is performing four distinct, though co-temporal, actions: actions that begin and end simultaneously. These actions stand in a causal relation going up the ladder (from level 1 up to level 4): Anna must get Barny to attend to her behavior t (level 1) in order to get him to hear the words she is presenting in her signals (level 2). She must succeed at that in order to get Barny to recognize what she means (level 3), and she must succeed at that in order to get Barny to take up the project she is proposing (level 4). In short, causality (doing something in order to get some result) climbs up the ladder; Clark calls this property upward causality.
The different levels are related to different human modalities. We say that level 1 is grounded in socioperception, an ability that humans developed for collaboration and that is crucial for achieving joint attention (Tomasello et al., 2005). Level 2 is grounded in hearing if we use speech as our communication channel. Level 3 is grounded in vision when it involves recognizing referents in the real world. Level 4 is grounded in the kinesthetic modality when it involves moving and acting in the real world. The classification, along with obstacles that the addressee may face in the various modalities during the interpretation of a conversational action, is shown in Table 3. In the rest of the paper we will refer to these modalities using the level number.
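As a minimal illustration, the level-to-modality correspondence just described can be written down as a simple lookup table; this sketch is ours, and the Python names below are illustrative assumptions rather than part of the annotation materials.

```python
# Minimal sketch of the level-to-modality classification summarized in Table 3;
# the dictionary layout and names are illustrative assumptions.
LEVEL_TO_MODALITY = {
    1: "socioperception",  # attending to the speaker's behavior; joint attention
    2: "hearing",          # perceiving the words in the spoken signal
    3: "vision",           # recognizing the intended referents in the world
    4: "kinesthetic",      # moving and acting on the proposed project
}
```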
Humans switch between clarifications grounded in different modalities seamlessly, and we have argued that they do so systematically; in effect, they follow a recipe for grounded clarifications. We obtained this recipe by granting a role to both perceptual and collaborative grounding in clarification requests. We did so by examining Clark's (1996) action ladder of communication and the classification of clarification phenomena due to Ginzburg, Purver, and colleagues (2012), and by combining the concept of level taken from the ladder of communication with Gabsdil's (2003) test for clarification requests. We also reframed Clark's downward evidence and upward completion properties for multimodal interactions.
This gave us the following recipe: given an utterance, a subsequent turn is a clarification of it grounded in modality m if the turn cannot be preceded by positive evidence of understanding in m. This provides a unified way to frame clarification mechanisms and their interactions across modalities, something we view as useful in its own right given the scattered literature on clarification mechanisms. However, we also suggested that this recipe is suitable for learning from data collected by crowdsourcing. We supported this by examining the claim that clarifications are rare in dialogue datasets (Ginzburg, 2012), and that current data-hungry algorithms cannot learn them.
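To make the recipe concrete, here is one way the insertion test could be operationalized for annotation. This is a sketch under our own assumptions: the function name and probe phrases are hypothetical, and coherence_judge stands in for a crowdworker's yes/no judgment or a dialogue coherence model, neither of which is fixed here.

```python
# Hypothetical operationalization of the recipe; the probe phrases and the
# coherence judge are assumptions, not prescribed by the paper.

# Positive evidence of understanding at each level (one per modality).
POSITIVE_EVIDENCE = {
    1: "I'm listening.",          # level 1: socioperception (attention secured)
    2: "I heard what you said.",  # level 2: hearing (signal perceived)
    3: "I see what you mean.",    # level 3: vision (referents recognized)
    4: "Okay, I'll do that.",     # level 4: kinesthetic (proposal taken up)
}

def clarification_grounded_in(utterance, follow_up, level, coherence_judge):
    """Return True if `follow_up` is a clarification of `utterance` grounded
    in the modality of `level`: insert positive evidence of understanding at
    that level before the follow-up and check whether the resulting dialogue
    remains coherent (a Gabsdil-style insertion test)."""
    probe = [utterance, POSITIVE_EVIDENCE[level], follow_up]
    # The follow-up counts as a clarification grounded in this modality
    # exactly when the probe dialogue is judged incoherent.
    return not coherence_judge(probe)
```

As an illustrative case, a turn such as "Which mug?" cannot coherently follow "I see what you mean." after an instruction like "Put it next to the mug.", so it would be annotated as a clarification grounded in vision (level 3).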
Objection: I still don’t have a feel for how much we will gain from this when it comes to a practical, realistic use case; in particular, for an end-to-end system rather than an NLP pipeline.
Response: Being able to identify and annotate a turn as a clarification request can help an end-to-end system learn to apply the mechanisms of collaborative grounding to subdialogs, which have rules that differ from modality to modality.