Language Understanding and Pragmatics · Conversational AI Systems

Why do clarification requests look different at each communication level?

Explores whether clarifications are a single, unified speech act or distinct mechanisms grounded in different modalities. This matters because dialogue systems that treat clarifications uniformly miss most of them.

Note · 2026-04-18 · sourced from Linguistics, NLP, NLU
Why do AI conversations reliably break down after multiple turns? Where exactly do language models fail at structural language tasks?

"A Recipe for Annotating Grounded Clarifications" (Benotti & Blackburn 2021, arXiv:2104.08964) maps dialogue clarification mechanisms onto Clark's (1996) action ladder of communication, revealing that clarifications are not a uniform speech act but are grounded in distinct modalities at each level:

Causality flows upward through these levels: attention must be achieved to enable signal recognition, signal recognition to enable meaning recognition, and meaning recognition to enable action uptake. A clarification at any level is triggered when positive evidence of understanding at that level is absent.
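
The upward dependency can be made concrete with a minimal Python sketch. The names `Level` and `clarification_level` are hypothetical and not taken from the paper. The sketch also assumes that, because each level depends on the one below it, a clarification is attributed to the lowest level that lacks positive evidence; that attribution rule is this note's reading, not a claim from the source.

```python
from enum import IntEnum
from typing import Mapping, Optional


class Level(IntEnum):
    """Action-ladder levels, annotated with the modality named in this note."""
    ATTENTION = 1  # socioperception
    SIGNAL = 2     # hearing
    MEANING = 3    # vision
    ACTION = 4     # kinesthetics


def clarification_level(evidence: Mapping[Level, bool]) -> Optional[Level]:
    """Return the lowest level with no positive evidence of understanding.

    Assumption (not from the paper): levels missing from `evidence` count
    as lacking positive evidence. Returns None when every level has it.
    """
    for level in Level:  # iterates in definition order: 1, 2, 3, 4
        if not evidence.get(level, False):
            return level
    return None


# Attention and signal are confirmed, meaning is not: the clarification
# is attributed to the meaning level, regardless of the action level.
observed = {Level.ATTENTION: True, Level.SIGNAL: True, Level.MEANING: False}
print(clarification_level(observed))
```

Scanning from the bottom of the ladder mirrors the upward causality: missing evidence at a lower level makes evidence at higher levels moot.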

Key implications:

Humans switch between clarifications grounded in different modalities seamlessly but systematically. The most common realization of a clarification request is the declarative form, not the interrogative, so surface form is an unreliable indicator of clarification function. Current LLM dialogue systems that detect clarification needs by detecting questions therefore miss many clarifications, including the most common kind.

The paper's formal recipe — "a subsequent turn is a clarification grounded in modality m if it cannot be preceded by positive evidence of understanding in m" — provides a testable criterion that could inform dialogue system design.
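
As a rough illustration of how the recipe could be operationalized, the Python sketch below treats it as a predicate over annotator judgements. Everything here is hypothetical: `AnnotatedTurn`, `coherent_after_evidence`, and `grounded_modalities` are illustrative names, the example turn is invented, and the paper does not prescribe this data layout. The judgement itself (whether a turn could coherently follow positive evidence in a modality) is assumed to come from a human annotator, not to be computed.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Modalities as named in this note, ordered by ladder level 1 to 4.
MODALITIES = ("socioperception", "hearing", "vision", "kinesthetics")


@dataclass
class AnnotatedTurn:
    text: str
    # Annotator judgement per modality: could this turn coherently follow
    # positive evidence of understanding in that modality?
    coherent_after_evidence: Dict[str, bool] = field(default_factory=dict)


def grounded_modalities(turn: AnnotatedTurn) -> List[str]:
    """Apply the recipe: the turn is a clarification grounded in modality m
    if it cannot be preceded by positive evidence of understanding in m.

    Assumption: modalities the annotator did not judge are treated as
    non-clarifications (default True).
    """
    return [
        m for m in MODALITIES
        if not turn.coherent_after_evidence.get(m, True)
    ]


# Invented example: a declarative turn that would be incoherent right
# after visual confirmation, so it is a vision-grounded clarification.
turn = AnnotatedTurn(
    text="I don't see a red button here.",
    coherent_after_evidence={"hearing": True, "vision": False},
)
print(grounded_modalities(turn))
```

The predicate never inspects the surface form of `text`, which is consistent with the observation that declarative turns can function as clarifications.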

This extends "Why do language models skip the calibration step?" by specifying that repair itself is multimodal and hierarchical, not a single mechanism. It also connects to "Do language models actually build shared understanding in conversation?": LLMs cannot ground clarifications at levels 1, 3, or 4 because they lack the relevant modalities (socioperception, vision, kinesthetics), leaving only level 2 (text-as-signal) available.

Original note title

clarification mechanisms are grounded in distinct modalities that follow Clark's action ladder: socioperception, hearing, vision, and kinesthetics