Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

Paper · arXiv 2401.14717 · Published January 26, 2024

We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combining LLMs with acoustic models for more natural and conversational interaction between humans and speech-enabled AI agents.

As one of its most basic capabilities, a speech-enabled conversational system should be able to determine when to take turns naturally and with minimal latency in a dialogue with the user, without the need for push-to-talk or wakewords. One common solution for turn-taking is to trigger the system’s response once the user has been silent for longer than a predefined threshold [5, 6, 7, 8]. However, this threshold-based method may result in a suboptimal user experience due to a lack of naturalness [9, 10]. Another behavior that is important for managing human-human conversations, and that remains a challenge for present-day conversational systems, is backchanneling [11, 12]. Backchannels are defined as short utterances, such as “uh-huh”, “oh no” and “right”, that express acknowledgment or reactions on the part of the listener without signaling an intent to take a turn. They typically occur during the current speaker’s turn and do not necessarily trigger turn-taking [13, 14, 11].
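To make the threshold-based baseline concrete, the following is a minimal sketch of silence-threshold endpointing, not the method proposed in this paper. It assumes that a voice activity detector has already labeled each audio frame as speech or non-speech; the function name `first_endpoint`, the frame length, and the default threshold value are all hypothetical choices made for illustration.

```python
from typing import Optional, Sequence


def first_endpoint(
    frame_is_speech: Sequence[bool],
    frame_ms: int = 10,                # assumed frame length (illustrative)
    silence_threshold_ms: int = 700,   # assumed threshold (illustrative)
) -> Optional[int]:
    """Return the index of the frame at which a threshold-based system
    would take the turn, or None if the threshold is never reached."""
    # Number of consecutive silent frames needed to trigger a response.
    frames_needed = silence_threshold_ms // frame_ms
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            # Only count silence that follows user speech, so leading
            # silence before the user starts talking is ignored.
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None


# A user turn containing a short mid-turn pause, using 100 ms frames:
# 5 speech frames, a 3-frame pause, 4 more speech frames, then silence.
frames = [True] * 5 + [False] * 3 + [True] * 4 + [False] * 10
```

With 100 ms frames, a 500 ms threshold waits through the mid-turn pause and fires at frame 16, half a second after the user has actually finished. Lowering the threshold to 300 ms fires at frame 7, cutting in during the pause. This trade-off between latency and interruptions is one way to see why a fixed threshold can feel unnatural.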

Going back to conversation analysis in linguistic pragmatics [15], there is a long history of descriptive and computational research trying to capture turn-taking and backchanneling cues in multiple modalities. In the acoustic domain, prosodic features such as duration, pitch, voice quality and intensity have been shown to have high correlation with turn-taking and backchannel locations [16, 11, 17, 9].