Proactive Conversational Agents with Inner Thoughts
In this paper, we demonstrate the limitations of such methods and rethink what it means for AI to be proactive in multi-party, human-AI conversations. We propose that, just as humans do, a proactive AI should formulate its own inner thoughts during a conversation and seek the right moment to contribute, rather than merely reacting to turn-taking cues.
Through a formative study with 24 participants and inspiration from linguistics and cognitive psychology, we introduce the Inner Thoughts framework. Our framework equips AI with a continuous, covert train of thoughts that runs in parallel with the overt communication process, enabling it to engage proactively by modeling its intrinsic motivation to express these thoughts. We instantiated this framework in two real-time systems: an AI playground web app and a chatbot. In a technical evaluation and user studies with human participants, our framework significantly surpassed existing baselines on aspects such as anthropomorphism, coherence, intelligence, and turn-taking appropriateness.
1 Introduction
Recent advances in Large Language Models (LLMs) have demonstrated their ability to generate high-quality text in response to human input, finding application in areas ranging from Q&A systems to writing assistants. Yet most current LLM-based systems treat the AI as a passive respondent that replies only to explicit human prompts. Imagine a scenario where people are planning a trip with an AI agent: they must constantly prompt the AI, which passively waits for instructions instead of actively contributing. At the other end of the spectrum, systems like GitHub Copilot tend to overcompensate, offering constant suggestions that can overwhelm users.
Neither extreme — AI that is only reactive nor AI that is always responding — is ideal. In the context of conversations, a proactive AI agent should be able to autonomously participate at socially appropriate moments, providing relevant input without requiring explicit cues. This is particularly challenging in multi-party conversations. In dyadic human-AI interactions (e.g., with Siri), systems often predict turn-taking from speech features such as pauses or stop words, and the next turn is automatically allocated to the other party [14, 55]. In multi-party settings, however, these cues can be ambiguous, and more than one potential speaker may take the floor. Repeatedly prompting the AI during group interactions can also become cumbersome and disrupt the natural flow of the conversation, as illustrated in the trip-planning example.
Previous systems typically first predict the next speaker (i.e., turn-taking prediction) and then generate the next response based on conversational and contextual information. For instance, some approaches rely on the last few turns of the conversation to predict the subsequent speaker [15, 20, 63], while others utilize multimodal cues such as eye gaze and non-verbal signals [7–9]. Despite these efforts, turn-taking prediction still falls short: these methods struggle to beat the simple “repeat last” baseline strategy in social conversation contexts [15, 63]. Our formative evaluation (Table 1) also shows that fine-tuned LLMs predict the next speaker no better than random guessing unless the next speaker is explicitly allocated (e.g., “What do you think, Alice?”). In addition, after determining the next speaker, existing works tend to use predefined speaker personas [68, 71] as additional input to guide response generation, or to expand the persona with commonsense knowledge [32]. However, these additional inputs and profiles remain fixed and static throughout the conversation, rather than evolving over time as humans’ mental states do.
We suggest an alternative, reversed perspective on AI proactivity. Consider how we chat about what we did over the weekend: as we listen to others speak, we process their words, reflect on our own experiences, and develop an internal train of thoughts — cognitive psychologists describe this as the distinction between covert responses (internal thoughts and feelings) and overt responses (verbal utterances or gestures) in the human communication process [19, 51]. Then, at some point, we may feel a strong urge to share our thoughts. This might happen when we seek clarification, or when someone mentions an activity we also participated in, sparking our desire to contribute. With this intention in mind, we then look for a socially appropriate moment to participate.
In this paper, we propose a new approach to proactive AI in the context of multi-party, text-based conversations: rather than simply predicting conversational turns, we explore proactive AI driven by its own internal “thoughts”. We introduce the Inner Thoughts framework. Inspired by cognitive architectures and LLM prompting techniques, this framework comprises five stages: trigger, retrieval, thought formation, evaluation, and participation, which enable AI to continuously generate a train of thoughts in parallel with an ongoing conversation, utilizing both long-term and working memory. The AI participant then determines whether to engage in the conversation based on an evaluation of its intrinsic motivation to express a particular thought at that moment.
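To make the pipeline concrete, the following is a minimal sketch, in Python, of how the five stages could be wired together as a covert loop running alongside the overt conversation. All class, method, and helper names (e.g., InnerThoughtsAgent, call_llm), the keyword-overlap retrieval, and the numeric motivation threshold are illustrative assumptions, not the implementation described in this paper.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM request; a real system would call a language model here."""
    return "stub response"


@dataclass
class InnerThoughtsAgent:
    persona: str
    threshold: float = 3.5                                  # assumed minimum motivation to speak
    working_memory: List[str] = field(default_factory=list)    # recent thoughts and turns
    long_term_memory: List[str] = field(default_factory=list)  # persistent experiences

    def on_trigger(self, conversation: List[str]) -> Optional[str]:
        """Stage 1 (trigger): a new utterance or a pause starts one covert thinking cycle."""
        context = self.retrieve(conversation)                # Stage 2: retrieval
        thought = self.form_thought(conversation, context)   # Stage 3: thought formation
        motivation = self.evaluate(thought, conversation)    # Stage 4: evaluation
        self.working_memory.append(thought)
        if motivation >= self.threshold:                     # Stage 5: participation
            return call_llm(f"As {self.persona}, turn this thought into an utterance: {thought}")
        return None  # the thought stays covert

    def retrieve(self, conversation: List[str]) -> List[str]:
        """Select memory items relevant to recent turns (naive keyword overlap for illustration)."""
        recent = " ".join(conversation[-3:]).lower()
        memories = self.long_term_memory + self.working_memory
        return [m for m in memories if any(word in recent for word in m.lower().split())]

    def form_thought(self, conversation: List[str], context: List[str]) -> str:
        return call_llm(
            f"Persona: {self.persona}\nRelevant memories: {context}\n"
            f"Conversation: {conversation}\nWrite one inner thought."
        )

    def evaluate(self, thought: str, conversation: List[str]) -> float:
        """Rate intrinsic motivation (1-5) to voice this thought; stubbed as a random score here."""
        _ = call_llm(f"Rate 1-5 how motivated you are to say: {thought!r}")
        return random.uniform(1.0, 5.0)
```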
To model intrinsic motivation, we conducted a think-aloud study with 24 participants, each of whom took part in four synchronous, text-based online group chats. Using an affinity-diagramming approach, we organized and analyzed the interview notes and derived 10 high-level themes on how individuals decide to engage in conversations. These heuristics were then formalized into automatic evaluation criteria (e.g., relevance, information gap) that AIs use to quantitatively rate their intrinsic motivation to participate.
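As one illustration of how such criteria could be operationalized, the sketch below asks an LLM to score a candidate thought on each criterion and averages the scores into a single motivation value. Only "relevance" and "information gap" are criteria named above; the "timeliness" criterion, the 1-5 scale, the equal-weight average, and the 3.5 threshold are assumptions for illustration, not the paper's exact rubric.

```python
# Hedged sketch: combining heuristic criteria into one intrinsic-motivation score.
CRITERIA = {
    "relevance": "How relevant is this thought to the current topic?",
    "information_gap": "Does this thought add information the other participants lack?",
    "timeliness": "Is this an appropriate moment to raise this thought?",  # assumed criterion
}


def score_motivation(thought: str, conversation: list, ask_llm) -> float:
    """Average the 1-5 criterion scores into a single motivation value."""
    scores = []
    for criterion, question in CRITERIA.items():
        prompt = (
            f"Conversation so far: {conversation}\n"
            f"Candidate thought: {thought}\n"
            f"{question} Answer with a single integer from 1 to 5."
        )
        scores.append(float(ask_llm(prompt).strip()))
    return sum(scores) / len(scores)


# Example usage with a stubbed LLM that always answers "3":
motivation = score_motivation(
    "I also went hiking last weekend",
    ["Alice: I went hiking on Saturday."],
    ask_llm=lambda prompt: "3",
)
should_speak = motivation >= 3.5  # assumed participation threshold
```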
We implemented our framework as two systems: a multi-agent playground web app and a chatbot. Our technical evaluation shows that conversational agents driven by Inner Thoughts significantly outperformed a next-speaker-prediction-plus-persona baseline across all seven evaluation metrics: turn appropriateness, coherence, anthropomorphism, perceived engagement, intelligence, initiative, and adaptability. Participants preferred the Inner Thoughts approach over 82% of the time, noting more natural turn-taking and contextually aware contributions, whereas the baseline was less preferred for its mechanical and disjointed responses.
2.1 Proactive AI and Conversational Agents
Proactive AI dates back to early work on mixed-initiative interaction [1, 29]. In contrast to AI that only passively responds to human queries, mixed-initiative interaction envisions agents that autonomously understand when to take which action, such as the LookOut system [29], which automatically identifies relevant dates and events in emails and then proactively suggests them to users as calendar events. In 1996, Rhodes et al. [48] introduced one of the pioneering systems that continuously supplies relevant information by observing human activities. Andolina et al. [3] developed SearchBot, which unobtrusively [2] offers ongoing suggestions of related documents and entities during voice interactions. While proactivity is a recurring theme in conversational AI research, most proactive conversational AIs focus on task-oriented contexts [23, 36, 37], with the aim of helping users achieve specific objectives. Social conversations, which can range over open topics without any goal to complete, are rarely addressed. In addition, past research tends to focus on generating proactive text responses that help lead and guide the conversation [16], for example, “learning to ask” [6, 15, 47, 61], understanding and initiating topic shifts [39, 57, 67], and planning the future conversation [33, 42, 59].
In this paper, we focus on investigating how to enable AI to proactively engage in multi-party conversations: how the AI can determine the appropriate moments to speak and what contributions to make. We also choose to investigate social conversations, where, unlike in task-oriented dialogue, the objectives are often ambiguous and the actions required of the AI are not clearly defined.
2.2 Turn-Taking in Multi-Party Conversations
For a conversational agent to engage proactively, it must understand and manage turn-taking, deciding who should speak at the end of each turn. Modeling turn-taking remains an area of active research. Existing approaches often employ explicit mechanisms, such as a “send” button [65], push-to-talk [27, 60], and wake words (e.g., “Hey Siri”) [34]. However, such explicit cues can be perceived as less conversational by users [66]. Mainstream conversational AI systems also use silence to detect the end of a user’s turn; however, studies show that pauses within turns are typically longer than gaps between turns in human conversations [11, 58], making silence an unreliable cue for turn-taking. More importantly, this method does not generalize to multi-party conversations. In dyadic interaction, it is always clear who is supposed to speak next when the turn is yielded [55]. In the multi-party case, this becomes more ambiguous, since more than one potential speaker might take the turn.
Beyond explicit mechanisms, machine learning researchers have proposed data-driven methods to manage turn-taking in these conversations, primarily leveraging conversation history to predict the next speaker (i.e., the next-speaker prediction task) [15, 20, 63]. However, these methods have shown limited success; notably, they often fail to outperform the simple “repeat last” baseline strategy in social conversation contexts [15, 63]. In addition to textual data, research in HCI and HRI has leveraged other contextual, non-verbal information and “turn-taking cues”, for instance, eye gaze (e.g., looking at the addressee) [45, 46], breathing (e.g., inhaling and exhaling) [31, 41], prosody (e.g., rising or falling pitch) [17, 18, 22, 40], and the status of the human user (e.g., passing by, stopping) [7–9], to decide whether an AI should engage at a given moment of the conversation.
Previous approaches to mediating turn-taking have often relied on conversation history and contextual information, and typically treat the AI as a reactive agent. Inspired by human behavior, our Inner Thoughts framework takes a different perspective by modeling the AI’s intrinsic motivation to speak.
2.3 Language, Thought, and LLM Agents
Recent advances in large language models (LLMs) have incorporated intermediate reasoning steps to enhance performance on complex tasks, such as Chain-of-Thought (CoT) prompting [64], whereby LLMs think step by step to break larger problems down into reasoning steps, and Tree of Thoughts (ToT) [69], whereby LLMs explore multiple possibilities at each reasoning stage. In addition, self-reflection mechanisms can iteratively improve a model’s reasoning. ReAct [70], for example, synergizes reasoning with action-taking by having the model alternate between generating reasoning traces and performing task-specific actions. Reflexion [54] builds on this by equipping models with dynamic memory and self-criticism capabilities, allowing them to refine future actions based on past performance. Expanding on this, Generative Agents [44] simulate human-like behavior by combining memory, planning, and reflection. OpenAI’s recent o1 preview [43] introduces another perspective on reasoning transparency by explicitly surfacing intermediate reasoning steps to make the AI’s decision-making process more interpretable to users.
The Inner Thoughts framework we propose diverges from these approaches by simulating an ongoing, parallel stream of internal thoughts that mirrors human covert responses. Unlike methods such as CoT, ToT, or the OpenAI o1 preview, which emphasize externalizing intermediate steps for reasoning tasks, Inner Thoughts explores leveraging these covert thoughts to equip AIs with the ability to self-initiate actions and engage proactively.