Insert-expansions For Tool-enabled Conversational Agents

Paper · arXiv 2307.01644 · Published July 4, 2023

“This paper delves into an advanced implementation of Chain-of-Thought-Prompting in Large Language Models, focusing on the use of tools (or "plug-ins") within the explicit reasoning paths generated by this prompting method. We find that tool-enabled conversational agents often become sidetracked, as additional context from tools like search engines or calculators diverts from original user intents. To address this, we explore a concept wherein the user becomes the tool, providing necessary details and refining their requests. Through Conversation Analysis, we characterize this interaction as insert-expansion—an intermediary conversation designed to facilitate the preferred response. We explore possibilities arising from this 'user-as-a-tool' approach in two empirical studies using direct comparison, and find benefits in the recommendation domain.”

These developments were kick-started by prompting models to think through their assigned tasks step-by-step [10]. Reducing complex tasks into simpler ones has been advocated at least since Cartesius [11]. By chaining simple tasks together, i.e., by sequentially feeding outputs from previous steps back into the model, language models can solve more complex problems more reliably [12]. In this way, a language model can be called multiple times until a final answer is returned. Humans can think using writing [13], and language models imitate reasoning by generating written text step by step. Since this paradigm allows intermediate steps, the idea has arisen to insert calls to tools, such as search engines, calculators, or Python functions [14]. The language model thereby simulates the behavior of a computer user and can incorporate information accessible via these tools into its final answer. In this way, tool-enabled language models depart from simple function approximators and become so-called augmented language models [15], with the core human capabilities of reasoning and tool-use [16] as the exemplar to be imitated.
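To make this chaining concrete, the following is a minimal sketch; `call_llm` is a hypothetical stand-in for any text-completion API, and the "Final answer:" stopping convention is our own illustration, not the paper's:

```python
# Minimal sketch of chained model calls: each step's output is appended to
# the prompt of the next call until the model signals a final answer.
# `call_llm` is a hypothetical stand-in for a real text-completion API.

def solve_step_by_step(question: str, call_llm, max_steps: int = 5) -> str:
    context = f"Question: {question}\nLet's think step by step.\n"
    for _ in range(max_steps):
        step = call_llm(context)           # one reasoning step at a time
        context += step + "\n"
        if "Final answer:" in step:        # the model signals completion
            return step.split("Final answer:", 1)[1].strip()
    # Out of budget: force the model to commit to an answer.
    return call_llm(context + "Final answer:").strip()
```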

With social exemplars as a main source of inspiration, chat models have been trained to mimic human speech patterns [8]. Now that reasoning itself is imitated, less effort needs to go into approximating these surface patterns, since many of them result from human reasoning and planning during dialog [17]. This means that natural speech patterns may emerge as a side-effect of more closely imitated reasoning paths.

Augmented language models that emulate reasoning are still meant to provide answers, and even impressive reasoning paths will not lead to user satisfaction if they stay hidden, keep the user waiting, and grow long-winded. If an answer cannot be given after one or a few tool uses, augmented language models produce intermediate observations that diverge ever further from the initial query, since each next step is prompted by, i.e., conditioned on, the output of the previous steps. In this way, they tend to become sidetracked, and the primary aim, providing a satisfactory answer to the user, may be lost.

However, humans sometimes think about what they say. They raise and then fulfill or deny each other's expectations; and instead of silently working out a final answer, humans often probe the interlocutor, e.g., by checking their understanding, scoping the final answer, or enhancing its appeal. This interactive nature has been extensively studied and formally described using conversation analysis [17]. If an appropriate response cannot be given immediately, human speakers tend to insert a new pair of utterances into the conversation, meant to bridge the remaining gap. For example, if someone wants to sell you a souvenir, you will insert a question about its price before deciding on your final answer. This pattern often relies on explicit reasoning carried out between dialog utterances. Augmented language models already talk to themselves and to tools, and recent developments insert intermediate steps that directly ask users to provide context or to check formatting for tool inputs. This is a potentially powerful mechanism for regular chatbot interaction. For one, it may help avoid sidetracking in tool-enabled conversational agents, because dialog makes it easier to establish common ground between interlocutors, even if one of them is a chatbot [18]. Furthermore, it replicates exactly the discussed feature of human talk-in-interaction, namely probing interlocutors in order to fulfill, or reshape, the expectations raised by the initial main utterance.
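One plausible way to realize this "user-as-a-tool" idea is to register the user alongside the ordinary tools, so the agent can open a clarifying sub-dialog instead of reasoning further away from the original intent. The sketch below uses illustrative names (`ask_user`, `plan_next_action`) that are our assumptions, not the paper's implementation:

```python
# Sketch of "user-as-a-tool": the user is registered next to ordinary
# tools, so the agent can open an insert-expansion (a clarifying
# sub-dialog) rather than drift further from the original intent.
# All names here are illustrative, not taken from the paper.

def ask_user(question: str) -> str:
    """Insert-expansion: probe the interlocutor instead of guessing."""
    return input(f"Agent asks: {question}\nYou: ")

def search(query: str) -> str:
    """Stub standing in for a real search tool."""
    return f"(stub) results for {query!r}"

TOOLS = {"ask_user": ask_user, "search": search}

def agent_turn(user_request: str, plan_next_action) -> str:
    """Run one agent turn. `plan_next_action` is a hypothetical LLM-backed
    policy returning either ("answer", text) or ("tool", name, argument)."""
    history = [("user", user_request)]
    while True:
        action = plan_next_action(history)
        if action[0] == "answer":
            return action[1]
        _, tool_name, argument = action
        # Calling `ask_user` here is exactly the insert-expansion move.
        history.append((tool_name, TOOLS[tool_name](argument)))
```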

In this paper, we therefore discuss how insert-expansions may be used in tool-enabled conversational agents, and how their impact may be studied. To this end, we present a paradigm of direct comparison, as well as data from one pilot and two empirical studies based on it.

Insert-expansions are one of several distinct building blocks of natural dialog, or talk-in-interaction, which are grouped not by topic but by what is being done with the utterances belonging to each block [17]. Talk-in-interaction, such as with a conversational agent, is successful if every raised utterance receives one of four responses: the nominally preferred response, such as agreement to an invitation; the dispreferred response, often a rejection; a temporizing response like "I may be there"; or a blocking response, such as "I have plans already" or a counter in the sense of "What about you?". This base pair of utterances is generally known as an "adjacency pair".

In natural dialog, such adjacency pairs are often extended by other pairs inserted before, between, or after the base pair, which determines the main action of a sequence. Pre-extensions either generically draw attention or gauge interest in the action performed with the intended base pair. If drawing attention fails or interest is low, the intended second base pair part may never be uttered. Multiple exchanges can occur before the first part of a base pair. A special kind of pre-extension is the pre-pre-extension, e.g., "Can I ask you something?" - "Yes?", which always precedes other pre-extensions, such as "You now know something about dialogues, don't you?" - "I guess I do".

Normally, while pre-extensions are used by the initiator, insert-expansions are used by the receiver, except in multi-turn inserts. They are meant to move the conversation from the raised expectation to its fulfillment. So-called post-first insert-expansions serve to recover from misunderstandings and involve acknowledgement of the repair or a restatement of the first part of the base pair; here, we can check understanding, clarify intent, or gather information. Pre-second inserts, on the other hand, ask for information required to choose between the four options for a second base pair part, for example the time and day for an invitation. This may include scoping the response but also enhancing its appeal (or at least managing expectations).

After the second pair part has been uttered, a subdialog may continue with a follow-up. Very often, this happens minimally with so-called sequence-closing thirds, such as "Great". Others are wordier, for example when receivers tack on qualifications like "Just as friends, right?".
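As a reading aid, the taxonomy above can be condensed into a small data model. This sketch is our own summary of Schegloff's categories, not a formalization from the paper:

```python
# Illustrative data model of sequence organization: a base adjacency pair
# plus optional insert-expansions between its two parts.

from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

class SecondPairPart(Enum):
    PREFERRED = "preferred"        # e.g. accepting an invitation
    DISPREFERRED = "dispreferred"  # e.g. rejecting it
    TEMPORIZING = "temporizing"    # e.g. "I may be there"
    BLOCKING = "blocking"          # e.g. "I have plans already"

class InsertKind(Enum):
    POST_FIRST = "post-first"      # repair: recover from misunderstanding
    PRE_SECOND = "pre-second"      # gather what the response requires

@dataclass
class Utterance:
    speaker: str
    text: str

@dataclass
class InsertExpansion:
    kind: InsertKind
    pair: Tuple[Utterance, Utterance]

@dataclass
class Sequence:
    first: Utterance                              # raises the expectation
    inserts: List[InsertExpansion] = field(default_factory=list)
    second: Optional[Utterance] = None            # fulfills it (or not)
    response_type: Optional[SecondPairPart] = None

# Example: a pre-second insert scoping an invitation before answering.
seq = Sequence(first=Utterance("A", "Want to come over?"))
seq.inserts.append(InsertExpansion(
    InsertKind.PRE_SECOND,
    (Utterance("B", "Which day?"), Utterance("A", "Saturday.")),
))
seq.second = Utterance("B", "Sure, I'll be there.")
seq.response_type = SecondPairPart.PREFERRED
```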

This description of sequence organization based on Schegloff’s work on Conversation Analysis [17] has empirical support and seems to be language-universal [19].

Properties of sequence organization largely generalize to text-based chats as well [20], even though such conversations may follow slightly different patterns, which can be investigated using digital conversation analysis [21]. Humanizing chatbots to allow for more natural interaction is a popular way of reducing friction in human-chatbot interaction, and offers major benefits [22]. One way to achieve this is to explicitly anthropomorphize features of the interaction [23]; aligning digital with natural dialog patterns fits squarely within this tradition.

Since users steer the conversation in most applications, adjacency pair expansions by the second speaker are of primary interest for augmented language models, especially when used as conversational agents. Insert-expansions are such expansions, and can reduce both friction in interaction and divergence during reasoning. They either support the first base pair part of an adjacency pair or aim to bring about the second. Instances relevant to text-based conversations include clarifying intent, scoping responses, and enhancing appeal.

Unaugmented large language models display so-called formal linguistic competence, i.e., they can handle language itself. Where they are still lacking is functional linguistic competence: they cannot yet do everything humans do with language. This includes formal reasoning such as logic or math, using world knowledge, modeling situations across long narratives or discourses, and handling communicative intent, as in pragmatics or establishing common ground [24]. Popular early tools address exactly these gaps, and therefore include information retrieval from documents, search engines, and code interpreters including calculators [15].
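To make this mapping concrete, a tiny tool set along these lines might pair a working arithmetic evaluator (addressing the formal reasoning gap) with a stubbed retriever (the world-knowledge gap). The code below is an illustrative sketch using only Python's standard library, not tooling from the paper:

```python
# Sketch mapping functional-competence gaps to tools. The calculator is
# functional (safe arithmetic via the ast module); retrieval is a stub.

import ast
import operator as op

_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
        ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg}

def calculator(expression: str) -> float:
    """Formal reasoning gap: evaluate arithmetic without exec/eval."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

def retrieve(query: str) -> str:
    """World-knowledge gap: stub standing in for document retrieval."""
    return f"(stub) passages about {query!r}"

print(calculator("2 ** 10 / (3 + 1)"))  # 256.0
```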

Longer-lasting runs may be enabled by extending regular chain-of-thought prompting with plan-and-solve prompting, which can generate more structured reasoning paths that would otherwise have needed to be hard-coded [33]. For these longer-running calls, decoupling observations from reasoning may lessen the impact of divergence from the user intent due to misfitting observations [34]. However, where human supervision is feasible, insert-expansions may add further benefits even here.
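A rough sketch of what such decoupling could look like, with `call_llm` and `run_tool_step` as hypothetical stand-ins: the plan is drafted once up front, observations are collected per step without feeding back into the planner, and only the final call grounds the answer in all observations.

```python
# Sketch of plan-and-solve with decoupled observations: the plan is frozen
# after drafting, so noisy tool observations cannot steer later reasoning
# steps away from the user's intent. Both callables are hypothetical.

def plan_and_solve(task: str, call_llm, run_tool_step) -> str:
    plan = call_llm(
        f"Task: {task}\nDraft a numbered plan; do not solve anything yet."
    )
    steps = [line for line in plan.splitlines() if line.strip()]
    # Observations accumulate in a separate channel; the plan stays fixed.
    observations = [run_tool_step(step) for step in steps]
    return call_llm(
        f"Task: {task}\nPlan:\n{plan}\nObservations:\n"
        + "\n".join(observations)
        + "\nGive the final answer grounded in the observations."
    )
```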