Toward Conversational Agents with Context and Time Sensitive Long-term Memory

Paper · arXiv 2406.00057 · Published May 29, 2024

There has recently been growing interest in conversational agents with long-term memory, which has led to the rapid development of language models that use retrieval-augmented generation (RAG). Until recently, most work on RAG focused on information retrieval from large databases of text, such as Wikipedia, rather than from long-form conversations. In this paper, we argue that effective retrieval from long-form conversational data faces two unique problems compared to static database retrieval: 1) time/event-based queries, which require the model to retrieve information about previous conversations based on time or on the order of a conversational event (e.g., the third conversation on Tuesday), and 2) ambiguous queries that require surrounding conversational context to understand. To better develop RAG-based agents that can deal with these challenges, we generate a new dataset of ambiguous and time-based questions that builds upon a recent dataset of long-form, simulated conversations, and demonstrate that standard RAG-based approaches handle such questions poorly. We then develop a novel retrieval model that combines chain-of-table search methods, standard vector-database retrieval, and a prompting method for query disambiguation, and demonstrate that this approach substantially improves over current methods on these tasks.
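The combination described above can be sketched as a simple router: time/event queries are dispatched to structured search over conversation meta-data, while content queries go to standard vector retrieval. The sketch below is purely illustrative — the cue-word router, data structures, and token-overlap "retrieval" are stand-ins for the paper's LLM- and embedding-based components, not its implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    text: str
    date: str  # ISO date of the session, e.g. "2024-01-06"

# Crude cue-word router; a real system might use an LLM classifier instead.
TIME_CUES = ("yesterday", "last time", "this morning", "first ", "second ", "third ")

def is_metadata_query(query: str) -> bool:
    q = query.lower()
    return any(cue in q for cue in TIME_CUES)

def table_search(turns: List[Turn], day: str) -> List[Turn]:
    """Structured lookup over conversation meta-data (here: exact date match)."""
    return [t for t in turns if t.date == day]

def vector_search(turns: List[Turn], query: str) -> List[Turn]:
    """Stand-in for embedding similarity: token-overlap scoring."""
    q = set(query.lower().split())
    return sorted(turns,
                  key=lambda t: len(q & set(t.text.lower().split())),
                  reverse=True)
```

A query such as "what were we discussing yesterday?" would be routed to `table_search` over stored session dates, while "tell me about the drone design" would fall through to `vector_search`.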

1.        Conversational Meta-Data-Based Queries. In conversational contexts, a common sort of query refers to meta-data (e.g., time, date, or speaker) associated with previous conversations. For example, one could plausibly ask "what were we discussing yesterday morning, again?", "what was that idea we were working on last time?", or "summarize what Jason talked about in our meeting from January 6th." These questions do not specify what was talked about; instead, they ask the model to determine what was talked about given some meta-data (such as time) associated with a conversational event. Such questions point to a class of common questions a conversational agent may face, which cannot be answered without some ability to retrieve information about previous conversations based on conversational meta-data, rather than by semantic retrieval alone.

2.        Ambiguous Questions. In conversation, it is normal to use pronouns (he, she, it, they, etc.) and demonstratives ('this', 'that', etc.), which are ambiguous without an understanding of the preceding conversational context. Although resolving such references is trivial for LLMs during generation, these statements will fool naive RAG systems, as we discuss below.
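The first problem above can be made concrete with a meta-data lookup: answering "the third conversation on Tuesday" means filtering sessions by weekday and taking an ordinal, not searching by content. The data model and function below are invented purely for illustration.

```python
from datetime import date

def nth_session_on_weekday(sessions, weekday, n):
    """Return the transcript of the n-th session on a given weekday.

    sessions: list of (session_date, transcript) pairs in chronological order.
    weekday: 0 = Monday ... 6 = Sunday. n is 1-based.
    """
    matches = [transcript for day, transcript in sessions
               if day.weekday() == weekday]
    return matches[n - 1] if len(matches) >= n else None
```

No embedding of the query text is involved at any point; the answer depends only on the stored meta-data.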
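For the second problem, one remedy (obtained in this paper by prompting an LLM) is to rewrite the ambiguous query using the preceding context before retrieval. The toy example below, with a hard-coded rewrite and token-overlap scoring standing in for embeddings, only illustrates why the naive query fails while the rewritten one succeeds.

```python
# Token-overlap scoring stands in for embedding similarity; the "rewrite"
# is hard-coded here, whereas a real system would produce it by prompting
# an LLM with the recent conversational context.
def retrieve(query, docs):
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

history = ["we brainstormed a solar-powered drone design",
           "the weather was nice during our walk"]

ambiguous = "what was that idea again?"                      # demonstrative 'that'
rewritten = "what was the solar-powered drone design idea?"  # referent resolved
```

On the ambiguous query, the only overlapping token ("was") points at the wrong turn; after rewriting, the resolved referent dominates the score.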

There has been growing interest in testing information retrieval (IR) in conversational agents (e.g., [11, 19, 32, 14, 17, 29]). Unlike benchmarks that test IR on static datasets such as Wikipedia (e.g., [13, 30]), these benchmarks seek to test a conversational agent's ability to recall and report information that may have been communicated by the user far back in the conversation history (i.e., outside the immediate context of the model). Until very recently, however, most datasets of conversation logs were relatively short, involving only a few sessions/conversations between an agent and a single user (e.g., [11, 32, 14, 17, 29]). More recently, there have been efforts to create benchmarks that involve much longer dialogues to better test long-term memory abilities as opposed to in-context retrieval.



Temporal Reasoning in LLMs. Although the LoCoMo dataset [19] tests temporal reasoning, this reasoning only requires retrieving items via content-based retrieval; e.g., the question "how long did it take for Greg to write his novel?" only requires a semantic retrieval system to retrieve statements about novel writing along with their associated timestamps. Other tests of temporal reasoning in LLMs, such as the benchmark by [27], focus on LLMs' ability to reason about frequency, ordering, duration, causality, temporal differences, and other related temporal properties, e.g., multiple-choice questions like "Sort the following events in chronological order." or "Which of the following events is the longest?". These questions do not specifically require retrieving items from a memory bank based on their temporal properties (e.g., when they occurred), since they are usually one-off questions that provide the relevant events' names/descriptions within the question.
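To illustrate why such LoCoMo-style questions reduce to content retrieval plus arithmetic, consider timestamped memory entries (the data below is invented): once the relevant statements are surfaced semantically, the duration falls out of their timestamps, with no time-based retrieval required.

```python
from datetime import date

# Invented memory entries: (ISO timestamp, statement). A semantic retriever
# would surface the two "novel" statements; the temporal answer is then
# plain date arithmetic over their associated timestamps.
memory = [
    ("2023-02-01", "Greg said he started writing his novel"),
    ("2023-05-10", "Greg talked about his garden"),
    ("2023-09-15", "Greg said he finished his novel"),
]

def duration_days(memory, keyword):
    """Days between the earliest and latest statements matching keyword."""
    hits = [date.fromisoformat(ts) for ts, text in memory if keyword in text]
    return (max(hits) - min(hits)).days
```

Contrast this with the meta-data queries discussed earlier, where the timestamps themselves are the retrieval key rather than an attribute read off after retrieval.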