Backtracing: Retrieving the Cause of the Query
While information retrieval (IR) systems may provide answers for such user queries, they do not directly help content creators, such as lecturers who want to improve their content, identify the segments that caused a user to ask those questions. We introduce the task of backtracing, in which systems retrieve the text segment that most likely caused a user query. We formalize three real-world domains for which backtracing is important for improving content delivery and communication: understanding the cause of (a) student confusion in the LECTURE domain, (b) reader curiosity in the NEWS ARTICLE domain, and (c) user emotion in the CONVERSATION domain. We evaluate the zero-shot performance of popular information retrieval and language modeling methods, including bi-encoder, re-ranking, and likelihood-based methods, as well as ChatGPT. While traditional IR systems retrieve semantically relevant information (e.g., details on “projection matrices” for the query “does projecting multiple times still lead to the same point?”), they often miss the causally relevant context (e.g., the lecturer stating “projecting twice gets me the same answer as one projection”). Our results show that there is room for improvement on backtracing and that it calls for new retrieval approaches.
We formalize this as a novel retrieval task called backtracing: given a query (e.g., a student question) and a corpus (e.g., a lecture transcript), the system must identify the sentence that most likely provoked the query. We formalize three real-world domains for which backtracing is important for improving content delivery and communication. First is the LECTURE domain, where the goal is to retrieve the cause of student confusion; the query is a student’s question and the corpus is the lecturer’s transcript. Second is the NEWS ARTICLE domain, where the goal is to retrieve the cause of a reader’s curiosity; the query is a reader’s question and the corpus is the news article. Third is the CONVERSATION domain, where the goal is to retrieve the cause of a user’s emotion (e.g., anger); the query is the user’s conversation turn expressing that emotion and the corpus is the complete conversation. Figure 2 illustrates an example from each of these domains. These diverse domains showcase the applicability and common challenges of backtracing for improving content generation, similar to heterogeneous IR datasets like BEIR (Thakur et al., 2021).
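To make the setup concrete, the following is a minimal sketch of the backtracing interface with a bi-encoder scorer as an illustrative zero-shot baseline; the sentence-transformers library and the model name are assumptions for illustration, not the exact systems evaluated in this work.

```python
# Minimal sketch of backtracing: given a query and a corpus of sentences,
# return the index of the sentence most likely to have caused the query.
# The bi-encoder below is an assumed off-the-shelf model, used only to
# illustrate a semantic-relevance baseline.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def backtrace(query: str, corpus: list[str]) -> int:
    """Return the index of the corpus sentence with the highest score for the query."""
    query_emb = encoder.encode(query, convert_to_tensor=True)
    corpus_emb = encoder.encode(corpus, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]  # semantic relevance scores
    return int(scores.argmax())                      # predicted cause sentence

# LECTURE-style example: the query is a student question, the corpus is the transcript.
transcript = [
    "Today we cover projection matrices.",
    "Projecting twice gets me the same answer as one projection.",
    "Next, we turn to least squares.",
]
print(backtrace("Does projecting multiple times still lead to the same point?", transcript))
```

A likelihood-based variant would replace the cosine score with a language model score such as the likelihood of the query conditioned on each candidate sentence, keeping the same argmax interface.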
• We propose a new task called backtracing where the goal is to retrieve the cause of the query from a corpus. This task targets the information need of content creators who wish to improve their content in light of questions from information seekers.
• We formalize a benchmark consisting of three domains for which backtracing plays an important role in identifying the context triggering a user’s query: retrieving the cause of student confusion in the LECTURE setting, reader curiosity in the NEWS ARTICLE setting, and user emotion in the CONVERSATION setting.
CONVERSATION We use two-person conversations annotated at the level of conversation turns with emotions, such as anger and fear, and with the cause of each emotion. Conversations are natural settings for human interaction where a speaker may accidentally say something that evokes strong emotions like anger. These emotions may arise from cumulative or non-adjacent interactions, such as the example in Figure 2. While identifying content that evokes the emotion expressed in a query differs from identifying content that causes confusion, the ability to handle both is key to general and effective backtracing systems that retrieve information based on causal relevance. Identifying utterances that elicit certain emotions can pave the way for better emotional intelligence in systems and refined conflict resolution tools. We adapt the conversation dataset from Poria et al. (2021), which contains turn-level annotations for each emotion and its cause and is designed for recognizing the cause of emotions. The query is a speaker’s conversation turn annotated with an emotion, and the corpus is all of the conversation turns. To ensure there are enough distractor sentences, we use conversations with at least 5 sentences and take the last annotated utterance in each conversation as the query. The final dataset contains 671 examples.
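As a rough illustration of this construction, the sketch below assembles one example from a turn-annotated conversation; the field names (turns, emotion, cause_turn) are assumptions about the annotation schema rather than the dataset's actual keys.

```python
# Hedged sketch of building a CONVERSATION example from turn-level annotations
# in the style of Poria et al. (2021). Field names are assumed, not actual keys.
def build_example(conversation):
    turns = conversation["turns"]          # assumed list of {"text", "emotion", "cause_turn"}
    if len(turns) < 5:                     # require enough distractor sentences
        return None
    annotated = [i for i, t in enumerate(turns) if t.get("emotion")]
    if not annotated:
        return None
    q = annotated[-1]                      # last emotion-annotated utterance is the query
    return {
        "query": turns[q]["text"],                 # turn expressing the emotion
        "corpus": [t["text"] for t in turns],      # all conversation turns
        "cause_index": turns[q]["cause_turn"],     # ground-truth cause turn
    }
```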
How similar is the query to the cause? To answer this question, we plot the semantic similarity of the query to the ground-truth cause sentence (GT) in Figure 4. We additionally plot the maximal similarity of the query to any corpus sentence (Max) and the difference between the ground-truth and maximal similarity (Diff).
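A sketch of how GT, Max, and Diff could be computed per example is given below; the bi-encoder used for the similarity scores is again an assumed choice, and Diff is taken as the gap between the maximal and ground-truth similarities.

```python
# Compute (GT, Max, Diff) for one example: similarity of the query to the
# ground-truth cause, to the most similar corpus sentence, and their gap.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def similarity_stats(query, corpus, cause_index):
    q = encoder.encode(query, convert_to_tensor=True)
    c = encoder.encode(corpus, convert_to_tensor=True)
    sims = util.cos_sim(q, c)[0]
    gt = float(sims[cause_index])   # GT: similarity to the ground-truth cause
    mx = float(sims.max())          # Max: maximal similarity to any corpus sentence
    return gt, mx, mx - gt          # Diff: gap between Max and GT
```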
Additionally, as indicated by the green bars, CONVERSATION and LECTURE have the largest differences between the ground-truth and maximal similarity sentences, whereas NEWS ARTICLE has the smallest. This suggests that a given document may contain multiple passages that share a surface-level resemblance with the query, but in the CONVERSATION and LECTURE domains the majority of these do not cause the query. In the NEWS ARTICLE domain, the query and cause sentence exhibit higher semantic similarity because the queries are typically short and mention the event or noun of interest. Altogether, this analysis brings forth a key insight: semantic relevance does not always imply causal relevance.
For the NEWS ARTICLE domain, there is a noticeable peak at the beginning of the documents, which suggests that little context is needed to identify the cause. This aligns with the typical structure of news articles, where crucial information is introduced early to capture the reader’s interest. As a result, readers may have immediate questions from the outset. Conversely, in the CONVERSATION domain, the distribution peaks at the end, suggesting that more context from the conversation is needed to identify the cause. Finally, in the LECTURE domain, the distribution is relatively uniform, which suggests a broader contextual dependence.
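One way to reproduce this positional analysis is to bin the relative location of each cause sentence within its document, as in the sketch below (the example format follows the assumed schema used earlier).

```python
# Relative position of the cause sentence within its document, for one domain.
import numpy as np

def relative_positions(examples):
    """examples: dicts with 'corpus' (list of sentences) and 'cause_index'."""
    return np.array([ex["cause_index"] / max(len(ex["corpus"]) - 1, 1)
                     for ex in examples])

# e.g., np.histogram(relative_positions(news_examples), bins=10, range=(0, 1))
# should show the early peak for NEWS ARTICLE and the late peak for CONVERSATION.
```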