Large Language Models as Zero-Shot Conversational Recommenders

Paper · arXiv 2308.10053 · Published August 19, 2023

(1) Data: To gain insights into model behavior in “in-the-wild” conversational recommendation scenarios, we construct a new dataset of recommendation-related conversations by scraping a popular discussion website. This is the largest public real-world conversational recommendation dataset to date. (2) Evaluation: On the new dataset and two existing conversational recommendation datasets, we observe that even without fine-tuning, large language models can outperform existing fine-tuned conversational recommendation models. (3) Analysis: We propose various probing tasks to investigate the mechanisms behind the remarkable performance of large language models in conversational recommendation.

A typical conversational recommender contains two components [10, 41, 64, 74]: a generator that produces natural-language responses and a recommender that ranks items to meet users' needs.
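The two-component architecture can be sketched as a minimal interface. The class names, the keyword-overlap scorer, and the templated response below are illustrative assumptions, not the paper's actual models:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Conversation:
    """Dialog history as a list of utterances."""
    turns: List[str] = field(default_factory=list)

class Recommender:
    """Ranks candidate items against the conversation context."""
    def rank(self, conversation: Conversation, candidates: List[str]) -> List[str]:
        # Toy scorer: prefer items whose title words appear in the dialog.
        text = " ".join(conversation.turns).lower()
        return sorted(candidates,
                      key=lambda item: -sum(w in text for w in item.lower().split()))

class Generator:
    """Produces a natural-language response around the top-ranked item."""
    def respond(self, top_item: str) -> str:
        return f"You might enjoy {top_item}."

# Wiring the two components together:
conv = Conversation(["I loved The Matrix, any similar sci-fi?"])
ranked = Recommender().rank(conv, ["Inception", "The Matrix Reloaded", "Titanic"])
print(Generator().respond(ranked[0]))  # -> "You might enjoy The Matrix Reloaded."
```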

Data. We construct Reddit-Movie, a large-scale conversational recommendation dataset with over 634k naturally occurring recommendation-seeking dialogs from Reddit, a popular discussion forum. Different from existing crowd-sourced conversational recommendation datasets, such as ReDIAL [41] and INSPIRED [22], where workers role-play users and recommenders, the Reddit-Movie dataset offers a complementary perspective with conversations where users seek and offer item recommendations in the real world. To the best of our knowledge, this is the largest public conversational recommendation dataset, with 50 times more conversations than ReDIAL.

Existing conversational recommendation datasets are usually crowd-sourced [22, 32, 41, 75] and thus only partially capture realistic conversation dynamics. For example, a crowd worker in ReDIAL responded with "Whatever Whatever I'm open to any suggestion." when asked about movie preferences; this happens because crowd workers often do not have a particular preference at the time of completing a task. In contrast, a real user could have a very particular need.

We process all Reddit posts from January 2012 to December 2022.

We process the raw data with a pipeline of conversational recommendation identification, movie mention recognition, and movie entity linking.
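The three pipeline stages can be sketched as follows. This is a minimal illustration with assumed keyword/quote heuristics and a toy catalog; the paper's actual classifiers, mention recognizers, and linking method are not specified here:

```python
import re

# Hypothetical catalog mapping surface titles to canonical movie IDs.
CATALOG = {"the matrix": "tt0133093", "inception": "tt1375666"}

def is_recommendation_seeking(post: str) -> bool:
    """Stage 1: crude keyword filter for recommendation-seeking posts."""
    return bool(re.search(r"\b(recommend|suggest|similar to|movies like)\b", post, re.I))

def find_mentions(post: str) -> list:
    """Stage 2: recognize candidate movie mentions (here: quoted spans)."""
    return re.findall(r'"([^"]+)"', post)

def link_entities(mentions: list) -> list:
    """Stage 3: link recognized mentions to catalog entries, dropping unlinkable ones."""
    return [CATALOG[m.lower()] for m in mentions if m.lower() in CATALOG]

post = 'Can anyone recommend movies like "The Matrix"?'
if is_recommendation_seeking(post):
    print(link_entities(find_mentions(post)))  # -> ['tt0133093']
```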

Current evaluation for conversational recommendation systems does not differentiate between repeated and new items in a conversation. We observed that this evaluation scheme favors systems that optimize for mentioning repeated items. As shown in Figure 3, a trivial baseline that always copies seen items from the conversation history has better performance than most previous models under the standard evaluation scheme. This phenomenon highlights the risk of shortcut learning [18], where a decision rule performs well against certain benchmarks and evaluations but fails to capture the true intent of the system designer.
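The trivial copy baseline admits a very short implementation. This is a sketch of the idea (recommend previously mentioned items, most recent first); the paper's exact baseline may order or truncate differently:

```python
def repeat_baseline(context_items, k=5):
    """Trivial baseline: recommend items already mentioned in the
    conversation history, most recent first, deduplicated."""
    seen, ranked = set(), []
    for item in reversed(context_items):
        if item not in seen:
            seen.add(item)
            ranked.append(item)
    return ranked[:k]

# History mentions A, B, A, C; the baseline replays C, A, B.
print(repeat_baseline(["A", "B", "A", "C"]))  # -> ['C', 'A', 'B']
```

Because standard evaluation counts repeated ground-truth items as hits, such a baseline scores well without doing any recommendation at all.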

In this conversation, Terminator at the sixth turn is used as the ground-truth item. However, the system repeated Terminator because it quoted the movie for a content-based discussion during the conversation rather than as a recommendation. Given the nature of recommendation conversations between two users, it is more probable that items repeated during a conversation are intended for discussion rather than serving as recommendations. We argue that, considering the large portion of repeated items (e.g., more than 15% of ground-truth items in INSPIRED are repeated), it is beneficial to remove repeated items and re-evaluate CRS models to better understand their recommendation ability.
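The re-evaluation amounts to partitioning each conversation's ground truth before scoring. A minimal sketch (function name and tuple return are illustrative choices):

```python
def split_ground_truth(context_items, ground_truth):
    """Partition ground-truth items into 'repeated' (already mentioned
    in the conversation context) and 'new', so models can be scored
    on new items only."""
    mentioned = set(context_items)
    repeated = [g for g in ground_truth if g in mentioned]
    new = [g for g in ground_truth if g not in mentioned]
    return repeated, new

# Terminator was already discussed, so it is excluded from the 'new' set.
print(split_ground_truth(["Terminator", "Alien"], ["Terminator", "Blade Runner"]))
```

Evaluating on the `new` partition removes the incentive to exploit the repeated-item shortcut.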

Finding 3 - LLMs may generate out-of-dataset item titles, but few hallucinated recommendations. We note that language models trained on open-domain data naturally produce items outside the allowed item set during generation. In practice, removing these items improves the models' recommendation performance. Large language models outperform other models (with GPT-4 being the best) consistently, regardless of whether these unknown items are removed, as shown in Table 2. Meanwhile, Table 3 shows that around 95% of generated recommendations from GPT-based models (around 81% from BAIZE and 87% from Vicuna) can be found in IMDb by string matching. Since these matching rates are lower bounds, they indicate that there are only a few hallucinated item titles in LLM recommendations in the movie domain.
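A string-matching rate of this kind can be computed by normalizing titles and checking membership in a known-title index. This is a minimal sketch; the normalization rules here (lowercasing, punctuation stripping) are assumptions, and stricter rules would only lower the measured rate:

```python
import string

def normalize(title: str) -> str:
    """Lowercase and strip punctuation so near-identical surface forms match."""
    return title.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def match_rate(generated, known_titles):
    """Fraction of generated titles found in a known-title index.
    A lower bound on the true non-hallucination rate, since exact
    matching misses legitimate title variants."""
    index = {normalize(t) for t in known_titles}
    hits = sum(normalize(g) in index for g in generated)
    return hits / len(generated) if generated else 0.0

print(match_rate(["The Matrix", "Totally Made Up Movie!"],
                 ["The Matrix", "Inception"]))  # -> 0.5
```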

Experiment Setup.

Motivated by the probing work of [53], we posit that two types of knowledge in LLMs can be used in CRS:

• Collaborative knowledge, which requires the model to match items with similar ones, according to community interactions like “users who like A typically also like B”. In our experiments, we define the collaborative knowledge in LLMs as the ability to make accurate recommendations using item mentions in conversational contexts.

• Content/context knowledge, which requires the model to match recommended items with their content or context information. In our experiments, we define the content/context knowledge in LLMs as the ability to make accurate recommendations based on all other conversation inputs rather than item mentions, such as contextual descriptions, mentioned genres, and director names.

To understand how LLMs use these two types of knowledge, given the original conversation context 𝑆 (Example in Figure 1), we perturb 𝑆 with three different strategies as follows and subsequently re-query the LLMs. We denote the original as 𝑆0:

• S0 (Original): we use the original conversation context.

• S1 (ItemOnly): we keep mentioned items and remove all natural language descriptions in the conversation context.

• S2 (ItemRemoved): we remove mentioned items and keep other content in the conversation context.

• S3 (ItemRandom): we replace the mentioned items in the conversation context with items that are uniformly sampled from the item set I of this dataset, to eliminate the potential influence of 𝑆2 on the sentence grammar structure.
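The four context variants above can be sketched as one perturbation function. The `[ITEM]` placeholder markup is a schematic encoding assumed here for illustration, not the paper's actual preprocessing:

```python
import random

def perturb(turns, items, strategy, item_pool, seed=0):
    """Build the S0-S3 conversation variants. Item mentions in `turns`
    are assumed pre-marked as '[ITEM]' placeholders, paired in order
    with `items`."""
    text = " ".join(turns)
    if strategy == "ItemOnly":        # S1: keep item mentions, drop all prose
        return ", ".join(items)
    if strategy == "ItemRemoved":     # S2: drop item mentions, keep prose
        return " ".join(text.replace("[ITEM]", "").split())
    if strategy == "ItemRandom":      # S3: swap mentions for uniform samples
        rng = random.Random(seed)
        for _ in items:
            text = text.replace("[ITEM]", rng.choice(item_pool), 1)
        return text
    for item in items:                # S0: restore the original mentions
        text = text.replace("[ITEM]", item, 1)
    return text

turns = ["I loved [ITEM] , looking for something similar."]
print(perturb(turns, ["Alien"], "ItemOnly", ["Titanic"]))     # -> "Alien"
print(perturb(turns, ["Alien"], "ItemRemoved", ["Titanic"]))  # -> "I loved , looking for something similar."
```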

Finding 4 - LLMs mainly rely on content/context knowledge to make recommendations. Figure 5 shows a drop in performance for most models across various datasets when replacing the original conversation text Original (𝑆0) with other texts, indicating that LLMs leverage both content/context knowledge and collaborative knowledge in recommendation tasks. However, the importance of these knowledge types differs. Our analysis reveals that content/context knowledge is the primary knowledge utilized by LLMs in CRS. When using ItemOnly (𝑆1) as a replacement for Original, there is an average performance drop of more than 60% in terms of Recall@5. On the other hand, GPT-based models experience only a minor performance drop of less than 10% on average when using ItemRemoved (𝑆2) or ItemRandom (𝑆3) instead of Original. Although the smaller-sized model Vicuna shows a higher performance drop, it is still considerably milder than under ItemOnly. To more accurately reflect the recommendation abilities of LLMs with ItemRemoved and ItemRandom, we introduce a new post-processor, denoted as Φ2 (described in the caption of Figure 5). By employing Φ2, the performance gaps between Original and ItemRemoved (or ItemRandom) are further reduced. Furthermore, Figure 6 demonstrates a consistently small performance gap between Original and ItemRemoved (or ItemRandom) across testing samples that vary in size and in the number of item mentions in Original. These results suggest that, given a conversation context, LLMs primarily rely on content/context knowledge rather than collaborative knowledge to make recommendations. This behavior interestingly diverges from many traditional recommenders, such as collaborative filtering [23, 24, 36, 46, 55, 58] or sequential recommenders [25, 33, 59, 73], where user-interacted items are essential.
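The Recall@K metric and the relative drops quoted above can be computed as follows. A minimal sketch of the standard definitions (function names are illustrative):

```python
def recall_at_k(recommended, ground_truth, k=5):
    """Recall@K: fraction of ground-truth items appearing in the top-K
    recommendations."""
    top_k = set(recommended[:k])
    return sum(g in top_k for g in ground_truth) / len(ground_truth)

def relative_drop(original, perturbed):
    """Relative performance drop (%) when moving from the Original (S0)
    context to a perturbed one, as used in the probing comparison."""
    return 100.0 * (original - perturbed) / original

# E.g., Recall@5 falling from 0.10 (S0) to 0.04 (S1) is a 60% relative drop.
print(relative_drop(0.10, 0.04))
```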

Finding 5 - GPT-based LLMs possess better content/context knowledge than existing CRS. From Table 4, we observe the superior recommendation performance of GPT-based LLMs against representative conversational recommendation or text-only models on all datasets, showing remarkable zero-shot ability in understanding user preferences from textual inputs and generating correct item titles. We conclude that GPT-based LLMs can provide more accurate recommendations than existing trained CRS models in the ItemRemoved (𝑆2) setting, demonstrating better content/context knowledge.

In contrast, LLMs underperform existing representative CRS or ItemCF models by around 30% when using only the item-based conversation context ItemOnly (𝑆1).

The Reddit dataset has the most content/context information among the three conversational recommendation datasets. These observations also align with the results in Figure 5 and Table 4, where LLMs – which possess better content/context knowledge than the baselines – achieve higher relative improvements on Reddit than on the other two datasets. Meanwhile, the amount of content/context information in Reddit is close to that of question answering and conversational search datasets, and higher than that of existing conversational recommendation and chit-chat datasets.

We find that existing models relying solely on collaborative information are insufficient to provide satisfactory recommendations. We speculate that either (1) more advanced models or training methods are required to better comprehend the collaborative information in CRS datasets, or (2) the collaborative information in CRS datasets is too limited to support satisfactory recommendations.

Finding 10 - LLM recommendations suffer from popularity bias in CRS. Popularity bias refers to the phenomenon that popular items are recommended even more frequently than their popularity would warrant [8]. Figure 8 shows the popularity bias in LLM recommendations, though the bias is not necessarily toward the popular items of the target datasets. On ReDIAL, the most popular movies, such as Avengers: Infinity War, appear around 2% of the time among all ground-truth items; on Reddit, the most popular movies, such as Everything Everywhere All at Once, appear less than 0.3% of the time. But in the generated recommendations from GPT-4 (other LLMs share a similar trend), the most popular items, such as The Shawshank Redemption, appear around 5% of the time on ReDIAL and around 1.5% on Reddit. Compared to the target datasets, LLM recommendations are more concentrated on popular items, which may cause further issues like the bias amplification loop [8]. Moreover, the recommended popular items are similar across different datasets, which may reflect item popularity in the pre-training corpus of LLMs.
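The concentration comparison behind this finding reduces to counting how often the most frequent items occur among ground-truth items versus generated recommendations. A minimal sketch (function name is an illustrative choice):

```python
from collections import Counter

def top_item_share(recommendations, n=1):
    """Share of all recommendation slots taken by the n most frequently
    recommended items -- a simple popularity-concentration probe."""
    counts = Counter(recommendations)
    top = counts.most_common(n)
    return sum(c for _, c in top) / len(recommendations)

# One title filling 5 of 100 slots gives a top-1 share of 5%.
recs = ["The Shawshank Redemption"] * 5 + [f"Movie {i}" for i in range(95)]
print(top_item_share(recs))  # -> 0.05
```

Comparing this share between the ground-truth distribution and each model's output distribution quantifies how much the model over-concentrates on popular items.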

More recently, as natural language processing has advanced, the community developed "deep" CRS [10, 41, 64] that support interactions in natural language. Aside from collaborative filtering signals, prior work shows that CRS models benefit from various additional information. Examples include knowledge-enhanced models [10, 74] that make use of external knowledge bases [1, 47], review-aware models [49], and session/sequence-based models [43, 76]. Presently, UniCRS [64], a model built on DialoGPT [69] with prompt tuning [4], stands as the state-of-the-art approach on CRS datasets such as ReDIAL [41] and INSPIRED [22]. More recently, by leveraging LLMs, [16] proposes a new CRS pipeline but does not provide quantitative results, and [63] proposes better user simulators to improve evaluation strategies for LLMs. Unlike those papers, we uncover a repeated-item shortcut in the previous evaluation protocol and propose a framework where LLMs serve as zero-shot CRS, with detailed analyses to support our findings from both model and data perspectives.

We first address a repetition shortcut in previous standard CRS evaluations, which can potentially lead to unreliable conclusions regarding model design. Subsequently, we demonstrate that LLMs as zero-shot CRS surpass all fine-tuned existing CRS models in our experiments. Inspired by their effectiveness, we conduct a comprehensive analysis from both the model and data perspectives to gain insights into the working mechanisms of LLMs, the characteristics of typical CRS tasks, and the limitations of using LLMs as CRS directly.