Leveraging Large Language Models in Conversational Recommender Systems
Effectively leveraging LLMs within a CRS introduces new technical challenges, including properly understanding and controlling a complex conversation and retrieving from external sources of information. These issues are exacerbated by a large, evolving item corpus and a lack of conversational data for training. In this paper, we provide a roadmap for building an end-to-end large-scale CRS using LLMs. In particular, we propose new implementations for user preference understanding, flexible dialogue management and explainable recommendations as part of an integrated architecture powered by LLMs. For improved personalization, we describe how an LLM can consume interpretable natural language user profiles and use them to modulate session-level context. To overcome conversational data limitations in the absence of an existing production CRS, we propose techniques for building a controllable LLM-based user simulator to generate synthetic conversations. As a proof of concept, we introduce RecLLM, a large-scale CRS for YouTube videos built on LaMDA, and demonstrate its fluency and diverse functionality through illustrative example conversations.
One of the appeals of LLMs is their naturalness and unpredictability, but when operating in a task-oriented setting this means that controlling an LLM can be more difficult than controlling a template-based system. Particularly challenging in the recommendation setting is how to interface between the LLM and the underlying recommendation engine. One approach is to have the LLM act as the recommendation engine in addition to its role as a dialogue agent (see e.g. [42]). However, for large-scale recommender applications the item corpus can contain millions or billions of constantly changing items, making it challenging for an LLM to memorize the corpus within its parameters. Alternatively, the LLM must somehow connect to an external recommendation engine or database and pass on relevant preference information.
• A dialogue management module that reframes natural language generation, preference understanding, context tracking, and calls to a recommendation engine as a unified language modeling task performed by a single LLM.
• A general conceptual framework for performing retrieval with an LLM over a huge corpus of items. Various solutions are presented depending on efficiency requirements and what data and external APIs are available.
• A joint ranking / explanation module that uses an LLM to extract user preferences from an ongoing conversation and match them to textual artifacts synthesized from item metadata. As a byproduct of intermediate chain-of-thought reasoning [95], the LLM generates natural language justifications for each item shown to the user, increasing the transparency of the system.
• Incorporation of persistent, interpretable natural language user profiles as additional input to system LLMs, which supplements session-level context and improves the personalized experience.
• Techniques for building controllable LLM-based user simulators that can be used to generate synthetic conversations for tuning system modules.
3.2.1 Retrieval. The purpose of the retrieval phase is to take the full corpus, which for some domains such as videos or URLs may contain hundreds of millions of items, and based on the context select a small number of candidate items (e.g. 100) that will be fed to a downstream ranker.
…
Generalized Dual Encoder Model. A popular solution to retrieval in traditional deep learning based recommenders is to use a dual encoder model consisting of two neural net towers, one to encode the context and one to encode the items (see e.g. [102] and Figure 10a). Item embeddings can be generated offline using the item tower and stored in an efficient data structure. An approximate nearest neighbor lookup can then use the generated context embedding to perform a sub-linear time retrieval of item embeddings at inference time [98].
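As a minimal sketch of this setup, with fixed random projections standing in for the trained towers and a brute-force scan standing in for the approximate nearest neighbor index (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two towers: in practice these are
# trained neural nets; here, fixed random projections for illustration.
D_FEAT, D_EMB = 16, 8
W_context = rng.normal(size=(D_FEAT, D_EMB))
W_item = rng.normal(size=(D_FEAT, D_EMB))

def encode_context(x):
    return x @ W_context

def encode_items(X):
    return X @ W_item

# Offline step: embed the full item corpus once and store the result.
corpus_features = rng.normal(size=(1000, D_FEAT))
item_embeddings = encode_items(corpus_features)  # shape (1000, 8)

def retrieve(context_features, k=100):
    """Score every item by inner product with the context embedding.

    A brute-force scan stands in for the sub-linear approximate
    nearest neighbor lookup used at scale."""
    scores = item_embeddings @ encode_context(context_features)
    return np.argsort(-scores)[:k]

candidates = retrieve(rng.normal(size=D_FEAT), k=5)
```

At production scale the brute-force scan would be replaced by an ANN library, which is precisely why the item embeddings are precomputed offline.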
…
One downside to this approach of pulling embeddings from the internals of an LLM is that it severely hampers our ability to learn a retrieval model in a sample-efficient way. Dual encoder models trained from scratch require large amounts of training data to constrain the context tower embeddings to occupy the same subspace as the item tower embeddings.
…
Direct LLM Search. In this method the LLM directly outputs the ids or titles of items to recommend as text. Search then reduces to an exact or fuzzy match against items in the corpus, and the recommendation engine plays no role beyond this simple matching. The LLM must learn to output these ids/titles through some combination of its pretraining
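The matching step can be sketched as follows, using Python's `difflib` for fuzzy matching over a hypothetical toy corpus of titles (the titles and ids are invented for illustration):

```python
from difflib import get_close_matches

# Toy stand-in corpus mapping item titles to ids.
corpus = {
    "beginner watercolor tutorial": "vid_001",
    "speedrunning classic platformers": "vid_002",
    "lofi beats to study to": "vid_003",
}

def resolve_llm_output(generated_title, cutoff=0.6):
    """Map the LLM's free-text title to a real corpus item.

    Try an exact match first, then fall back to fuzzy string
    matching. Returns the item id, or None if nothing in the
    corpus is close enough."""
    if generated_title in corpus:
        return corpus[generated_title]
    matches = get_close_matches(generated_title, corpus.keys(),
                                n=1, cutoff=cutoff)
    return corpus[matches[0]] if matches else None

# A slightly misspelled generation still resolves to a real item.
item_id = resolve_llm_output("beginner watercolour tutorial")
```

The fallback to `None` matters in practice: an LLM can hallucinate plausible-sounding titles that do not exist in the corpus, and those generations must be filtered out rather than matched to an arbitrary item.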
…
Concept Based Search. In this method the LLM outputs a list of concepts, which are then embedded and aggregated by the recommendation engine into a single context embedding. This embedding is used to look up items through approximate k-nearest neighbor search. Extracting relevant concepts from a conversation is a natural task that can be taught to an LLM through in-context learning or tuning.
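A rough sketch of this pipeline, with hypothetical precomputed embedding tables, mean pooling as the aggregation step, and brute-force cosine similarity standing in for approximate kNN:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8

# Hypothetical embedding tables; in practice these come from a
# trained model that embeds concepts and items into a shared space.
concept_embeddings = {c: rng.normal(size=D) for c in
                      ["jazz", "piano", "relaxing", "workout", "tutorial"]}
item_embeddings = rng.normal(size=(500, D))

def concept_search(concepts, k=10):
    """Aggregate LLM-extracted concepts into one context embedding
    (mean pooling here) and return the k most similar items.

    Brute-force cosine similarity stands in for approximate kNN."""
    vecs = [concept_embeddings[c] for c in concepts if c in concept_embeddings]
    query = np.mean(vecs, axis=0)
    sims = (item_embeddings @ query) / (
        np.linalg.norm(item_embeddings, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:k]

top = concept_search(["jazz", "piano", "relaxing"], k=10)
```

Mean pooling is only one choice of aggregation; any function that maps a set of concept embeddings to a single query vector fits this framework.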
Search API Lookup. In this method, the LLM directly outputs a search query, which gets fed into a black-box search API to retrieve items.
Ranking / Explanations. After candidate items have been retrieved, a ranker decides which of them will be included in the recommendation slate and in what order.
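As an illustration only, the following stands in for the LLM-based ranker with a naive term-overlap score between the user's stated preferences and each item's metadata text; the real module would use an LLM for the matching and would also emit a natural language justification per item:

```python
def rank_candidates(preferences, candidates, slate_size=3):
    """Order retrieved candidates by a toy term-overlap score
    between preference text and item metadata text.

    A crude stand-in for an LLM ranker; the items and fields
    here are invented for illustration."""
    pref_terms = set(preferences.lower().split())

    def score(item):
        return len(pref_terms & set(item["metadata"].lower().split()))

    return sorted(candidates, key=score, reverse=True)[:slate_size]

candidates = [
    {"title": "Calm piano for studying",
     "metadata": "relaxing piano instrumental study"},
    {"title": "Metal workout mix",
     "metadata": "high energy metal workout"},
    {"title": "Jazz piano trio live",
     "metadata": "jazz piano live session relaxing"},
]
slate = rank_candidates("relaxing jazz piano", candidates, slate_size=2)
```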
One of the key advantages of a CRS is that the user can articulate their preferences over the course of a session, so that the system can assist them without necessarily needing any prior background information. Even so, the personalized experience can be improved if the system has built up a profile of the user beforehand, giving the conversation a shared starting point to build on.
In traditional deep learning based recommender systems, nonverbal interaction signals such as clicks or ratings are often used to train embedding representations of a user that can be fed into a neural net. In RecLLM we instead represent users with natural language profiles (see e.g. [70]), which can be consumed by an LLM. These are more transparent compared to embeddings and specific pieces of information can usually be attributed to an original source, which aids in explainability.
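One way such a profile might be organized, keeping each piece of information attributable to its original source; the fields, facts, and layout here are hypothetical, not RecLLM's actual format:

```python
# Hypothetical profile entries, each paired with the source it was
# derived from, rendered into one natural language string for the LLM.
profile_facts = [
    ("enjoys space documentaries", "watch history, 2023-04"),
    ("prefers videos under 20 minutes", "survey response"),
]

def render_profile(facts):
    """Flatten the facts into a profile string the LLM can consume."""
    return "User profile: " + "; ".join(fact for fact, _ in facts) + "."

def explain_fact(facts, fact):
    """Attribution: recover the source behind a specific profile fact,
    which is what makes a text profile more transparent than an
    opaque embedding."""
    for f, source in facts:
        if f == fact:
            return source
    return None

profile = render_profile(profile_facts)
```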
The ideal property we would like our user simulator to have when synthetically generating data for evaluation or training is realism: Conversations between the user simulator and CRS should be nearly indistinguishable from conversations between a representative group of real users and the CRS. Let R be a set of sessions generated by having real users interact with a particular CRS, and Q be a set of simulated sessions sampled from the CRS and a user simulator 𝑓 according to the procedure outlined above. We offer three possible ways to measure the realism of 𝑓:
• Have crowdsource workers attempt to distinguish between simulated sessions coming from Q and real sessions coming from R.
• Train a discriminator model [28] on the same differentiation task.
• Let 𝑔(𝑆) → [1, 𝑘] be a function that classifies a session into 𝑘 categories and let 𝐺 = {𝑔𝑖} be an ensemble of such classifiers. One way to define such an ensemble is by adapting dialogue state tracking artifacts used within the dialogue management module of a CRS (see Section 3.1). For instance, we can have a classifier that labels the user intent at a specific turn, or the topics that are covered within a session, or the primary sentiment of a session. Once defined, we can measure how close the distributions Q and R are by matching statistics according to the classifier ensemble 𝐺.
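The third option might be computed roughly as follows, using total variation distance between the label distributions each classifier induces on R and Q; the session representation and classifier below are toy stand-ins:

```python
from collections import Counter

def label_distribution(sessions, classifier, k):
    """Empirical distribution over labels 1..k that a classifier
    assigns to a set of sessions."""
    counts = Counter(classifier(s) for s in sessions)
    n = len(sessions)
    return [counts.get(label, 0) / n for label in range(1, k + 1)]

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def realism_gap(real_sessions, sim_sessions, classifiers, k):
    """Average total variation distance between the label
    distributions the ensemble induces on R and on Q; smaller
    means simulated sessions look statistically more like real
    ones under these classifiers."""
    gaps = [total_variation(label_distribution(real_sessions, g, k),
                            label_distribution(sim_sessions, g, k))
            for g in classifiers]
    return sum(gaps) / len(gaps)

# Toy illustration: sessions are ints, one classifier maps them to {1, 2}.
g = lambda s: 1 if s % 2 == 0 else 2
gap = realism_gap([0, 2, 4, 5], [1, 3, 5, 6], [g], k=2)
```

Total variation is one convenient choice of distance; any divergence between the two empirical label distributions would serve the same purpose.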
A necessary condition of realism is diversity: Simulated sessions from Q should have sufficient variation to invoke all the different functionality of a CRS that users will encounter in practice. In certain situations measuring realism directly may be difficult, for instance if collecting a representative set of real user sessions is infeasible; in such cases diversity can serve as a more easily measurable proxy.
Controlled Simulation. Our starting point for building a user simulator is the observation that an unconstrained LLM built for dialogue such as LaMDA [86] can interact with a CRS in a similar way to real users. The LLM takes as input the full history of the ongoing conversation and outputs the next user utterance, analogous to how a CRS dialogue manager can use an LLM to generate system utterances. However, we would like to exert greater control over the simulator to increase its realism. In controlled simulation, we condition the user simulator on additional latent (to the CRS) variables that allow us to guide its behavior in a certain direction. We explore two variations:
• Session-level control: A single variable 𝑣 is defined at the beginning of the session and is used to condition the user simulator throughout the session. For instance, we could define 𝑣 as a user profile such as the ones discussed in Section 3.3.
• Turn-level control: A distinct variable 𝑣𝑖 is defined at each turn of the session and is used to condition the simulator for that turn. For instance, we could define each 𝑣𝑖 to be a user intent for the simulator to adopt at that turn.
In a priming implementation, we translate the control variable into text that can be included as part of the simulator's input along with the rest of the conversation. For instance, for the user profile example we could append the statement "I am a twelve year old boy who enjoys painting and video games" to the beginning of the conversation to induce the LLM to imitate this personality.
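A minimal sketch of this priming step, assuming a simple line-based prompt format; the layout is illustrative, not the actual prompt format used:

```python
def prime_simulator(control_text, conversation_turns):
    """Session-level control by priming: prepend a first-person
    statement rendered from the latent variable, then cue the LLM
    to generate the next user utterance."""
    lines = [control_text]
    for speaker, utterance in conversation_turns:
        lines.append(f"{speaker}: {utterance}")
    lines.append("User:")  # the LLM completes from here
    return "\n".join(lines)

sim_input = prime_simulator(
    "I am a twelve year old boy who enjoys painting and video games.",
    [("System", "Hi! What would you like to watch today?")],
)
```

Turn-level control works the same way, except the conditioning text is rendered fresh from 𝑣𝑖 before each user turn rather than fixed once per session.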
Generating Synthetic Training Data. To use a user simulator to generate data for supervised training of one of the CRS system modules, an additional property is needed: ground truth labels that the system can learn from.
Suppose, for example, that we would like each session 𝑆𝑖 to carry a ground truth label for its primary user sentiment, drawn from a set of possible labels 𝐿, e.g. {angry, satisfied, confused, ...}. We can use controlled user simulation to solve this problem by defining a session-level variable 𝑣 over this set of labels 𝐿. First we sample a value 𝑣 from 𝐿 (e.g. "angry") and then condition the simulator on this label, for instance in a priming implementation by appending the message "You are an angry user" to the beginning of the simulator's input. If we are able to solve this LLM control problem effectively, then we can attach the label 𝑙𝑖 = "angry" to the session 𝑆𝑖 and trust that with high probability it will be accurate.
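This labeling procedure can be sketched as follows, with a stub standing in for the simulator/CRS interaction loop (the label set and priming template are taken from the example above; everything else is illustrative):

```python
import random

LABELS = ["angry", "satisfied", "confused"]

def generate_labeled_session(simulate_session, seed=None):
    """Sample a session-level sentiment label, prime the simulator
    with it, and attach the label as ground truth to the generated
    session. `simulate_session` stands in for the full loop in
    which the simulator and CRS exchange turns."""
    rng = random.Random(seed)
    label = rng.choice(LABELS)
    article = "an" if label[0] in "aeiou" else "a"
    priming = f"You are {article} {label} user."
    session = simulate_session(priming)
    return {"session": session, "label": label}

# Toy stand-in that just records the priming message it was given.
example = generate_labeled_session(lambda priming: [priming], seed=0)
```

The reliability of the attached label rests entirely on how well the priming actually controls the simulator, which is why the realism measures above matter for training data quality too.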