Multi-Task End-to-End Training Improves Conversational Recommendation

Paper · arXiv 2305.06218 · Published May 8, 2023

“The modern recommendation systems found in commercial applications are largely based on implicit preferences, such as a user’s history of web page clicks, item purchases, or media streams, with the record of these actions used to retrieve relevant recommendations (Rendle et al., 2012). This approach often works, but in the case where a user might not have an extensive history, or might desire a recommendation which doesn’t match their usual niche, we might want a system which can take advantage of explicit preferences. With the growing success of deep learning language models, it has become possible to design conversational recommendation models which can communicate with a user directly while retrieving custom recommendations based on the user’s explicit preferences.

Most previous work on conversational recommender systems adopts a multi-component approach (Gao et al., 2021). These models often are implemented using a recommendation component, which analyzes the mentioned entities in order to predict a related item, and a dialogue component, which analyzes the input phrases and generates a conversational response (Jannach et al., 2020). Multi-component approaches are appealing because they can be built directly from standard models in the dialogue and recommendation fields. However, the knowledge learned by each component is not immediately available to the other components (i.e., the item recommendation model does not benefit directly from conversation state, and vice versa), preventing these approaches from taking advantage of the data to its fullest extent.

This paper strives to show the feasibility of a unified model for conversational recommendations by leveraging a single large transformer model to generate both relevant recommendations and natural dialogue and evaluating the benefits of a fully unified dialogue and recommendation module.”

To bypass the need for a single large dialogue dataset, we fine-tune the pretrained T5 model on a relatively small dataset of dialogues and incorporate movie relationship, attribute, and description information from additional datasets through a multi-task setup.
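The multi-task setup described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the task names, example formats, and uniform mixing policy are assumptions made for the sake of the sketch.

```python
import random

# Hypothetical sketch: each dataset is converted to text-to-text examples,
# then batches are drawn from the mixed pool so one model trains on all tasks.
def make_example(task, source, target):
    return {"task": task, "input": source, "target": target}

dialogue_task = [
    make_example("dialogue",
                 "[SEEKER] I liked @ Inception @. Any suggestions?",
                 "You might enjoy @ Interstellar @, another Nolan film."),
]
attribute_task = [
    make_example("attributes",
                 "attributes: @ Interstellar @",
                 "genre: sci-fi, director: Christopher Nolan"),
]

def sample_batch(tasks, batch_size, seed=0):
    """Draw a mixed batch uniformly from the pooled examples of all tasks."""
    rng = random.Random(seed)
    pool = [ex for task in tasks for ex in task]
    return [rng.choice(pool) for _ in range(batch_size)]

batch = sample_batch([dialogue_task, attribute_task], batch_size=4)
```

In practice the mixing proportions between tasks matter; uniform pooling is just the simplest choice for illustration.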

While these challenges have been approached using a variety of multi-component models, we aim to demonstrate that a single-component transformer model can perform conversational recommendation, and can even benefit from cross-task transfer due to its unified design.

Chen et al. (2019) took this approach one step further, creating a conversational recommendation system that uses entities mentioned in the dialogue to search a knowledge graph for related items, then feeds a vocabulary bias derived from the user representation back into the transformer-based dialogue generation module. Although this model demonstrates the potential for transfer between dialogue and recommendation tasks, it requires a complex structure in which incomplete representations of the dialogue and recommendation features are passed to separate components and then joined with a switching network.

The main idea of our approach is to formulate the conversational recommendation task as an instance of the text-to-text problem. We fine-tune a pretrained transformer model on the movie recommendation dialogues contained in the ReDial dataset, and improve the model's ability to utilize movie attributes and descriptive details within the dialogues through the introduction of additional training tasks in a multi-task learning setting.

2.2 ReDial Dialogue Task

The ReDial (Recommendation Dialogues) dataset is an annotated set of 11,248 dialogues collected through Amazon Mechanical Turk (Li et al., 2018). Each dialogue contains the movies mentioned and the messages sent between two parties acting as either a "recommender" or a "recommendation seeker". Although this dataset is relatively small and does not capture as much movie relationship and attribute data as other recommendation-focused datasets, we have found that it provides enough examples for T5 to learn the style and structure of conversational recommendations.

For each conversation in the dataset, we create a training example for each response from the human recommender. The model input contains the conversation up to a given recommender utterance, and the target output is that recommender utterance. In this format, the T5 model can learn to parse relevant movie, attribute, and dialogue details from the previous messages in the conversation and formulate an appropriate response. We use T5's standard vocabulary, so movie titles are tokenized like any other part of the input; to help the model learn these titles, @ signs are used to set movie titles off from the rest of the dialogue.
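The example construction above can be sketched as a short routine. The turn format, speaker labels, and `[SPEAKER]` prefixes below are illustrative assumptions; only the one-example-per-recommender-utterance scheme and the @ title markers come from the text.

```python
def build_examples(conversation):
    """Build one (input, target) pair per recommender utterance.

    `conversation` is a list of (speaker, text) tuples; this flat format
    is an assumption for illustration, not the actual ReDial schema.
    Movie titles are already set off with @ signs, as in the paper.
    """
    examples = []
    history = []
    for speaker, text in conversation:
        if speaker == "recommender" and history:
            # Input: all prior turns; target: the recommender's next utterance.
            source = " ".join(f"[{s.upper()}] {t}" for s, t in history)
            examples.append((source, text))
        history.append((speaker, text))
    return examples

conv = [
    ("seeker", "Hi! I'm looking for a scary movie."),
    ("recommender", "Have you seen @ The Conjuring @?"),
    ("seeker", "Yes, I loved it!"),
    ("recommender", "Then try @ Insidious @, same director."),
]
pairs = build_examples(conv)  # two recommender turns -> two training examples
```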

As shown in Table 1, these reviews are processed into examples where the model is asked to predict the next sentence of a review given a movie title and the truncated review.
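A minimal sketch of this review task follows. The naive regex sentence splitter and the `review:` input prefix are assumptions for illustration; the source only specifies that the model sees a title plus a truncated review and predicts the next sentence.

```python
import re

def review_examples(title, review, min_context=1):
    """Turn one review into (input, target) pairs: given the title and the
    review truncated after k sentences, predict sentence k+1.

    Sentence splitting here is a naive end-of-punctuation regex, used only
    to keep the sketch self-contained.
    """
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", review.strip())
                 if s.strip()]
    examples = []
    for k in range(min_context, len(sentences)):
        context = " ".join(sentences[:k])
        examples.append((f"review: @ {title} @ {context}", sentences[k]))
    return examples

ex = review_examples(
    "Blade Runner",
    "A visually stunning film. The pacing is slow but deliberate. "
    "The ending still sparks debate.",
)
```

Each review of n sentences thus yields n - 1 next-sentence prediction examples.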