Variational Autoencoders for Collaborative Filtering
Collaborative filtering is among the most widely applied approaches in recommender systems. Collaborative filtering predicts what items a user will prefer by discovering and exploiting the similarity patterns across users and items. Latent-factor models [13, 19, 38] still largely dominate the collaborative filtering research literature due to their simplicity and effectiveness. However, these models are inherently linear, which limits their modeling capacity. Previous work [27] has demonstrated that adding carefully crafted non-linear features to linear latent-factor models can significantly boost recommendation performance. Recently, a growing body of work has applied neural networks to the collaborative filtering setting with promising results [14, 41, 51, 54].
Here, we extend variational autoencoders (VAEs) [24, 37] to collaborative filtering for implicit feedback. VAEs generalize linear latent-factor models and enable us to explore non-linear probabilistic latent-variable models, powered by neural networks, on large-scale recommendation datasets. We propose a neural generative model with a multinomial conditional likelihood. Despite being widely used in language modeling and economics [5, 30], multinomial likelihoods appear less studied in the collaborative filtering literature, particularly within the context of latent-factor models. Recommender systems are often evaluated using ranking-based measures, such as mean average precision and normalized discounted cumulative gain [21]. Top-N ranking loss is difficult to optimize directly, and previous work on direct ranking-loss minimization resorts to relaxations and approximations [49, 50]. Here, we show that multinomial likelihoods are well suited to modeling implicit feedback data and are a closer proxy to the ranking loss than more popular likelihood functions such as the Gaussian and logistic.
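Concretely, a minimal sketch of such a generative model looks as follows (the notation here is assumed for illustration: z_u is user u's latent representation, f_θ a neural network, and N_u = Σ_i x_ui the total number of interactions from user u):

```latex
z_u \sim \mathcal{N}(0, I_K), \qquad
\pi(z_u) = \mathrm{softmax}\big(f_\theta(z_u)\big), \qquad
x_u \sim \mathrm{Mult}\big(N_u,\, \pi(z_u)\big)
```

Taking f_θ to be linear recovers a multinomial variant of a classical latent-factor model, which is the sense in which the VAE generalizes the linear case.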
Though recommendation is often considered a big-data problem (due to the huge numbers of users and items typically present in a recommender system), we argue that, in contrast, it represents a uniquely challenging “small-data” problem: most users interact with only a tiny proportion of the items, and our goal is to collectively make informed inferences about each user’s preferences. To make use of the sparse signals from users and avoid overfitting, we build a probabilistic latent-variable model that shares statistical strength among users and items. Empirically, we show that employing a principled Bayesian approach is more robust, regardless of the scarcity of the data.
Although VAEs have been extensively studied for image modeling and generation, there is surprisingly little work applying VAEs to recommender systems. We find that two adjustments are essential to getting state-of-the-art results with VAEs on this task:
• First, we use a multinomial likelihood for the data distribution. We show that this simple choice yields models that outperform those trained with the more commonly used Gaussian and logistic likelihoods (the three likelihoods are compared concretely in the code sketch below).
• Second, we reinterpret and adjust the standard VAE objective, which we argue is over-regularized (one concrete form of the adjustment is sketched after this list). We draw connections between the learning algorithm resulting from our proposed regularization and the information-bottleneck principle and maximum-entropy discrimination.
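A sketch of one such adjustment, under the assumption that over-regularization is addressed by down-weighting the KL term of the standard per-user ELBO by a factor β:

```latex
\mathcal{L}_\beta(x_u; \theta, \phi) =
\mathbb{E}_{q_\phi(z_u \mid x_u)}\!\big[\log p_\theta(x_u \mid z_u)\big]
\;-\; \beta \cdot \mathrm{KL}\big(q_\phi(z_u \mid x_u) \,\big\|\, p(z_u)\big)
```

Setting β = 1 recovers the standard VAE objective; choosing β < 1 weakens the pull toward the prior, which is the direction the over-regularization argument suggests.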
The result is a recipe that makes VAEs practical solutions to this important problem. Empirically, our methods significantly outperform state-of-the-art baselines on several real-world datasets, including two recently proposed neural-network approaches.
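To make the likelihood comparison in the first point concrete, the following is a minimal NumPy sketch (function names and the toy vectors are illustrative, not from the paper) of the three per-user log-likelihoods, each written up to an additive constant:

```python
import numpy as np

def multinomial_log_likelihood(x, logits):
    # sum_i x_i * log pi_i with pi = softmax(logits):
    # items compete for a probability budget that sums to 1.
    log_pi = logits - np.logaddexp.reduce(logits)  # numerically stable log-softmax
    return float(np.sum(x * log_pi))

def gaussian_log_likelihood(x, mean):
    # Squared-error reconstruction (unit variance): each entry is
    # scored independently, with no shared budget across items.
    return float(-0.5 * np.sum((x - mean) ** 2))

def logistic_log_likelihood(x, logits):
    # Per-item Bernoulli log-likelihood; again no competition across items.
    return float(np.sum(x * logits - np.logaddexp(0.0, logits)))

# Toy example: a user who clicked items 0 and 2 out of 5.
x = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
scores = np.array([2.0, -1.0, 1.5, -0.5, -2.0])
for ll in (multinomial_log_likelihood, gaussian_log_likelihood, logistic_log_likelihood):
    print(ll.__name__, ll(x, scores))
```

The structural difference the sketch highlights is that only the multinomial ties the items together through a normalized π, which is what makes it behave more like a ranking objective.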
The multinomial likelihood is less well studied in the context of latent-factor models such as matrix factorization and autoencoders. A notable exception is the collaborative competitive filtering (CCF) model [53] and its successors, which take advantage of more fine-grained information about which options were presented to which users. (If such information is available, it can also be incorporated into our VAE-based approach.) We believe the multinomial distribution is well suited to modeling click data. The likelihood of the click matrix (Eq. 2) rewards the model for putting probability mass on the non-zero entries in x_u. But the model has a limited budget of probability mass: since π(z_u) must sum to 1, the items must compete for this limited budget [53]. The model should therefore assign more probability mass to items that are more likely to be clicked, and to the extent that it can, it will perform well under the top-N ranking loss that recommender systems are commonly evaluated on.
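The likelihood the paragraph refers to (Eq. 2, not reproduced in this excerpt) has, up to the multinomial coefficient, which is constant in z_u, the form:

```latex
\log p_\theta(x_u \mid z_u) = \sum_i x_{ui} \log \pi_i(z_u),
\qquad \text{with } \sum_i \pi_i(z_u) = 1
```

The normalization constraint on the right is the limited budget described above: raising π_i(z_u) for one item necessarily lowers it for the others.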
Neural networks for collaborative filtering. Early work on neural-network-based collaborative filtering models focuses on explicit feedback data and evaluates on the task of rating prediction [11, 39, 41, 54]. The importance of implicit feedback has been gradually recognized, and consequently most recent research, such as this work, has focused on it. The two papers most closely related to our approach are the collaborative denoising autoencoder [51] and neural collaborative filtering [14].
The collaborative denoising autoencoder (CDAE) [51] augments the standard denoising autoencoder, described in Section 2.3, by adding a per-user latent factor to the input. The number of parameters of the CDAE model grows linearly with both the number of users and the number of items, making it more prone to overfitting. In contrast, the number of parameters in the VAE grows linearly with the number of items alone. CDAE also requires an additional optimization step to obtain the latent factors for unseen users at prediction time. In the paper, the authors investigate the Gaussian and logistic likelihood loss functions; as we show, the multinomial likelihood is significantly more robust for use in recommender systems. Neural collaborative filtering (NCF) [14] explores a model with non-linear interactions between the user and item latent factors, rather than the commonly used dot product. The authors demonstrate improvements of NCF over standard baselines on two small datasets. Like CDAE, the number of parameters of NCF grows linearly with both the number of users and the number of items; we find that this becomes problematic for much larger datasets. We compare with both CDAE and NCF in Section 4.
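To make the parameter-growth contrast concrete, here is a back-of-the-envelope count under purely hypothetical sizes (the numbers are illustrative; the counts ignore biases and any deeper architecture):

```python
# Hypothetical sizes, chosen only to illustrate the scaling argument.
U = 1_000_000   # users
I = 20_000      # items
K = 200         # latent dimension

# CDAE-style: per-user latent factors plus item-side weights
# (roughly one K-by-I encoder and one I-by-K decoder).
cdae_params = U * K + 2 * I * K   # grows with BOTH users and items

# VAE-style: encoder/decoder weights only; users are handled by the
# amortized inference network, so there are no per-user parameters.
vae_params = 2 * I * K            # grows with items only

print(f"CDAE-style: {cdae_params:,}")  # 208,000,000
print(f"VAE-style:  {vae_params:,}")   # 8,000,000
```

Under these (made-up) sizes, the per-user factors dominate the count by more than an order of magnitude, which is the overfitting concern raised above.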