Embarrassingly Shallow Autoencoders for Sparse Data
Many recent improvements in collaborative filtering can be attributed to deep learning approaches, e.g., [5, 7–9, 13, 21, 25, 26]. Unlike in areas like computer vision, however, it was found that a small number of hidden layers achieved the best recommendation accuracy. In this paper, we take this to the extreme and define a linear model without a hidden layer (see Figure 1). The (binary) input vector indicates which items a user has interacted with, and the model's objective (in its output layer) is to predict the best items to recommend to the user. This is done by reproducing the input as its output, as is typical for autoencoders. We hence named it Embarrassingly Shallow AutoEncoder (in reverse order: easer).
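As a minimal sketch of this shallow architecture (the toy matrices and variable names below are ours, for illustration only):

```python
import numpy as np

# Hypothetical toy setup: 3 users, 4 items; X[u, i] = 1 iff user u
# interacted with item i (binary implicit feedback).
X = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1]], dtype=float)

# B is the |I| x |I| item-item weight matrix learned by easer
# (random here for illustration; its diagonal is constrained to zero).
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
np.fill_diagonal(B, 0.0)

# Forward pass of the shallow autoencoder: the model reproduces the
# input as its output; S[u, i] is the predicted score of item i for user u.
S = X @ B

# Recommend the highest-scoring items the user has not yet interacted with.
S[X > 0] = -np.inf
top_k = np.argsort(-S, axis=1)[:, :2]
```

Note that there is no hidden layer at all: prediction is a single matrix product between the input vector and the item-item weight matrix.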
A positive value (typically 1) in X indicates that the user interacted with an item, while a value of 0 indicates that no interaction has been observed. The parameters of the easer model are given by the item-item weight matrix $B \in \mathbb{R}^{|I| \times |I|}$. Note that this is also similar to neighborhood-based approaches, see Section 4.3. In this weight matrix, the self-similarity of an item in the input layer with itself in the output layer is forbidden, so as to force the model to generalize when reproducing the input as its output (see Figure 1): hence the diagonal of the weight matrix is constrained to zero, $\operatorname{diag}(B) = 0$. This constraint is crucial, and is discussed in detail in the remainder of this paper. It was first introduced in the Slim model [16].
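For concreteness, the convex training objective this constraint applies to can be written as follows (a sketch consistent with the definitions above, where $\lambda$ denotes the L2-norm regularization strength):

\[
\min_{B} \; \|X - XB\|_F^2 \;+\; \lambda \cdot \|B\|_F^2
\quad \text{s.t.} \quad \operatorname{diag}(B) = 0 ,
\]

where $\|\cdot\|_F$ denotes the Frobenius norm. Without the zero-diagonal constraint, the trivial solution $B = I$ would reproduce the input perfectly while generalizing not at all.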
This shows that the precision matrix is the conceptually correct similarity matrix to use, rather than the covariance matrix (or some rescaled variant thereof), which is typically used in state-of-the-art neighborhood-based approaches (see Section 4.3).
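To make this explicit, the closed-form solution derived in Section 3.1 can be sketched in terms of the (regularized) precision matrix, with $\lambda$ the L2-regularization strength from the training objective:

\[
\hat{P} = \left( X^\top X + \lambda I \right)^{-1},
\qquad
\hat{B}_{ij} =
\begin{cases}
0 & \text{if } i = j ,\\[2pt]
-\dfrac{\hat{P}_{ij}}{\hat{P}_{jj}} & \text{otherwise,}
\end{cases}
\]

i.e., the learned item-item weights are (negated, column-rescaled) entries of the precision matrix, not of the covariance (Gram) matrix $X^\top X$.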
This paper is organized as follows: we define easer in the next section, using simple elements from the literature. In Section 3.1, we derive the closed-form solution of its convex training objective. This has several implications: (1) it reveals that the neighborhood-based approaches used in collaborative filtering are based on conceptually incorrect item-item similarity matrices, while easer may be considered a principled neighborhood model, see Sections 3.2 and 4.3; (2) the code for training easer comprises only a few lines, see Section 3.3 and Algorithm 1; (3) if the model fits into memory, the wall-clock time for training easer can be several orders of magnitude less than for training a Slim [16] model (see Section 3.4), which is the most similar model to easer. Apart from that, we surprisingly found that easer achieved competitive ranking accuracy, and even outperformed various deep, non-linear, or probabilistic models as well as neighborhood-based approaches on most of the publicly available data sets used in our experiments (see Section 5).
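As an illustration of point (2), here is a minimal numpy sketch of the closed-form training, mirroring the structure of Algorithm 1 (the function name and the value of `lam` are our own illustrative choices):

```python
import numpy as np

def train_easer(X, lam=500.0):
    """Closed-form training of easer.

    X   : user-item interaction matrix of shape (|U|, |I|), binary.
    lam : L2-norm regularization strength (dataset-dependent;
          500.0 is only an illustrative value).
    """
    G = X.T @ X + lam * np.identity(X.shape[1])  # regularized Gram matrix
    P = np.linalg.inv(G)                         # (regularized) precision matrix
    B = P / (-np.diag(P))                        # divide column j by -P_jj
    np.fill_diagonal(B, 0.0)                     # enforce diag(B) = 0
    return B
```

The only expensive step is a single matrix inversion of an |I| x |I| matrix, which explains the large wall-clock advantage over the iterative per-column optimization of Slim when the model fits into memory.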
While the area of collaborative filtering has long been dominated by matrix factorization approaches, recent years have witnessed a surge in deep learning approaches [5, 7–9, 13, 21, 25, 26], spurred by their great successes in other fields. Autoencoders provide the model architecture that exactly fits the (plain-vanilla) collaborative filtering problem. While various network architectures have been explored, it was found that deep models with a large number of hidden layers typically do not obtain a notable improvement in ranking accuracy in collaborative filtering, compared to 'deep' models with only one, two, or three hidden layers, e.g., [7, 13, 21, 26], which is in stark contrast to other areas, like computer vision. A combination of deep and shallow elements in a single model was proposed in [5].
In contrast, easer has no hidden layer. Instead, the self-similarity of each item in the input and output layer is constrained to zero, see also Figure 1. As a consequence, the model is forced to learn the similarity of an item in the output layer in terms of the other items in the input layer. The surprisingly good empirical results of easer in Section 5 suggest that this constraint might be more effective than using hidden layers with limited capacity as a means of forcing the model to generalize well to unseen data.
Despite the simplicity of easer, we observed that easer obtained considerably better ranking accuracy than any of the competing models on most of the data sets. This remarkable empirical result is discussed in detail in the following.
Comparison to Slim: Table 1 shows that easer achieved notably increased accuracy compared to Slim on all the data sets. This suggests that dropping the L1-norm regularization as well as the non-negativity constraint on the learned weights is beneficial. Our analysis indicates that the latter is especially important: as illustrated in Figure 2 on the Netflix data (the histograms for the ML-20M and MSD data look almost identical up to re-scaling, and are omitted), the learned weights in easer are distributed around 0. Interestingly, it turns out that about 60% of the learned weights are negative on all the data sets in our experiments (see both papers [13, 24] regarding the data sets). This indicates that it is crucial to learn not only the similarity (positive weights) between items but also their dissimilarity (negative weights). Moreover, when we simply set the negative weights to zero (see easer ≥ 0 in Table 1), which obviously is not the optimal non-negative solution, the resulting accuracy drops and is very close to that of Slim. Apart from that, note that easer ≥ 0 is still quite dense (40% positive weights) compared to Slim, which indirectly indicates that the sparsity of Slim (due to L1-norm regularization) did not noticeably improve the ranking accuracy of Slim in our experiments.
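The easer ≥ 0 variant described here amounts to simply clipping the learned weights at zero after training. A sketch, reusing the hypothetical `train_easer` helper from the earlier snippet:

```python
# easer >= 0: zero out the negative weights of a trained easer model.
# This is not the optimal non-negative solution; it is only a probe of
# how much the negative (dissimilarity) weights contribute to accuracy.
B = train_easer(X, lam=500.0)
B_nonneg = np.maximum(B, 0.0)
print(f"fraction of positive weights kept: {(B_nonneg > 0).mean():.2f}")
```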