The Architectural Implications of Facebook’s DNN-based Personalized Recommendation
Deep learning has become a cornerstone in many production-scale data center services. As web-based applications continue to expand globally, so does the amount of compute and storage resources devoted to deep learning training and inference [18], [32], [38]. Personalized recommendation is an important class of these services. Deep learning based recommendation systems are broadly used throughout industry to predict rankings for news feed posts and entertainment content [22], [26]. For instance, in 2018, McKinsey and Tech Emergence estimated that recommendation systems were responsible for driving up to 35% of Amazon’s revenue [17], [53], [58].
Figure 1 illustrates the fraction of AI inference cycles spent across recommendation models in a production data center. DNN-based personalized recommendation models comprise up to 79% of AI inference cycles in a production-scale data center. While potentially hundreds of recommendation models are used across the data center, we find that three recommendation model classes, RMC1, RMC2, and RMC3, consume up to 65% of AI inference cycles. These three types of recommendation models follow distinct recommendation model architectures that result in different performance characteristics and hardware resource requirements, and thus are the focus of this paper.
Personalized recommendation is the task of recommending new content to users based on their preferences [22], [26]. Estimates show that up to 75% of movies watched on Netflix and 60% of videos consumed on YouTube are based on suggestions from their recommendation systems [17], [53], [58].
Central to these services is the ability to accurately and efficiently rank content based on users’ preferences and previous interactions (e.g., clicks on social media posts, ratings, purchases). Building highly accurate personalized recommendation systems poses unique challenges, as user preferences and past interactions are represented as both dense and sparse features [25], [45].
For instance, in the case of ranking videos (e.g., Netflix, YouTube), there may be tens of thousands of videos that have been seen by millions of viewers. However, individual users interact with only a handful of videos, meaning interactions between users and videos are sparse. Sparse features typically represent categorical inputs. For example, for ranking videos, a categorical input may represent the type of device or a user’s preference for a genre of content [16]. Dense features (e.g., user age) represent continuous inputs. Categorical inputs are encoded as multi-hot vectors, where each 1 marks a positive interaction. Given the potentially large domain (millions) and small number of interactions (sparse), multi-hot vectors must first be transformed into real-valued dense vectors using embedding table operations. Sparse features not only make training more challenging but also require intrinsically different operations (e.g., embedding tables) which impose unique compute, storage capacity, and memory access pattern challenges.
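The embedding table operation described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper’s implementation: the table size, embedding dimension, and sum-pooling choice are assumptions made for clarity (production tables hold millions of rows, and pooling operators vary by model).

```python
import numpy as np

# Hypothetical sizes standing in for a production-scale table
# that may hold millions of categorical values.
NUM_CATEGORIES = 10   # e.g., number of content genres
EMBED_DIM = 4         # dimension of each dense embedding vector

rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((NUM_CATEGORIES, EMBED_DIM))

def embed_sparse_feature(category_ids):
    """Map a multi-hot sparse feature (a variable-length list of
    category IDs) to one dense vector: gather the corresponding
    embedding rows and pool them (sum pooling, as one common choice)."""
    return embedding_table[category_ids].sum(axis=0)

# A user interacted with categories 2 and 7 out of 10 possible values.
# The multi-hot vector [0,0,1,0,0,0,0,1,0,0] is stored sparsely as [2, 7].
dense_vec = embed_sparse_feature([2, 7])
assert dense_vec.shape == (EMBED_DIM,)
```

Because only the rows named by the sparse IDs are touched, the operation is a sparse, irregular gather over a large table followed by a small reduction, which is the source of the distinctive memory access pattern challenges noted above.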