InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
Recommendation systems now underpin many essential components of the web ecosystem, including search result ranking, e-commerce product placement, and media suggestions in streaming services. Over the last several years, many of these services have begun to employ deep learning (DL) models in their recommendation infrastructure to better exploit historical patterns in their data. In turn, DL-based product recommendation has quickly become one of the most commercially significant applications of DL. Companies have begun to invest heavily in DL recommendation infrastructure, often maintaining entire datacenters and super-clusters for the sole purpose of recommender model training [1]. But in many cases, these infrastructural investments have run into critical hurdles [7]. Practitioners and cluster administrators are discovering that the training optimization challenges posed by DL recommender models differ significantly from those seen in historical practice with other DL model types [51]. In particular, recent studies of industry clusters have found that the unique design of recommender model architectures has left training pipelines susceptible to inefficiencies in data ingestion [54].
Most DL architectures are dominated by high-intensity matrix operators, and standard tooling for DL training optimization has evolved to support models that fit this pattern [9, 12, 14, 19, 29, 42, 44]. In such cases, model execution usually dominates training times to such a degree that data ingestion procedures (e.g., disk loading, shuffling) can be overlapped with and hidden underneath the matrix operation times. Unfortunately, DL-based recommender models (DLRMs) are atypical in this regard. Recommender datasets are generally composed of both sparse (categorical) and dense (continuous) features, and joining information across features requires transforming these two representations into a common format. To this end, DLRM architectures use embedding tables to transform categorical inputs into dense embedding vectors through a hash-table lookup. These embeddings can then be combined with the continuous (dense) features and fed through a secondary DL model to produce user-item probability ratings [50]. Figure 1 illustrates a typical architecture.
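To make the overlap concrete, the following is a minimal sketch of how a training framework can hide ingestion beneath compute, assuming a PyTorch-style pipeline; the dataset, model, and parameter values are illustrative assumptions, not drawn from the paper:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class SyntheticClicks(Dataset):
    """Stand-in dataset; a real pipeline would read and shuffle from disk."""
    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        x = torch.randn(16)                   # dense features
        y = torch.randint(0, 2, ()).float()   # click label
        return x, y

if __name__ == "__main__":
    loader = DataLoader(
        SyntheticClicks(),
        batch_size=256,
        num_workers=4,      # ingestion runs in parallel worker processes
        prefetch_factor=2,  # each worker stages batches ahead of training
        pin_memory=True,    # enables faster asynchronous host-to-GPU copies
    )
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.BCEWithLogitsLoss()

    # While the forward/backward pass runs, workers prepare the next
    # batches, so ingestion latency stays hidden under compute time.
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x).squeeze(-1), y)
        loss.backward()
        opt.step()
```

This trick only works when the model's compute per batch is large relative to the ingestion work, which is exactly the assumption DLRMs violate.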
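Similarly, the DLRM pattern just described can be sketched in a few lines; the class name, table cardinalities, and dimensions below are assumptions for exposition, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    def __init__(self, cardinalities, num_dense, emb_dim=8):
        super().__init__()
        # One embedding table per categorical feature; an indexed lookup
        # plays the role of the hash-table access described in the text.
        self.tables = nn.ModuleList(
            nn.Embedding(card, emb_dim) for card in cardinalities
        )
        in_dim = num_dense + emb_dim * len(cardinalities)
        # Secondary DL model that combines dense and embedded features.
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, dense, sparse):
        # sparse: (batch, num_categorical) integer ids
        embs = [table(sparse[:, i]) for i, table in enumerate(self.tables)]
        combined = torch.cat([dense] + embs, dim=1)
        return torch.sigmoid(self.mlp(combined))  # user-item probability

model = TinyDLRM(cardinalities=[1000, 500], num_dense=4)
dense = torch.randn(32, 4)
sparse = torch.stack([torch.randint(0, 1000, (32,)),
                      torch.randint(0, 500, (32,))], dim=1)
print(model(dense, sparse).shape)  # torch.Size([32, 1])
```

Note that the matrix work per sample in such a model is small compared to a convolutional or transformer architecture, leaving the data ingestion steps with far less compute to hide under.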