HyperBandit: Contextual Bandit with Hypernetwork for Time-Varying User Preferences in Streaming Recommendation

Paper · arXiv 2308.08497 · Published August 14, 2023
Recommenders Architectures

“While the demand for personalized recommendations has grown with the rise of online platforms and user-generated content, recommendation models must be updated frequently and integrated with online recommender systems to perform well in real time. This makes streaming recommendation a highly active research area, aimed at continuously updating the model from users’ latest interactions with the platform and delivering relevant, timely suggestions to users [6, 7, 19, 34, 35, 40].

Nonetheless, streaming recommendation faces a significant challenge: the phenomenon of time-varying user preferences [9]. Users’ preferences change dynamically over time due to factors such as seasonality, holidays, or circadian rhythm. As illustrated in Fig. 1, users tend to check in at places such as “Office” and “Coffee Shop” on weekday mornings, but at places like “Gym / Fitness Center” and “Church” on weekend mornings, demonstrating a weekly periodicity. In contrast to morning preferences, users tend to visit bars and spend time at home during evening hours regardless of whether it is a weekday or weekend, indicating a daily periodicity. Another interesting example, from short-video recommendation, is that users tend to watch cartoons specifically on weekends while preferring other types of content on weekdays. These recurring patterns highlight the importance of modeling time-varying user preferences to avoid sub-optimal recommendations. Consequently, devising effective and efficient approaches to handle users’ periodic, time-varying preferences is critical for high-quality streaming recommendation.

As a classic framework for online learning, multi-armed bandit (MAB) algorithms have gained significant attention in recent years. A variant of MAB known as contextual bandits [21, 22, 36] has achieved considerable success in various online services by exploiting both user feedback and contextual information about users and items, which makes it particularly well suited to streaming recommendation. Most existing contextual bandit algorithms, however, assume a stationary environment, i.e., that users’ preferences remain static over time [1, 15, 22]. In reality, the environment is typically non-stationary, with user preferences varying over time. Some studies have recognized this problem and relaxed the assumption to a piecewise-stationary environment [36, 37], enabling algorithms to adaptively detect change points in user preferences and discard learned model parameters for relearning. These approaches can suffer performance fluctuations when user preferences change periodically: they fail to recognize the periodic structure of user preferences in an online manner and retrain the model even when the current period has occurred in the past. Currently, there is a notable research gap in streaming recommendation under periodic environments.
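To make the stationary baseline concrete, the following is a minimal sketch of a disjoint LinUCB-style contextual bandit, the kind of stationary policy the paragraph above refers to. It is an illustration of the general technique, not the paper’s algorithm; the class and parameter names are my own.

```python
import numpy as np

class LinUCB:
    """Minimal disjoint LinUCB sketch (illustrative; not HyperBandit).

    Assumes a stationary environment: one ridge-regression reward model
    per arm, plus an upper-confidence exploration bonus scaled by alpha.
    """

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        # A: d x d regularized covariance per arm; b: reward-weighted features.
        self.A = [np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, contexts):
        """contexts: one d-dim feature vector per arm; returns the chosen arm."""
        scores = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                    # ridge estimate of arm a
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # exploration term
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Fold the observed reward for the played arm back into its model."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

Because the per-arm statistics only accumulate, such a policy cannot forget or re-weight old interactions, which is exactly why it degrades when preferences drift or recur periodically.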

In this paper, we focus on a realistic environment setting where the reward function (i.e., the generation mechanism of user feedback) exhibits periodicity over time. Specifically, a large time period can be divided into multiple smaller periods in a periodic manner (e.g., based on the specific day of the week and different time slots within a day), and the reward function demonstrates a similar distribution whenever the same time period is encountered. Moreover, these time periods can be observed by the model and utilized for periodicity modeling and online adjustment of its user preference module in various streaming recommendation scenarios.
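Since the time period is observable to the model, it can be encoded directly as a feature. Below is a minimal sketch of such an encoding, assuming periods defined by weekday and a within-day time slot; the featurization and names are assumptions for illustration, not the paper’s exact design.

```python
import numpy as np

def encode_time_period(weekday, slot, n_slots=4):
    """One-hot encoding of an observable time period (illustrative).

    weekday: 0..6; slot: index of a within-day time slot (e.g. morning,
    afternoon, evening, night). Recurring periods (same weekday + slot)
    map to the same vector, so a model conditioned on this encoding can
    reuse what it learned the last time the same period occurred.
    """
    v = np.zeros(7 + n_slots)
    v[weekday] = 1.0         # day-of-week component (weekly periodicity)
    v[7 + slot] = 1.0        # time-of-day component (daily periodicity)
    return v
```

The key property is determinism: whenever the same (weekday, slot) pair recurs, the model sees the identical input, which is what allows periodicity to be captured rather than relearned.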

To address this setting, we propose a novel contextual bandit algorithm called HyperBandit, which consists of two levels of model structure: (1) a bandit policy that learns the latent features of items in an online fashion and combines them with the user preference matrix to perform online recommendation with effective exploration; and (2) a hypernetwork that takes the time-period information as input and generates the parameters of the user preference matrix in the bandit policy. This hypernetwork captures the periodicity of user preferences over time and enables efficient online updating through low-rank factorization.”
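The hypernetwork idea can be sketched as follows: a mapping from the time-period embedding to the two factors of a low-rank user preference matrix. This is a simplified linear sketch under my own naming; the paper’s hypernetwork may be deeper, and the weight shapes here are assumptions chosen only to show why low-rank factorization keeps the generated parameter count small.

```python
import numpy as np

def hyper_preference_matrix(period_vec, W_u, W_v, d_user, d_item, rank):
    """Sketch: a linear hypernetwork maps a p-dim time-period embedding
    to the factors of a low-rank preference matrix (illustrative only).

    W_u: (d_user * rank, p) produces U with shape (d_user, rank);
    W_v: (d_item * rank, p) produces V with shape (d_item, rank).
    The preference matrix M = U @ V.T has shape (d_user, d_item) but the
    hypernetwork only emits O((d_user + d_item) * rank) numbers per period,
    instead of O(d_user * d_item), which is what makes frequent online
    updates cheap.
    """
    U = (W_u @ period_vec).reshape(d_user, rank)
    V = (W_v @ period_vec).reshape(d_item, rank)
    return U @ V.T
```

Because recurring periods produce the same `period_vec`, the generated preference matrix is automatically reused whenever a period recurs, with no retraining from scratch.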