Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning

Paper · arXiv 2503.24289
Conversational Recommenders · Reinforcement Learning · Personalization (General)

We propose Rec-R1, a general reinforcement learning framework that bridges large language models (LLMs) with recommendation systems through closed-loop optimization. Unlike prompting and supervised fine-tuning (SFT), Rec-R1 directly optimizes LLM generation using feedback from a fixed, black-box recommendation model, without relying on synthetic SFT data from proprietary models such as GPT-4o. This avoids the substantial cost and effort required for data distillation. Most existing approaches still treat LLMs and recommendation models as disjoint components, with no closed feedback loop between LLM generation and recommendation performance. As a result, LLMs are typically optimized using proxy objectives, which are often inconsistent with the ultimate goal of improving recommendation quality, rather than being trained directly on feedback from the recommendation system.

Rec-R1 enables LLMs to learn directly from recommendation feedback—such as retrieval or ranking metrics—thereby aligning the generation process with the ultimate goal of improving recommendation quality. Specifically, given any recommendation-relevant input, such as a user query or behavioral history, an LLM generates a textual output that is consumed by a downstream recommendation model. The recommendation system then evaluates the quality of the LLM-generated text using rule-based performance metrics (e.g., NDCG, Recall), which are transformed into reward signals for optimizing the LLM via RL. Through repeated interaction with the recommendation system, the LLM gradually learns to generate inputs that are better aligned with the system’s objectives, thereby improving recommendation performance without relying on suboptimal intermediate supervision.
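The step from ranking metric to reward signal described above can be illustrated with a minimal sketch. It assumes binary relevance labels and a ranked list without duplicates; the names `ndcg_at_k`, `recall_at_k`, `reward_for_generation`, and the `retrieve` callable are hypothetical stand-ins chosen for illustration, not the paper's implementation.

```python
import math
from typing import Callable, Sequence, Set


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG.

    Assumes ranked_ids contains no duplicate item ids.
    """
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, item_id in enumerate(ranked_ids[:k])
        if item_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reward_for_generation(
    generated_text: str,
    retrieve: Callable[[str], Sequence[str]],  # hypothetical black-box recommender
    relevant_ids: Set[str],
    k: int = 10,
) -> float:
    """Score one LLM output by the ranking quality it produces downstream."""
    ranked_ids = retrieve(generated_text)
    return ndcg_at_k(ranked_ids, relevant_ids, k)
```

In this sketch the reward is simply the metric value; an RL algorithm of choice would then use it to update the LLM, and swapping `ndcg_at_k` for `recall_at_k` changes the objective without touching the recommender.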

Rec-R1 is model-agnostic and task-flexible: it can be integrated with a wide range of recommendation architectures—including sparse retrievers (e.g., BM25), dense discriminative models, and hybrid pipelines—without requiring any modifications to their internal structures. It also supports diverse generation tasks as long as the generated text can be consumed by the downstream recommendation system. Moreover, since Rec-R1 relies solely on black-box feedback in the form of recommendation performance metrics, it does not require access to model gradients or internal parameters, making it easy to deploy on top of existing production systems. It also eliminates the need for constructing SFT data, allowing the generative model to be optimized directly through interactions.
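To make the black-box property concrete, the following sketch scores candidate generations using nothing but the ranked list a recommender returns. Everything here is an assumption for illustration: the `Recommender` protocol, the `ToyKeywordRetriever` (a simple word-overlap ranker standing in for a real sparse retriever such as BM25), and the helper names are hypothetical and not part of Rec-R1.

```python
from typing import Dict, List, Protocol, Set


class Recommender(Protocol):
    """Hypothetical black-box interface: text in, ranked item ids out."""

    def recommend(self, text: str, k: int) -> List[str]: ...


class ToyKeywordRetriever:
    """Ranks items by word overlap with the query (illustrative stand-in only)."""

    def __init__(self, catalog: Dict[str, str]) -> None:
        self._docs = {
            item_id: set(text.lower().split()) for item_id, text in catalog.items()
        }

    def recommend(self, text: str, k: int) -> List[str]:
        query_terms = set(text.lower().split())
        scored = [
            (len(query_terms & terms), item_id)
            for item_id, terms in self._docs.items()
        ]
        # Highest overlap first; ties broken by item id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [item_id for overlap, item_id in scored[:k] if overlap > 0]


def recall_reward(
    recommender: Recommender, text: str, relevant: Set[str], k: int = 5
) -> float:
    """Recall@k computed purely from the recommender's output."""
    if not relevant:
        return 0.0
    top_k = recommender.recommend(text, k)
    return len(set(top_k) & relevant) / len(relevant)


def score_candidates(
    recommender: Recommender, candidates: List[str], relevant: Set[str], k: int = 5
) -> List[float]:
    """One reward per candidate generation; no gradients or internals needed."""
    return [recall_reward(recommender, text, relevant, k) for text in candidates]


catalog = {
    "p1": "waterproof trail running shoes",
    "p2": "leather office shoes",
    "p3": "cotton t-shirt",
}
retriever = ToyKeywordRetriever(catalog)
rewards = score_candidates(
    retriever,
    ["something for jogging in the rain", "waterproof trail running shoes"],
    relevant={"p1"},
    k=1,
)
print(rewards)  # [0.0, 1.0]
```

Because the scoring code depends only on the `recommend` method, any system exposing a comparable call could be substituted for the toy retriever without changes to the reward logic.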

LLMs Can Learn to Recommend Without Access to the Item Space. In our product search experiments, Rec-R1 operates without any access to the downstream item catalog: it only receives the user query and generates a rewritten query, without knowing which products exist in the recommender’s database. Despite this apparent limitation, Rec-R1 consistently delivers strong performance across domains. This aligns surprisingly well with human behavior: when people search for products, they rarely know the exact contents of a platform’s inventory. Instead, they refine their queries iteratively based on vague goals and system feedback. Rec-R1, trained in a closed loop with the recommender, learns this refinement process efficiently via reinforcement learning.

A core advantage of Rec-R1 lies in its ability to leverage feedback signals from recommendation systems. In practice, maintaining an up-to-date stream of logs allows Rec-R1 to stay aligned with evolving user preferences and content trends. Moreover, Rec-R1 is fully compatible with real-time feedback: it can be trained via online interactions with a live recommendation engine, where the LLM receives immediate performance signals (e.g., engagement rates or conversions). This makes Rec-R1 a flexible framework capable of serving as a foundation for LLM-based recommendation systems that evolve with real-world usage.
