Recommender Systems

Can recommendation metrics train language models directly?

Explores whether LLMs can be optimized through closed-loop reinforcement learning using real recommendation system outputs as rewards, rather than relying on expensive proprietary model distillation.

Note · 2026-05-18 · sourced from Recommenders Conversational

Most existing approaches that combine LLMs with recommendation systems treat the two as disjoint components. The LLM generates something — a query rewrite, a candidate list, a justification — and a downstream recommendation system consumes it. There is no closed feedback loop between LLM generation and recommendation performance. As a result, LLMs are typically optimized using proxy objectives (predicting GPT-4 outputs via SFT, matching synthetic preferences) rather than being trained on the actual goal: improving recommendation quality.

Rec-R1 changes this by making the recommendation system itself the reward source for RL training. The LLM generates a textual output (a rewritten query, a candidate list, or an extracted profile). The recommendation model consumes it and returns a rule-based performance metric such as NDCG, Recall, or whichever ranking measure the deployment targets. That metric is converted into a reward signal, and the LLM is optimized via RL to maximize it.
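To make the metric-to-reward step concrete, here is a minimal sketch of the two metrics named above, computed from a ranked list and a set of relevant item ids. It assumes binary relevance and a ranking without duplicate ids. The function names, and the choice to use the raw metric value as the reward, are illustrative assumptions rather than details taken from Rec-R1.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so the rank is pos + 1
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places every relevant item first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def reward_from_ranking(
    ranked: Sequence[str], relevant: Set[str], k: int = 10, metric: str = "ndcg"
) -> float:
    """Hypothetical reward: the chosen metric itself, which already lies in [0, 1]."""
    if metric == "ndcg":
        return ndcg_at_k(ranked, relevant, k)
    if metric == "recall":
        return recall_at_k(ranked, relevant, k)
    raise ValueError(f"unknown metric: {metric}")
```

Because both metrics are bounded in [0, 1], they can be passed to an RL algorithm without further scaling, although a real setup might shape or combine them differently.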

Two structural properties make this viable. First, the approach is model-agnostic: it integrates with sparse retrievers (BM25), dense retrievers, hybrid pipelines, or any architecture whose ranking quality is measurable. The recommender's internal structure is irrelevant; only its output metric matters. Second, it relies solely on black-box feedback: no gradients, no internal parameters, no model surgery. Because nothing inside the recommender has to change, the method can be layered on top of an existing production system.
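The black-box property can be illustrated with a toy loop in which the recommender is only ever called as a function from text to ranked item ids. This is a sketch under heavy simplifying assumptions, not Rec-R1's training procedure: the "LLM" is replaced by a categorical policy over three fixed candidate rewrites, the update is plain REINFORCE with a moving-average baseline, and the catalog, `keyword_recommender`, `recall_reward`, and `train_policy` are hypothetical names invented for the example.

```python
import math
import random
from typing import Callable, List, Sequence, Set

# Black-box recommender: text in, ranked item ids out. Nothing else is exposed.
Recommender = Callable[[str, int], List[str]]

# Toy catalog (hypothetical data, for illustration only).
CATALOG = {
    "p1": "wireless noise cancelling headphones",
    "p2": "wired studio headphones",
    "p3": "bluetooth speaker",
}


def keyword_recommender(query: str, k: int) -> List[str]:
    """Stand-in for any retriever: ranks items by word overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(
        CATALOG,
        key=lambda pid: (-len(words & set(CATALOG[pid].split())), pid),
    )
    return ranked[:k]


def recall_reward(
    recommender: Recommender, text: str, relevant: Set[str], k: int = 1
) -> float:
    """Recall@k computed purely from the recommender's returned ranking."""
    ranked = recommender(text, k)
    return sum(1 for pid in ranked if pid in relevant) / len(relevant)


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(
    candidates: Sequence[str],
    recommender: Recommender,
    relevant: Set[str],
    steps: int = 200,
    lr: float = 0.5,
    seed: int = 0,
) -> List[float]:
    """REINFORCE over a categorical policy; only the policy logits are updated."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    baseline = 0.0  # moving average of past rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(candidates)), weights=probs)[0]
        reward = recall_reward(recommender, candidates[i], relevant)
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        for j in range(len(logits)):
            # Gradient of log softmax probability of action i w.r.t. logit j.
            grad_logp = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * advantage * grad_logp
    return softmax(logits)


rewrites = ["headphones", "wired studio headphones", "bluetooth speaker"]
final_probs = train_policy(rewrites, keyword_recommender, relevant={"p2"})
```

Swapping `keyword_recommender` for a BM25 or dense retriever would require no change to `train_policy`, since the loop only reads the ranking that comes back.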

The practical consequence: the dependence on SFT data distilled from proprietary models disappears. Previous LLM-for-recommendation systems built their SFT data by querying GPT-4 or similar proprietary models for target outputs. That process is expensive and brittle, and it ties the result to the proprietary model's quality. Rec-R1 eliminates the SFT step entirely: the generative model is optimized directly through interactions with the recommendation system it serves.

The pattern generalizes beyond recommendation. In any deployment where a downstream system produces a measurable performance metric, that metric can serve as the reward for upstream LLM generation. The closed-loop RL architecture is broader than its first application.

Original note title

recommendation systems can serve as black-box RL reward sources for LLM generation — closed-loop RL with NDCG and Recall metrics replaces SFT from proprietary distillation