Knowledge Retrieval and RAG Recommender Systems

Can we distill LLM knowledge into graphs for real-time recommendations?

E-commerce needs recommendations within a budget of tens of milliseconds, but LLM inference takes seconds. Can we extract LLM insights offline into a knowledge graph that serves requests in production without sacrificing quality or explainability?

Note · 2026-05-03 · sourced from Recommenders Architectures
What breaks when specialized AI models reach real users? How should retrieval and reasoning integrate in RAG systems?

E-commerce recommendation has tight latency constraints, typically tens of milliseconds per request, so calling an LLM at request time is unacceptable. But LLMs hold world knowledge that is expensive to extract from interaction data alone. For example, the relation "carnations are the traditional flower for Mother's Day gifts" is hard to mine from clickstream data because customers don't explicitly say "I'm buying this for my mother," yet an LLM trained on web text knows it directly.

LLM-PKG bridges the latency gap by distilling LLM knowledge offline into a product knowledge graph (PKG). At ingestion time, the LLM is given curated prompts about products, its responses are mapped to enterprise products, and the resulting relations populate the graph. At query time, the recommender uses the graph rather than the LLM — sub-millisecond traversal instead of seconds-long generation.
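A minimal sketch of that split, assuming hypothetical names throughout (ProductKG, distill, the prompt wording, and the catalog-matching step are all illustrative, not LLM-PKG's actual API): the offline job turns LLM completions into candidate triples resolved against the catalog, and the online path is only a graph lookup.

```python
from collections import defaultdict

class ProductKG:
    """Adjacency-list product knowledge graph: product_id -> [(relation, product_id)]."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add(self, head, relation, tail):
        self.edges[head].append((relation, tail))

    def recommend(self, product_id, k=10):
        # Online path: a dictionary lookup, not an LLM call.
        return [tail for _, tail in self.edges[product_id][:k]]


def distill(llm, products, catalog_index):
    """Offline job: prompt the LLM per product, map free-text answers
    back to catalog IDs, and return candidate triples."""
    triples = []
    for pid, title in products.items():
        answer = llm(f"What products complement '{title}' and why?")
        for mention in answer.split(","):
            tail = catalog_index.get(mention.strip().lower())
            if tail is not None and tail != pid:  # keep only resolvable products
                triples.append((pid, "complements", tail))
    return triples


# Stub standing in for the real offline LLM.
def fake_llm(prompt):
    return "carnation bouquet, greeting card"

products = {"p1": "Mother's Day gift box"}
catalog_index = {"carnation bouquet": "p2", "greeting card": "p3"}

kg = ProductKG()
for head, rel, tail in distill(fake_llm, products, catalog_index):
    kg.add(head, rel, tail)

print(kg.recommend("p1"))  # -> ['p2', 'p3'], served without touching the LLM
```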

The hallucination risk is real and is treated as the central problem: LLMs invent relations that don't exist. The mitigation is rigorous evaluation and pruning before populating the graph. The graph is the safety boundary — only relations passing evaluation make it in.
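One way the evaluate-then-populate gate could look; the scoring function and the 0.9 threshold below are illustrative assumptions (the real evaluator could be an NLI model, an LLM-as-judge pass, or human annotation), not the paper's actual protocol.

```python
def prune_triples(triples, score_fn, threshold=0.9):
    """Admit a triple into the graph only if the offline evaluator
    rates it above the threshold; everything else is discarded."""
    kept, dropped = [], []
    for triple in triples:
        (kept if score_fn(triple) >= threshold else dropped).append(triple)
    return kept, dropped

# Hypothetical evaluator stub that trusts only known-good relations.
GOLD = {("p1", "complements", "p2")}

kept, dropped = prune_triples(
    [("p1", "complements", "p2"), ("p1", "complements", "p9")],
    score_fn=lambda t: 1.0 if t in GOLD else 0.0,
)
# Only `kept` ever reaches the production graph; `dropped` is the
# hallucination surface that never crosses the safety boundary.
print(kept)
```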

The architecture pattern generalizes beyond e-commerce: when an LLM has knowledge a downstream system needs but the system can't tolerate LLM latency, distill the knowledge offline into a static structure (graph, table, embedding store). The LLM operates as an offline knowledge extractor; the production system operates on the extracted artifact. This decouples knowledge breadth (which the LLM provides) from inference latency (which the structure provides). The trade-off is staleness: the graph reflects the LLM's knowledge at extraction time, not later. For slowly changing domains, that trade-off is favorable.
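Staleness can be managed by versioning the extracted artifact and re-running the offline job on a schedule. A minimal sketch, where the field names and the 30-day refresh window are assumptions: the serving side only checks artifact age and enqueues a re-extraction, never calling the LLM inline.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)  # assumed refresh cadence for a slowly changing domain

class KnowledgeArtifact:
    """A static structure (graph, table, embedding store) produced offline."""

    def __init__(self, data, extracted_at):
        self.data = data
        self.extracted_at = extracted_at

    def is_stale(self, now=None):
        now = now or datetime.now(timezone.utc)
        return now - self.extracted_at > MAX_AGE

artifact = KnowledgeArtifact(
    data={"p1": ["p2", "p3"]},
    extracted_at=datetime(2026, 4, 1, tzinfo=timezone.utc),
)

# Serving never blocks on the LLM; staleness only triggers offline work.
if artifact.is_stale():
    print("schedule offline re-extraction")  # enqueue the distillation job
```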


Source: Recommenders Architectures

Original note title

LLM-distilled product knowledge graphs offer real-time-feasible explainable recommendations — direct LLM calls are too latency-bound for production e-commerce