Can we distill LLM knowledge into graphs for real-time recommendations?
E-commerce recommendation budgets are tens of milliseconds per request, but LLM inference takes seconds. Can we extract LLM insights offline into a knowledge graph that serves production requests without sacrificing quality or explainability?
E-commerce recommendation has tight latency constraints, typically tens of milliseconds per request, so calling an LLM at request time is unacceptable for these systems. But LLMs carry world knowledge that is expensive to extract from interaction data alone. For example, the relation "carnations are the official flower of Mother's Day" is hard to mine from clickstream data because customers don't explicitly say "I'm buying this for my mother." An LLM trained on web text knows this relation directly.
LLM-PKG bridges the latency gap by distilling LLM knowledge offline into a product knowledge graph (PKG). At ingestion time, the LLM is given curated prompts about products, its responses are mapped to enterprise products, and the resulting relations populate the graph. At query time, the recommender uses the graph rather than the LLM — sub-millisecond traversal instead of seconds-long generation.
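To make the ingestion phase concrete, here is a minimal sketch of the offline extraction loop. All names are assumptions for illustration: `Triple`, `ask_llm`, `resolve_entity`, and the prompt wording are hypothetical, since the note doesn't specify LLM-PKG's actual prompts, schema, or entity-mapping logic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str  # catalog product ID
    rel: str   # relation type, e.g. "is_gift_for"
    tail: str  # target entity, e.g. "mothers_day"

def extract_triples(product_id, product_title, ask_llm, resolve_entity):
    """Offline step: prompt the LLM about one product, then map its
    free-text answer back onto catalog/taxonomy entities. The LLM call
    may take seconds; that cost is paid once, at ingestion time."""
    prompt = f"List occasions for which '{product_title}' is a typical gift."
    raw = ask_llm(prompt)
    triples = []
    for mention in raw.split(","):                # naive answer parsing
        entity = resolve_entity(mention.strip())  # None if unmappable
        if entity is not None:                    # drop unmapped mentions
            triples.append(Triple(product_id, "is_gift_for", entity))
    return triples

# Toy run with stubbed dependencies:
fake_llm = lambda _prompt: "Mother's Day, Valentine's Day"
taxonomy = {"mother's day": "mothers_day", "valentine's day": "valentines_day"}
resolve = lambda text: taxonomy.get(text.lower())
candidates = extract_triples("sku-123", "carnation bouquet", fake_llm, resolve)
# -> [Triple('sku-123', 'is_gift_for', 'mothers_day'),
#     Triple('sku-123', 'is_gift_for', 'valentines_day')]
```

The expensive LLM call happens once per product offline; nothing in this loop runs at request time.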
The hallucination risk is real and is treated as the central problem: LLMs invent relations that don't exist. The mitigation is rigorous evaluation and pruning before populating the graph. The graph is the safety boundary — only relations passing evaluation make it in.
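A sketch of that gate, continuing from the `candidates` above. The evaluation criterion is injected as a function because the note doesn't say how LLM-PKG scores relations; human review, a verifier model, or corpus evidence are all plausible. `prune_triples` and the stub scorer are hypothetical.

```python
def prune_triples(candidates, score_fn, threshold=0.9):
    """Gate between LLM output and the production graph: only triples
    whose evaluation score clears the threshold are admitted.
    Rejected triples never reach the serving path."""
    return [t for t in candidates if score_fn(t) >= threshold]

# Example with a stub scorer standing in for the real evaluation step:
trusted = {("sku-123", "is_gift_for", "mothers_day")}
score = lambda t: 1.0 if (t.head, t.rel, t.tail) in trusted else 0.0
accepted = prune_triples(candidates, score)  # hallucinated triples drop out
```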
The architecture pattern generalizes beyond e-commerce: when an LLM has knowledge a downstream system needs but the system can't tolerate LLM latency, distill the knowledge offline into a static structure (graph, table, embedding store). The LLM operates as an offline knowledge extractor; the production system operates on the extracted artifact. This decouples knowledge breadth (supplied by the LLM) from inference latency (bounded by the structure). The trade-off is staleness: the graph reflects what the LLM knew at extraction time, not later. For slowly changing domains, that trade-off is favorable.
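Completing the sketch, the serving-side artifact can be as simple as a precomputed adjacency index over the accepted triples. `ServingGraph` is a hypothetical stand-in for whatever store a production system would actually use.

```python
from collections import defaultdict

class ServingGraph:
    """Request-path artifact built offline from the accepted triples
    (the Triple type from the sketch above). Each lookup is a dict
    probe, so serving cost is independent of LLM inference time."""
    def __init__(self, triples):
        self._index = defaultdict(list)
        for t in triples:
            self._index[(t.head, t.rel)].append(t.tail)

    def related(self, product_id, rel):
        # O(1) average-case lookup; no model runs on the request path.
        return self._index.get((product_id, rel), [])

graph = ServingGraph(accepted)           # built offline, after pruning
graph.related("sku-123", "is_gift_for")  # -> ["mothers_day"]
```

The design point: lookup cost depends only on the index, so the LLM's latency never appears on the request path, and freshness is bounded by the last offline rebuild.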
Source: Recommenders Architectures
Related concepts in this collection
- Can smaller models outperform their LLM teachers with enough data?
  Explores whether student models trained on expanded teacher-generated labels can exceed teacher performance in production ranking tasks, and what data scale makes this possible.
  extends: same offline-LLM-distillation-into-fast-runtime pattern, applied to KG construction rather than ranking
- Can graphs unify collaborative filtering and side information?
  How might merging user-item interactions with item attributes into a single graph structure allow recommendation systems to capture collaborative and attribute-based signals together, rather than separately?
  complements: KGAT is a KG-for-recommendation pattern using interaction-derived attributes; LLM-PKG uses LLM-derived attributes; same architectural family
- How can real-time recommendations stay responsive and reproducible?
  In-session signals improve ranking accuracy, but requiring fresh data during sessions forces real-time computation. This creates latency, network sensitivity, and debugging challenges that offset the relevance gains.
  exemplifies: offline distillation driven by latency constraints is the production-side response to the freshness-latency tradeoff
- Can community detection enable RAG systems to answer global corpus questions?
  Standard RAG struggles with corpus-wide questions that require understanding overall themes rather than retrieving specific passages. Can graph community detection overcome this limitation at scale?
  complements: GraphRAG distills LLM knowledge into a query-time graph; LLM-PKG distills it into a recommend-time graph; same offline-LLM-into-graph pattern for different downstream tasks
Original note title: LLM-distilled product knowledge graphs offer real-time-feasible explainable recommendations; direct LLM calls are too latency-bound for production e-commerce