Knowledge Distillation for Enhancing Walmart E-commerce Search Relevance Using Large Language Models

Paper · arXiv 2505.07105 · Published May 11, 2025
Tags: Recommenders · LLMs

While large language models (LLMs) offer superior ranking capabilities, it is challenging to deploy LLMs in real-time systems due to their high inference latency. To leverage the ranking power of LLMs while meeting the low-latency demands of production systems, we propose a novel framework that distills a high-performing LLM into a more efficient, low-latency student model. To help the student model learn more effectively from the teacher model, we first train the teacher LLM as a classification model with soft targets. Then, we train the student model to capture the relevance margin between pairs of products for a given query using a mean squared error loss. Instead of using the same training data as the teacher model, we significantly expand the student model's dataset by generating unlabeled data and labeling it with the teacher model's predictions. Experimental results show that the student model's performance continues to improve as the size of the augmented training data increases; in fact, with enough augmented data, the student model can outperform the teacher model. The student model has been successfully deployed in production at Walmart.com with significantly positive metrics.
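
A minimal PyTorch sketch of the pairwise margin objective described above; the function and tensor names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """Train the student to reproduce the teacher's relevance margin
    between two products retrieved for the same query."""
    return F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)

# Toy usage: scores for a batch of (query, product+, product-) triples.
s_pos, s_neg = torch.tensor([0.9, 0.4]), torch.tensor([0.3, 0.5])
t_pos, t_neg = torch.tensor([0.8, 0.6]), torch.tensor([0.2, 0.1])
loss = margin_mse_loss(s_pos, s_neg, t_pos, t_neg)
```

Matching margins rather than absolute scores lets the student learn the teacher's relative ordering of products without being forced onto the teacher's score scale.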

A product search ranking engine typically comprises two main stages: a retrieval stage, in which a set of relevant product candidates is retrieved from the product catalog to form the recall set, and a re-rank stage, in which the candidates obtained from the previous stage are re-ranked to form the product list shown to customers.
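
A hypothetical sketch of this two-stage flow; `retriever`, `reranker`, and the set sizes are illustrative placeholders, not interfaces from the paper.

```python
def rank_products(query, catalog, retriever, reranker,
                  recall_size=200, page_size=40):
    # Stage 1: retrieval -- scan the full catalog under tight latency
    # constraints to form the recall set.
    recall_set = retriever.top_k(query, catalog, k=recall_size)

    # Stage 2: re-rank -- a heavier, more accurate model scores only
    # the recall set, so it can tolerate higher per-item latency.
    scored = [(item, reranker.score(query, item)) for item in recall_set]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in scored[:page_size]]
```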

The embedding-based two-tower model has become an important approach in the retrieval stage of e-commerce search engines, especially the bi-encoder, which uses transformer-encoded representations. Bi-encoder models learn embedding representations for search queries and products from training data. At retrieval time, semantically similar items are retrieved based on simple similarity metrics, such as the cosine similarity between the query and the item. Since the query and document embeddings are learned through separate models, this approach allows the embeddings to be precomputed offline with minimal computational overhead and low latency [1, 37]. In contrast, the cross-encoder model takes the query and item text as input and directly outputs a prediction. Although bi-encoder approaches are typically much faster, they are generally less effective than cross-encoder methods [31, 33]. By concatenating the query and item as input, the cross-encoder can benefit from the attention mechanism across all tokens in the input, thus capturing interactions between the query and item text. The trade-off with bi-encoders is therefore sacrificing effectiveness for large latency gains. The retrieval stage has strict latency requirements because it must search through millions of products in the product catalog. The downstream re-rank stage, however, only scores the products retrieved during the retrieval stage, so it can afford more relaxed latency requirements.
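
A minimal sketch of the two scoring patterns using Hugging Face Transformers; `bert-base-uncased` is a stand-in checkpoint, and the cross-encoder's regression head here is randomly initialized, so in practice both models would be fine-tuned on relevance data.

```python
import torch
from transformers import (AutoModel, AutoModelForSequenceClassification,
                          AutoTokenizer)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Bi-encoder: query and item are encoded independently, so item
# embeddings can be precomputed offline; scoring is a cheap cosine.
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    return hidden[:, 0]  # [CLS] token pooling

def bi_encoder_score(query, item):
    return torch.cosine_similarity(embed(query), embed(item)).item()

# Cross-encoder: query and item are concatenated into one input, so
# attention spans both texts; more accurate, but nothing is precomputable.
cross = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)

def cross_encoder_score(query, item):
    inputs = tok(query, item, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return cross(**inputs).logits.squeeze().item()
```

The bi-encoder's item embeddings can be indexed ahead of time, which is what makes it viable over a catalog of millions of products, while the cross-encoder must run a full forward pass per query-item pair.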

Our previous production model was the first ranking model to include a cross-encoder (XE) feature, which better captures the relevance of items to search queries. In this work, we present our enhanced BERT-Base cross-encoder model, optimized through knowledge distillation from LLMs on an augmented large-scale training dataset.
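
A sketch of the teacher-labeling step that builds the augmented training set for this distillation; `teacher.predict` is a hypothetical interface standing in for the teacher LLM's scoring call.

```python
def augment_with_teacher(unlabeled_pairs, teacher, batch_size=256):
    """Expand the student's training data by labeling generated
    (query, product) pairs with the teacher's soft predictions."""
    augmented = []
    for start in range(0, len(unlabeled_pairs), batch_size):
        batch = unlabeled_pairs[start:start + batch_size]
        soft_labels = teacher.predict(batch)  # teacher relevance scores
        augmented.extend(zip(batch, soft_labels))
    return augmented
```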