Efficient Nearest Neighbor Language Models

Paper · arXiv 2109.04212 · Published September 9, 2021

In this paper, we take the recently proposed k-nearest neighbors language model (Khandelwal et al., 2020) as an example, exploring methods to improve its efficiency along various dimensions. Experiments on the standard WikiText-103 benchmark and domain-adaptation datasets show that our methods achieve up to a 6x speed-up in inference while retaining comparable performance.

In contrast to purely parametric LMs, recent non-parametric LMs (Guu et al., 2018; Khandelwal et al., 2020; He et al., 2020) model text distributions by referencing both the parameters of the underlying model and examples from an external datastore. Non-parametric LMs are appealing because they allow for effective language modeling, particularly of rarer patterns, through explicit memorization in a datastore, which reduces the burden on the model parameters to encode all information from a large dataset. One effective and representative example is the k-nearest neighbors LM (kNN-LM; Khandelwal et al., 2020). The kNN-LM computes the probability of the next token by interpolating a parametric LM with a distribution calculated from the k nearest context-token pairs in the datastore, as demonstrated in Figure 2. This model is particularly notable for its large improvements in performance: it outperforms the previous best parametric LMs by a large margin on standard language modeling benchmarks, in domain adaptation settings, and on other conditional generation tasks such as machine translation (Khandelwal et al., 2021).
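
The interpolation step can be made concrete with a short sketch. The following minimal NumPy illustration of the kNN-LM combination rule retrieves the k nearest context vectors, turns their negative distances into weights via a softmax, scatters those weights onto the stored next tokens, and interpolates with the parametric LM's distribution. The function name, the brute-force distance computation, and the hyperparameters `k` and `lam` are illustrative choices, not the paper's implementation; in practice the kNN-LM uses an approximate nearest-neighbor index (e.g., FAISS) over a precomputed datastore.

```python
import numpy as np

def knn_lm_next_token_probs(
    query: np.ndarray,    # context representation f(x), shape (d,)
    keys: np.ndarray,     # datastore context vectors, shape (N, d)
    values: np.ndarray,   # datastore next-token ids, shape (N,)
    p_lm: np.ndarray,     # parametric LM distribution over the vocab, shape (V,)
    k: int = 8,           # number of retrieved neighbors (illustrative default)
    lam: float = 0.25,    # interpolation weight lambda (illustrative default)
) -> np.ndarray:
    """Interpolate a parametric LM with a kNN distribution from the datastore."""
    # Squared L2 distance from the query to every stored context vector
    # (a brute-force stand-in for an approximate nearest-neighbor index).
    dists = np.sum((keys - query) ** 2, axis=1)

    # Indices of the k nearest context-token pairs.
    nn = np.argpartition(dists, k)[:k]

    # Softmax over negative distances; shift by the minimum for stability.
    weights = np.exp(-(dists[nn] - dists[nn].min()))
    weights /= weights.sum()

    # Aggregate neighbor weights onto their stored next tokens to form p_kNN.
    p_knn = np.zeros_like(p_lm)
    np.add.at(p_knn, values[nn], weights)

    # Final distribution: lambda * p_kNN + (1 - lambda) * p_LM.
    return lam * p_knn + (1.0 - lam) * p_lm
```

The key design point this sketch highlights is that the datastore contributes probability mass only to tokens that actually follow similar contexts in the stored corpus, which is why the method helps most on rare patterns that the parametric model struggles to memorize.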