Dense Retrieval Adaptation using Target Domain Description

Paper · arXiv 2307.02740 · Published July 6, 2023
RAG · Domain Specialization

“This paper introduces a new category of domain adaptation in IR that is as-yet unexplored. Here, similar to the zero-shot setting, we assume the retrieval model does not have access to the target document collection. In contrast, it does have access to a brief textual description that explains the target domain. We define a taxonomy of domain attributes in retrieval tasks to understand different properties of a source domain that can be adapted to a target domain. We introduce a novel automatic data construction pipeline that produces a synthetic document collection, query set, and pseudo relevance labels, given a textual domain description. Extensive experiments on five diverse target domains show that adapting dense retrieval models using the constructed synthetic data leads to effective retrieval performance on the target domain.”

“In this work we introduce a new category of domain adaptation methods for neural information retrieval, which we refer to as ‘domain adaptation with description.’ Studying this problem is not only interesting from an academic perspective, but also has potential applications in several real-world scenarios, where the target collection and its relevance labels are not available at training time. For example, these may not be available yet or at all or, even if they were, target domain owners may be hesitant to provide them for various reasons such as legal restrictions. There are also applications with privacy concerns, for instance in the case of medical records or where the data contains personally identifiable information. Another example can be found when a competitive advantage is involved, as potential use of the data may benefit competitors.”

We propose a taxonomy for the task and analyze the attributes along which a domain can be adapted. We differentiate our task from similar studies conducted in recent years and explain the limitations of existing approaches. To address these limitations, we propose a novel pipeline that uses the domain description to construct a synthetic target collection and to generate queries and pseudo relevance labels, which are then used to adapt an initial ranking model trained on a source domain. Our approach uses state-of-the-art instruction-based language models to extract the properties of the target domain from its textual description. We show that a retrieval-augmented approach to domain description understanding can effectively identify various properties of each target domain, including the topic of its documents, their linguistic attributes, and their source. The extracted properties are used to generate a seed document with generative language models, and an iterative retrieval process then constructs a synthetic target collection automatically.
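The iterative retrieval step described above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that a generic background pool of documents is available to retrieve from, replaces the language-model-generated seed document with a hand-written string, and uses bag-of-words cosine similarity in place of a dense retriever. The names `build_synthetic_collection`, `background_pool`, `rounds`, and `k` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words term counts (a stand-in for a dense encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_synthetic_collection(seed_doc, background_pool, rounds=2, k=2):
    """Grow a collection from a seed document by repeated retrieval.

    In each round, every document added in the previous round (the seed in
    round one) is used as a query against the pool, and its top-k unseen
    documents with nonzero similarity are added to the collection.
    """
    collection = []
    selected = set()       # indices of pool documents already added
    frontier = [seed_doc]  # documents used as queries in the current round
    for _ in range(rounds):
        next_frontier = []
        for query_doc in frontier:
            q = bow(query_doc)
            scored = sorted(
                ((cosine(q, bow(d)), i)
                 for i, d in enumerate(background_pool) if i not in selected),
                key=lambda pair: (-pair[0], pair[1]),
            )
            for score, i in scored[:k]:
                if score > 0:
                    selected.add(i)
                    collection.append(background_pool[i])
                    next_frontier.append(background_pool[i])
        frontier = next_frontier
    return collection


# Hypothetical background pool; in practice this would be a large generic corpus.
pool = [
    "clinical trial results for diabetes treatment",
    "football match report and league scores",
    "diabetes treatment guidelines for insulin dosage",
    "insulin dosage adjustment in elderly patients",
    "stock market prices fall on friday",
]
# Hand-written stand-in for a seed document generated from the domain description.
seed = "randomized clinical trial of a new diabetes drug"

for doc in build_synthetic_collection(seed, pool):
    print(doc)
```

In this toy run, the document on insulin dosage in elderly patients shares no terms with the seed and is reached only in the second round, through a document retrieved in the first. That is the reason for iterating rather than retrieving once. Query generation and pseudo relevance labeling over the resulting collection are separate steps and are not covered by this sketch.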

This paper introduced a new category of domain adaptation methods for neural information retrieval and proposed a pipeline that leverages target domain descriptions to construct a synthetic target collection, generate queries, and produce pseudo relevance labels. The results of experiments conducted on five diverse target collections demonstrated that our proposed approach outperforms existing dense retrieval baselines in such a domain adaptation scenario. This work holds the potential for practical applications where the target collection and its relevance labels are unavailable, while preserving privacy and complying with legal restrictions.