Can you adapt retrieval models without accessing target data?
Explores whether dense retrieval systems can adapt to new domains using only a textual description, rather than actual target documents—especially relevant for privacy-restricted or competitive scenarios.
Dense retrieval models require labeled query-document pairs to adapt to new domains. In many enterprise contexts, the target collection is unavailable: it may not exist yet, it may be legally restricted (medical records, financial data), or sharing it with a model provider would compromise competitive advantage.
The standard assumption — you need the data to train for the domain — turns out to be false for retrieval. A brief textual description of the target domain is sufficient.
The pipeline:
1. Provide a textual domain description.
2. Use an instruction-following LLM to extract domain properties: document topics, linguistic attributes, source characteristics, terminology patterns.
3. Generate seed documents matching those properties.
4. Iteratively retrieve real-domain-like documents, using each seed as a query anchor.
5. Generate synthetic queries for the constructed collection.
6. Fine-tune the retrieval model on the resulting pseudo-relevance labels.
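The steps above can be sketched as a minimal skeleton. All function bodies here are stubs with illustrative values, and every name is hypothetical, not from the source; in a real system, steps (2) to (5) would each prompt an instruction-following LLM and step (4) would run a dense retriever over an open corpus.

```python
# Hypothetical sketch of the description-to-fine-tuning pipeline.
# LLM and retriever calls are stubbed; names and values are illustrative.

def extract_properties(description: str) -> dict:
    """Step 2: extract structured domain properties (stubbed)."""
    return {
        "topics": ["clinical notes"],           # illustrative values only
        "style": "terse, abbreviation-heavy",
        "terminology": ["ICD-10 codes"],
    }

def generate_seed_documents(properties: dict, n: int = 3) -> list[str]:
    """Step 3: generate seed documents matching the properties (stubbed)."""
    return [f"Seed doc {i} about {properties['topics'][0]}" for i in range(n)]

def retrieve_similar(seed: str, corpus: list[str], k: int = 2) -> list[str]:
    """Step 4: use each seed as a query anchor over an open corpus.
    Stubbed with word overlap; a real system would use a dense retriever."""
    return [d for d in corpus if seed.split()[-1] in d][:k]

def generate_queries(doc: str) -> list[str]:
    """Step 5: synthetic queries for a constructed document (stubbed)."""
    return [f"What does this say about {doc.split()[-1]}?"]

def build_training_pairs(description: str, open_corpus: list[str]):
    """Steps 2-6: produce (query, document) pseudo-relevance pairs
    that would then be fed to retrieval-model fine-tuning."""
    props = extract_properties(description)
    pairs = []
    for seed in generate_seed_documents(props):
        # Fall back to the seed itself if retrieval finds nothing.
        for doc in retrieve_similar(seed, open_corpus) or [seed]:
            for query in generate_queries(doc):
                pairs.append((query, doc))
    return pairs
```

The output pairs play the role of labeled query-document data, which is exactly the resource the target domain could not provide directly.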
The retrieval-augmented approach to domain understanding is key: at step (2), the domain description itself becomes a RAG-style query whose answer is a set of structured properties, which then parameterize generation at step (3). The method bootstraps from description, through synthesis, to training.
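One way to picture step (2) is as a prompt that asks the LLM to turn the free-text description into the structured fields that later parameterize generation. The field names and prompt wording below are assumptions for illustration, not taken from the source.

```python
# Hypothetical prompt template for step (2): the domain description is
# the query, and the LLM is asked for structured domain properties.
# Field names are illustrative, not from the original work.

PROPERTY_PROMPT = """You are analyzing a retrieval domain.
Domain description: {description}

Return JSON with keys:
- "topics": typical document topics
- "linguistic_attributes": register, sentence style, jargon density
- "source_characteristics": who writes these documents, and where
- "terminology_patterns": recurring vocabulary or identifier formats
"""

def property_prompt(description: str) -> str:
    """Fill the template with a concrete domain description."""
    return PROPERTY_PROMPT.format(description=description)
```

Requesting JSON keeps the extracted properties machine-readable, so step (3) can consume them directly when generating seed documents.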
Evaluation on five diverse target domains shows that description-based adaptation outperforms existing dense retrieval baselines in the zero-target-access scenario. The approach enables adaptation in precisely the contexts where conventional adaptation is blocked: privacy-sensitive domains, legally restricted data, competitive scenarios.
Source: RAG
Related concepts in this collection
- Does model access level determine which specialization techniques work?
  Different specialization approaches require different levels of access to a model's internals. Understanding this constraint helps practitioners choose realistic techniques for their domain adaptation goals.
  Connection: description-based adaptation enables white-box-style performance under a grey/black-box access constraint: you describe the domain without sharing the data.
- Can organizing knowledge structures beat raw training data volume?
  Does structuring domain knowledge into taxonomies during training enable models to learn more efficiently than simply increasing the amount of training data? This challenges assumptions about scaling knowledge injection.
  Connection: both notes show that structured domain knowledge (a taxonomy or a description) dramatically reduces data requirements; the key is capturing domain structure, not data volume.
Original note title: domain adaptation for retrieval is possible without target collection via description-based synthetic data