Clustering-based Sampling for Few-Shot Cross-Domain Keyphrase Extraction
Keyphrase extraction is the task of identifying a set of keyphrases in a document that capture its most salient topics. Domain-specific pre-training on scientific text has led to state-of-the-art keyphrase extraction performance, with a majority of benchmarks lying within that domain. In this work, we explore how to effectively enable the cross-domain generalization capabilities of such models without requiring the same scale of data. We primarily focus on the few-shot setting in non-scientific domain datasets such as OpenKP from the Web domain and StackEx from the StackExchange forum. We propose to leverage topic information intrinsically available in the data to build a novel clustering-based sampling approach that selects a few representative samples from the target domain to label, enabling robust and performant models.
Keyphrases are words or phrases that convey the most salient topics of an article or document, and identifying such keyphrases can be very useful for extracting key information from long documents through summarization (Zhang et al., 2004; Qazvinian et al., 2010), semantic and faceted search (Gutwin et al., 1999; Sanyal et al., 2019), and document retrieval (Jones and Staveley, 1999). Recently, a large body of work has applied language models (LMs) to this task, including generative models that perform keyphrase generation.
Task-specific pre-training of LMs for keyphrase extraction requires an abundance of supervised data with documents and their corresponding keyphrases. Obtaining human-annotated data can be an expensive, error-prone, and inefficient process; hence, a majority of the labelled datasets for keyphrase extraction come from the scientific domain.
Fine-tuning with a sufficiently large dataset typically allows the model to generalize well beyond the pre-training domain. However, for low-resource domains, such data can be difficult to obtain at scale. Few-shot learning is a setup extensively explored with very large language models, typically in-context (Brown et al., 2020; Lin et al., 2022; Srivastava et al., 2022); we instead focus on the more niche setup of few-shot learning via fine-tuning for sequence tagging with encoder-only models. Keyphrase-aware PLMs are trained to build strong representations of keyphrases in text, and we hypothesize that these embeddings can be leveraged to bootstrap a model by fine-tuning it on only a few samples from the target domain to obtain satisfactory performance.
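To make the sampling idea concrete, the following is a minimal sketch of clustering-based sample selection, assuming sentence-transformers embeddings and k-means clustering; the encoder name (all-MiniLM-L6-v2), the helper select_samples_to_label, and the nearest-to-centroid heuristic are illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch: cluster target-domain documents by embedding, then pick
# the document nearest each centroid as a diverse candidate for labeling.
# Assumptions (not from the paper): sentence-transformers as the encoder
# and k-means as the clustering algorithm.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min


def select_samples_to_label(documents, k=16, seed=0):
    """Select k diverse documents from an unlabeled target-domain pool."""
    # Hypothetical encoder choice; any keyphrase-aware PLM could be swapped in.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(documents, normalize_embeddings=True)

    # One cluster per sample in the annotation budget, so clusters act as
    # a proxy for the topics intrinsically present in the data.
    kmeans = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(embeddings)

    # For each centroid, take the index of its closest document.
    closest, _ = pairwise_distances_argmin_min(
        kmeans.cluster_centers_, embeddings
    )
    return [documents[i] for i in closest]
```

Under these assumptions, the selected documents span distinct regions of the embedding space, so a small labeling budget covers more of the target domain's topical variety than uniform random sampling would.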