LESS: Selecting Influential Data for Targeted Instruction Tuning
Instruction tuning has unlocked powerful capabilities in large language models (LLMs), using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills (e.g., reasoning). The challenge lies in identifying the most relevant data from these extensive datasets to effectively develop specific capabilities, a setting we frame as targeted instruction tuning. We propose LESS, an optimizer-aware and practically efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. Crucially, LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data. LESS first constructs a highly reusable and transferable gradient datastore with low-dimensional gradient features and then selects examples based on their similarity to few-shot examples embodying a specific capability. Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Furthermore, the selected data is highly transferable: smaller models can be leveraged to select useful data for larger models and models from different families. Our qualitative analysis shows that our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
Recent efforts curating highly diverse and wide-ranging instruction tuning datasets (Taori et al., 2023; Wang et al.; Mukherjee et al., 2023; Xu et al., 2023, inter alia) induce remarkably strong generalization even from a small number of examples (Zhou et al., 2023). Nevertheless, how best to utilize these various datasets remains an open problem.
Many real-world applications call for cultivating a specific suite of capabilities in LLMs (e.g., reasoning skills). However, training LLMs with mixed instruction tuning datasets can hinder the development of these specific capabilities. For example, Wang et al. (2023b) demonstrate that LLMs trained on a mix of instruction tuning datasets exhibit worse performance than those trained on a subset of the data. Additionally, considering the broad spectrum of user queries and the multitude of skills required to respond to them, there may not always be enough in-domain data available.
We therefore aim to use general instruction tuning data effectively to improve specific capabilities. We frame this setting as targeted instruction tuning:
Given just a handful of examples embodying a specific capability, how can we effectively select relevant fine-tuning data from a large collection of instruction datasets?
We approach this problem by prioritizing training on data that directly minimizes loss on a target task instead of relying on surface form features (Gururangan et al., 2020; Xie et al., 2023b). Inspired by past works estimating the influence of individual training datapoints with gradient information (Pruthi et al., 2020; Han et al., 2023), we design an optimizer-aware approach to select such data. However, straightforward application of this influence formulation faces several challenges unique to the instruction tuning setting: (1) LLMs are traditionally fine-tuned with the Adam optimizer (Kingma & Ba, 2015) instead of the canonical SGD optimizer; (2) using sequence-level gradients of variable-length instruction data can derail the influence estimation; and (3) the large number of trainable parameters in LLMs makes the computation and storage of gradient information extremely resource-intensive.
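To make the selection idea concrete, the following is a minimal Python sketch of gradient-similarity data selection under stated assumptions; it is an illustration, not the LESS implementation. Per-example gradients are assumed to be already computed and flattened into rows of a NumPy array. The helper names (`project`, `normalize`, `select_top_fraction`), the random sign projection, and the averaging of target gradients are all hypothetical choices made for this example. The sketch addresses challenge (3) by projecting gradients to a low dimension and challenge (2) by scoring with cosine similarity rather than a raw inner product, so that gradient norm does not dominate the score. It does not model the Adam update of challenge (1); raw gradients stand in for the optimizer-aware training features.

```python
import numpy as np

def project(grads, dim, seed=0):
    """Randomly project per-example gradients (n, d) down to (n, dim).

    The same seed must be used for training and target gradients so that
    both are mapped by the same projection matrix.
    """
    rng = np.random.default_rng(seed)
    d = grads.shape[1]
    # Random sign matrix, scaled so inner products are preserved in expectation.
    proj = rng.choice([-1.0, 1.0], size=(d, dim)) / np.sqrt(dim)
    return grads @ proj

def normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def select_top_fraction(train_feats, target_feats, fraction):
    """Rank training examples by cosine similarity to the mean target feature.

    Returns the indices of the top `fraction` of examples and all scores.
    """
    train_unit = normalize(train_feats)
    target_unit = normalize(target_feats.mean(axis=0, keepdims=True))
    scores = (train_unit @ target_unit.T).ravel()
    k = max(1, int(round(fraction * len(scores))))
    top = np.argsort(-scores)[:k]
    return top, scores
```

In use, one would project the training gradients once, store the resulting features, and reuse them for each new target task by projecting only the handful of target gradients with the same seed before calling `select_top_fraction(train_feats, target_feats, 0.05)`.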