Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
Instruction tuning is critical for large language models (LLMs) to achieve better instruction-following and task-adaptation capabilities, but its success heavily relies on the quality of the training data. Many recent methods focus on improving data quality but often overlook the compatibility of the data with the student model being finetuned. This paper introduces Selective Reflection-Tuning, a novel paradigm that synergizes a teacher LLM's reflection and introspection for improving existing data quality with the data selection capability of the student LLM, to automatically refine existing instruction-tuning data. This teacher-student collaboration produces high-quality and student-compatible instruction-response pairs, resulting in sample-efficient instruction tuning and LLMs of superior performance. Selective Reflection-Tuning is a data augmentation and synthesis method that generally improves LLM finetuning and self-improvement without collecting brand-new data.
Is the teacher-refined data compatible with the needs of the student model? These approaches typically do not account for the inherent randomness and potential degradation in the generative models' outputs, overlooking how the student model responds to these "improved" data samples. A mechanism for the student model to selectively integrate these enhancements has thus been notably absent. To bridge this gap, our work introduces a teacher-student collaboration pipeline in which a teacher generative model engages in a reflection process to enhance both the instruction and the response of a data sample. The student model then evaluates whether to incorporate these improvements based on its unique statistical attributes. This pipeline is versatile and can be adapted to various contexts where data enhancement is needed.
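To make the collaboration concrete, below is a minimal Python sketch of the pipeline under stated assumptions: the `teacher.generate` API, the prompts, and the `score_fn` hook are illustrative placeholders, not the exact implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    instruction: str
    response: str

def teacher_reflect(teacher, sample: Sample) -> Sample:
    """Ask the teacher LLM to critique and rewrite the pair
    (hypothetical `generate` API and prompts)."""
    critique = teacher.generate(
        "Reflect on this instruction-response pair and point out weaknesses:\n"
        f"Instruction: {sample.instruction}\nResponse: {sample.response}")
    improved_instruction = teacher.generate(
        f"Given the critique below, rewrite the instruction.\n{critique}")
    improved_response = teacher.generate(
        f"Answer the instruction:\n{improved_instruction}")
    return Sample(improved_instruction, improved_response)

def refine_dataset(teacher, data: list[Sample],
                   score_fn: Callable[[Sample], float]) -> list[Sample]:
    """Keep the teacher's reflected version only when the student-side
    score_fn prefers it over the original sample."""
    refined = []
    for sample in data:
        candidate = teacher_reflect(teacher, sample)
        keep_candidate = score_fn(candidate) > score_fn(sample)
        refined.append(candidate if keep_candidate else sample)
    return refined
```

Passing the student-side criterion in as `score_fn` keeps the pipeline agnostic to how compatibility is measured; one plausible instantiation of such a score is sketched after the next paragraph.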
Then another pivotal question arises: how does the student model decide which enhanced data are needed and critical to its training? This question underpins the challenge of autonomously evaluating the quality of instructions and responses. Common practices involve utilizing sophisticated models like GPT-4 for assessment (Zheng et al., 2023; Li et al., 2023e; Liu et al., 2023c; Chiang and Lee, 2023) or employing a secondary judge model equipped with evaluative capabilities (Wang et al., 2023c; Li et al., 2023a). These methods, however, share a limitation: they fail to address the discrepancy between the evaluating model and the actual student model undergoing training. In the latter approach in particular, even though the judge model and the student model might share the same architecture, their weight distributions diverge once the judge is endowed with evaluative functions.
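One way to avoid this judge-student discrepancy is to let the student model itself supply the quality signal. As an illustration, here is a hedged sketch of one such signal: a conditional-perplexity ratio computed directly with the student's own weights (an IFD-style score); this is an assumed instantiation of the "statistical attributes" above, and the actual scoring may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_loss(model, tokenizer, context: str, target: str) -> float:
    """Average negative log-likelihood of `target` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    tgt_ids = tokenizer(target, add_special_tokens=False,
                        return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # mask context; score only the target
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss
    return loss.item()

def ifd_style_score(model, tokenizer, instruction: str, response: str) -> float:
    """Ratio of the response loss with vs. without the instruction: a value
    near 1 means the instruction barely helps the student predict the
    response, flagging a hard, informative sample."""
    conditioned = sequence_loss(model, tokenizer, instruction, response)
    unconditioned = sequence_loss(model, tokenizer, "", response)
    return conditioned / unconditioned
```

Because the score is computed with the student's own weights, it reflects exactly the model being trained; it can be plugged into the earlier pipeline sketch, e.g. `score_fn = lambda s: ifd_style_score(model, tokenizer, s.instruction, s.response)`.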
Prior work has demonstrated that LLMs can refine themselves using self-generated data, leading to enhanced performance and efficiency without human intervention, i.e., self-improving LLMs (Huang et al., 2023). In this setting, it is crucial to select high-quality samples from the vast amounts of generated data. There are several ways to acquire quality signals for data samples: 1) LLM judges (Yuan et al., 2024); 2) ranking or reward models (Dong et al., 2023; Lu et al., 2024); and 3) automatic methods such as execution feedback (Haluptzok et al., 2022) or self-consistency across multiple reasoning paths (Wang et al., 2023b). One distinct aspect of our work is that we iteratively optimize general instruction-following data, where answers are not verifiable as they are in math and coding problems, which makes most previous approaches infeasible or unaffordable (Huang et al., 2023; Li et al., 2024a).
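For intuition on why verifiability matters, below is a minimal sketch of the self-consistency signal (Wang et al., 2023b) under the assumption of a parseable final answer; `model.generate` and the `extract_answer` callback are hypothetical helpers.

```python
from collections import Counter
from typing import Callable

def self_consistency_keep(model, question: str,
                          extract_answer: Callable[[str], str],
                          n_paths: int = 8,
                          min_agreement: float = 0.5) -> tuple[str, bool]:
    """Sample several reasoning paths and keep the sample only if a
    majority of paths agree on the same final answer (hypothetical
    `model.generate` API)."""
    answers = [extract_answer(model.generate(question, temperature=0.7))
               for _ in range(n_paths)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_paths >= min_agreement
```

Such a filter presupposes an extractable final answer to vote on, which open-ended instruction-following responses lack; this is precisely why we rely on student-side statistics instead.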