Self-Alignment with Instruction Backtranslation

Paper · arXiv 2308.06259 · Published August 11, 2023
Alignment

The method, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high-quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model. Finetuning LLaMa on two iterations of our approach yields a model that outperforms all other LLaMa-based models on the Alpaca leaderboard that do not rely on distillation data, demonstrating highly effective self-alignment.

The model is first used to self-augment its training set: for each web document, it creates an instruction-following training example by predicting a prompt (instruction) that would be correctly answered by (a portion of) that document. Directly training on such data (similarly to Köksal et al. [2023]) gives poor results in our experiments, both because of the mixed quality of human-written web text and because of noise in the generated instructions. To remedy this, we show that the same seed model can be used to self-curate the set of newly created augmentation data by predicting their quality, and can then be self-trained on only the highest-quality (instruction, output) pairs. The procedure is then iterated, using the improved model to better curate the instruction data and re-training to produce a better model.
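To make the loop concrete, here is a minimal Python sketch of the iterative procedure. The helpers `generate_instruction`, `rate_pair`, and `finetune_on` are hypothetical stand-ins for inference and training calls against the seed/intermediate models; they are assumptions for illustration, not the paper's released code.

```python
from typing import Callable


def instruction_backtranslation(
    web_documents: list[str],
    generate_instruction: Callable[[str], str],          # backward model: document -> candidate instruction
    rate_pair: Callable[[str, str], int],                 # current model: (instruction, output) -> quality score
    finetune_on: Callable[[list[tuple[str, str]]], None], # finetunes the model on the selected pairs
    num_iterations: int = 2,
    min_score: int = 5,
) -> None:
    """Sketch of the self-augment / self-curate / self-train loop."""
    for _ in range(num_iterations):
        # Self-augment: predict a candidate instruction for each unlabelled document.
        candidates = [(generate_instruction(doc), doc) for doc in web_documents]

        # Self-curate: keep only the pairs the current model rates as high quality.
        curated = [
            (instruction, output)
            for instruction, output in candidates
            if rate_pair(instruction, output) >= min_score
        ]

        # Self-train: finetune on the curated data; in the next iteration the
        # improved model does the curation, so the filtered set improves as well.
        finetune_on(curated)
```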

The unlabelled data is a large, diverse set of human-written documents which includes writing about all manner of topics humans are interested in – but crucially is not paired with instructions. A first key assumption is that there exists some subset of this very large human-written text that would be suitable as gold generations for some user instructions. A second key assumption is that we can predict instructions for these candidate gold answers that can be used as high quality example pairs to train an instruction following model. Our overall process, which we call instruction backtranslation, thus performs two core steps:

  1. Self-augment: Generate instructions for unlabelled data, i.e. the web corpus, to produce candidate training data of (instruction, output) pairs for instruction tuning.

  2. Self-curate: Self-select high-quality demonstration examples as training data to finetune the base model to follow instructions. This step is performed iteratively: each improved intermediate instruction-following model can better select the data used for finetuning in the next iteration (a sketch of the curation step follows this list).
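The self-curation step can be sketched as prompting the intermediate model to rate each candidate pair and keeping only the top-rated ones. In the sketch below, the rating prompt is a paraphrase (the paper rates candidates on a 5-point scale and keeps only the highest-rated pairs, but the exact wording here is an assumption), and `model_generate` is a hypothetical text-generation call.

```python
import re
from typing import Callable

# Paraphrased rating prompt; not the paper's exact wording.
RATING_PROMPT = (
    "Below is an instruction and a candidate answer. Rate how well the answer "
    "follows the instruction on a scale from 1 to 5, where 5 means the answer "
    "is an excellent, complete response.\n\n"
    "Instruction: {instruction}\n\nAnswer: {output}\n\nScore:"
)


def self_curate(
    candidates: list[tuple[str, str]],
    model_generate: Callable[[str], str],
    threshold: int = 5,
) -> list[tuple[str, str]]:
    """Keep only the (instruction, output) pairs the model rates at or above the threshold."""
    kept = []
    for instruction, output in candidates:
        reply = model_generate(RATING_PROMPT.format(instruction=instruction, output=output))
        match = re.search(r"[1-5]", reply)  # pull the first digit 1-5 out of the model's reply
        if match and int(match.group()) >= threshold:
            kept.append((instruction, output))
    return kept
```

Keeping only the top score band trades quantity for quality, which is the point of curation: the experiments in the paper indicate that training on all candidate pairs is less effective than training on the smaller, higher-quality subset.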

For each unlabelled example y_i, we run inference on the backward model to generate a candidate instruction x̂_i, from which we derive the candidate augmented paired data A := {(x̂_i, y_i)}. As we will see in experiments, not all of these candidate pairs are of high quality, and in that case using them all for self-training may not be beneficial.
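As an illustration, a minimal sketch of assembling the candidate set A follows. The `backward_model_generate` call and the conditioning format are assumptions standing in for inference with the backward model (which is finetuned on (output, instruction) pairs); they are not taken from the paper's code.

```python
from typing import Callable


def self_augment(
    documents: list[str],
    backward_model_generate: Callable[[str], str],  # hypothetical inference call for the backward model
) -> list[tuple[str, str]]:
    """Build A := {(x̂_i, y_i)} by predicting a candidate instruction for each document."""
    augmented = []
    for y in documents:
        # Condition the backward model on the document y_i to predict an instruction x̂_i.
        x_hat = backward_model_generate(y)
        augmented.append((x_hat, y))
    return augmented
```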