Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Paper · arXiv 2406.08464 · Published June 12, 2024
Alignment · Synthetic Dialog · Self Refinement Self Consistency Feedback

Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present MAGPIE, a self-synthesis method for generating large-scale alignment data. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the pre-query template, up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We further introduce extensions of MAGPIE for filtering and for generating multi-turn, preference optimization, domain-specific, and multilingual datasets.

Is it possible to synthesize high-quality instructions at scale by directly extracting data from advanced aligned LLMs? A typical input to an aligned LLM contains three key components: the pre-query template, the query, and the post-query template. For instance, an input to Llama-2-chat could be “[INST] Hi! [/INST]”, where [INST] is the pre-query template and [/INST] is the post-query template. These templates are predefined by the creators of the aligned LLMs to ensure the correct prompting of the models. We observe that when we only input the pre-query template to aligned LLMs such as Llama-3-Instruct, they self-synthesize a user query due to their auto-regressive nature. Our experiments indicate that these random user queries are of high quality and great diversity, suggesting that the abilities learned during the alignment process are effectively utilized.
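The three input components can be illustrated with a minimal sketch. The `ChatTemplate` class and its method names are hypothetical, introduced only for illustration, and the placement of the spaces around the Llama-2-chat markers is an assumption chosen to reproduce the example string above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatTemplate:
    """Hypothetical container for the two template pieces described above."""
    pre_query: str   # everything before the user query
    post_query: str  # everything after the user query

    def full_input(self, query: str) -> str:
        # Ordinary prompting: pre-query template + query + post-query template.
        return f"{self.pre_query}{query}{self.post_query}"

    def magpie_input(self) -> str:
        # MAGPIE-style prompting: stop at the position reserved for the
        # user message, so the model has to write the query itself.
        return self.pre_query


# Assumption: the spaces belong to the template pieces, not to the query.
LLAMA2_CHAT = ChatTemplate(pre_query="[INST] ", post_query=" [/INST]")

print(LLAMA2_CHAT.full_input("Hi!"))
print(repr(LLAMA2_CHAT.magpie_input()))
```

The only difference between the two calls is where the prompt stops: the second one ends exactly where a user message would begin.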

Based on these findings, we developed a self-synthesis method to construct high-quality instruction datasets at scale, named MAGPIE (as illustrated in Figure 1). Unlike existing methods, our approach does not rely on prompt engineering or seed questions. Instead, it directly constructs instruction data by prompting aligned LLMs with a pre-query template for sampling instructions.
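A rough sketch of how such self-synthesis might look in code is given below. It is not the authors' implementation: `generate` is a hypothetical callable standing in for any text-completion backend, `fake_generate` is a stub used only to make the example runnable, and the exact whitespace in the Llama-3-Instruct template strings is assumed from the published chat format.

```python
from typing import Callable, Dict

# Assumed Llama-3-Instruct template pieces (whitespace is an assumption).
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
POST_QUERY = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

# Hypothetical interface: takes a prompt, returns the sampled completion
# up to the end of the current turn.
Generate = Callable[[str], str]


def synthesize_pair(generate: Generate) -> Dict[str, str]:
    # Sample an instruction from the pre-query template alone.
    instruction = generate(PRE_QUERY).strip()
    # Close the user turn and let the model answer its own instruction.
    response = generate(PRE_QUERY + instruction + POST_QUERY).strip()
    return {"instruction": instruction, "response": response}


def fake_generate(prompt: str) -> str:
    # Stand-in for a real model so the sketch runs without any dependencies.
    if prompt.endswith(POST_QUERY):
        return "Paris."
    return "What is the capital of France?"


pair = synthesize_pair(fake_generate)
print(pair)
```

No seed question or hand-written prompt appears anywhere in the sketch; the only fixed text is the chat template itself.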

An instance of instruction data consists of one or more instruction-response pairs. Each pair specifies the roles of the instruction provider (e.g., user) and the follower (e.g., assistant), along with their instruction and response. As shown in Figure 1, MAGPIE consists of two steps: (1) instruction generation, and (2) response generation. The MAGPIE pipeline can be fully automated without any human intervention, and can be readily adapted to generate multi-turn, preference, and domain-specific datasets, as detailed in Section 2.2. We describe each step below.

Prompt template for multi-turn generation (Llama-3-Instruct chat format), ending at the position reserved for the next user message:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant. The user will engage in a multi-round conversation with you, asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and insightful responses to help the user with their queries.<|eot_id|><|start_header_id|>user<|end_header_id|> {instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|> {response}<|eot_id|><|start_header_id|>user<|end_header_id|>
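The multi-turn template can be assembled programmatically, as in the following sketch. The helper names `header` and `multi_turn_prompt` are hypothetical, and the blank line after each header is an assumption taken from the published Llama-3-Instruct chat format rather than from the text above.

```python
from typing import List, Tuple

SYSTEM_PROMPT = (
    "You are a helpful AI assistant. The user will engage in a multi-round "
    "conversation with you, asking initial questions and following up with "
    "additional related questions. Your goal is to provide thorough, relevant "
    "and insightful responses to help the user with their queries."
)


def header(role: str) -> str:
    # Assumption: each role header is followed by a blank line.
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n"


def multi_turn_prompt(turns: List[Tuple[str, str]],
                      system: str = SYSTEM_PROMPT) -> str:
    """Build a prompt from earlier (instruction, response) turns that stops
    where the next user message would begin."""
    parts = ["<|begin_of_text|>", header("system"), system, "<|eot_id|>"]
    for instruction, response in turns:
        parts += [header("user"), instruction, "<|eot_id|>",
                  header("assistant"), response, "<|eot_id|>"]
    # Leave the final user turn open so the model writes the follow-up query.
    parts.append(header("user"))
    return "".join(parts)


prompt = multi_turn_prompt([("What is MAGPIE?", "A data synthesis method.")])
print(prompt)
```

Feeding the resulting string to the model would yield a follow-up instruction; appending that instruction and an assistant header would then yield the next response, so the same two steps repeat for each additional turn.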