CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) over given seed tasks, and then to generate a new synthetic prompt of similar quality and complexity for use in LLM training. The generated data is then filtered for quality with automatic metrics.
Human-generated data, being inherently prone to biases and errors, may not always be ideal for model training or evaluation.
Synthetic data is artificially generated to replicate the characteristics and patterns of real-world data. One innovative approach to creating such data is the Self-Instruct method (Wang et al., 2022a), which utilizes LLMs themselves to generate instruction-following examples. This method begins by selecting a small set of seed instruction-following samples, which are then used to prompt LLMs to produce additional demonstrations in a similar format. Since then, a number of variants have been introduced that increase the complexity of queries (Liu et al., 2023; Zeng et al., 2024), maintain semantic diversity (Ding et al., 2023), scale up the synthetic data (Yuan et al., 2023), and use these methods in self-improvement loops (Yuan et al., 2024). However, a significant challenge with these approaches is ensuring the quality and effectiveness of the generated data for language model training. Overall, generating high-quality synthetic data and optimizing its use for both reasoning and non-reasoning tasks remains insufficiently understood.
In this paper, we present Chain-of-Thought (CoT) Self-Instruct, a method that both (i) uses reasoning to help create high-quality synthetic data; and (ii) self-filters the created data to retain only the highest-quality examples (see Figure 1). We show the efficacy of this approach for creating both verifiable reasoning data and non-verifiable instruction-following tasks; in both cases, using Chain-of-Thought (CoT) to help generate the examples outperforms generation without CoT. To curate high-quality verifiable data, we introduce Answer-Consistency, which discards examples where the CoT-Self-Instruct-generated answer does not match the majority-vote solution of the LLM, on the assumption that such examples are either incorrectly labeled or too difficult. For non-verifiable data, we use the recent Rejecting Instruction Preferences (RIP) method (Yu et al., 2025), which measures the quality of prompts based on the distribution of reward-model scores over LLM responses. In both cases, filtering provides further gains.
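To make the Answer-Consistency filter concrete, the following is a minimal sketch of the idea. The `solve` sampler, the number of samples `k`, and exact-match comparison are our illustrative assumptions, not details specified in the paper:

```python
from collections import Counter

def answer_consistency_filter(examples, solve, k=16):
    """Keep only (prompt, target) pairs whose generated target matches the
    majority-vote answer over k sampled LLM solutions; mismatched examples
    are assumed to be incorrectly labeled or too difficult."""
    kept = []
    for prompt, target in examples:
        answers = [solve(prompt) for _ in range(k)]  # sample k solutions from the LLM
        majority, _ = Counter(answers).most_common(1)[0]
        if majority == target:
            kept.append((prompt, target))
    return kept
```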
3.1 SYNTHETIC INSTRUCTION CREATION VIA COT
The process of CoT-Self-Instruct data creation starts with a small set of seed instructions as the instruction pool. Multiple instructions are sampled at random from the instruction pool and used to few-shot prompt a language model to generate a series of intermediate reasoning steps, followed by a new instruction. Unlike standard Self-Instruct (Wang et al., 2022a), which directly prompts the model to write new instructions given a list of seed instructions, each time we show the LLM an N-shot set of sample instructions we first ask it to carefully analyze the given instructions, e.g., their domain, complexity, and purpose. After analyzing the seed instructions and reflecting on what makes them high-quality prompts, the LLM is prompted to reason step by step to devise a plan for generating a new self-contained instruction of similar quality and complexity to the given seed instructions, and ultimately to output the final synthetic instruction satisfying these requirements in a strict answer format.
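A minimal sketch of one round of this loop is given below, assuming a text-in/text-out `generate` LLM call; the `[NEW INSTRUCTION]` tags stand in for the strict answer format, and the prompt wording here is only a paraphrase (the paper's actual prompts appear in Figures 2 and 3):

```python
import random
import re

def cot_self_instruct_round(seed_pool, generate, n_shot=4):
    """Sample seed instructions, few-shot prompt the LLM to reason step by
    step, then parse the new instruction from a strict tagged format."""
    seeds = random.sample(seed_pool, n_shot)
    prompt = (
        "Carefully analyze the following instructions (their domain, complexity,\n"
        "and purpose), reason step by step about what makes them high quality,\n"
        "then write ONE new self-contained instruction of similar quality and\n"
        "complexity. Wrap it in [NEW INSTRUCTION] ... [/NEW INSTRUCTION].\n\n"
        + "\n\n".join(f"Instruction {i + 1}: {s}" for i, s in enumerate(seeds))
    )
    completion = generate(prompt)  # reasoning steps precede the tagged answer
    match = re.search(r"\[NEW INSTRUCTION\](.*?)\[/NEW INSTRUCTION\]",
                      completion, re.DOTALL)
    return match.group(1).strip() if match else None  # drop malformed outputs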
Verifiable reasoning tasks For reasoning tasks where there is a deterministic answer against which model outputs can be compared to generate verifiable rewards during training, we instruct the LLM to use reasoning to generate both an instruction and the verifiable target. The prompt we used for CoT-Self-Instruct on reasoning tasks is given in Figure 2.
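In this verifiable setting, the model's completion must yield both fields. A small parsing sketch follows; the `[QUESTION]`/`[ANSWER]` tags are hypothetical stand-ins for the strict answer format of the actual prompt in Figure 2:

```python
import re

def parse_verifiable_example(completion):
    """Extract an (instruction, answer) pair from a tagged completion; the
    answer serves as the verifiable target for reward computation."""
    q = re.search(r"\[QUESTION\](.*?)\[/QUESTION\]", completion, re.DOTALL)
    a = re.search(r"\[ANSWER\](.*?)\[/ANSWER\]", completion, re.DOTALL)
    if q and a:
        return q.group(1).strip(), a.group(1).strip()
    return None  # malformed completions are discarded
```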
General instruction-following tasks For tasks involving general instruction-following with open-ended responses, we direct the LLM to use reasoning to generate only the instruction, not the response itself. In these instances, later during training on this synthetic data we utilize a reward model to assess the responses, eliminating the need for a reference answer. The prompt we used for CoT-Self-Instruct on general instruction-following tasks is given in Figure 3. Seed prompt pools for instruction-following typically span a variety of domains.
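For intuition, an RIP-style filtering pass over such non-verifiable prompts might look like the sketch below. This is only one plausible reading of "filtering on the distribution of reward-model scores": here we use the minimum score across sampled responses as the distribution statistic, and `generate`, `reward_model`, `k`, and `threshold` are all our illustrative assumptions rather than the exact criterion of Yu et al. (2025):

```python
def rip_style_filter(prompts, generate, reward_model, k=8, threshold=0.0):
    """Score k sampled responses per prompt with a reward model and keep
    prompts whose worst response still clears a quality threshold."""
    kept = []
    for prompt in prompts:
        responses = [generate(prompt) for _ in range(k)]
        scores = [reward_model(prompt, r) for r in responses]
        if min(scores) >= threshold:  # one simple statistic of the score distribution
            kept.append(prompt)
    return kept
```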