Scaling Synthetic Data Creation with 1,000,000,000 Personas

Paper · arXiv 2406.20094 · Published June 28, 2024

Therefore, to create diverse synthetic data at scale (e.g., 1 billion diverse math problems), a large number of diverse prompts are needed. Previous research tends to diversify the data synthesis prompt through the following two paradigms, but unfortunately, neither can practically achieve scalable synthetic data creation:

• Instance-driven: This approach diversifies the data synthesis prompt by leveraging a seed corpus (i.e., creating new instances based on the instances in the seed corpus). Representative studies include Wang et al. (2022) and Yu et al. (2023). However, under this paradigm, the diversity of the synthesized data mainly comes from the seed instances, making it difficult to truly extend beyond the seed corpus. Given the limited size of a seed corpus in most practical scenarios, it is challenging for this paradigm to scale up the creation of synthetic data.

• Key-point-driven: This approach diversifies the data synthesis prompt with a curated comprehensive list of key points (or concepts) that can be a topic, a subject, or any knowledge we expect synthetic data to encompass.
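To make the contrast between the two paradigms concrete, the following is a minimal sketch of how each one might build its data synthesis prompts. The function names, prompt templates, and toy inputs are all hypothetical and are not taken from the cited studies. The sketch only illustrates that the number of distinct prompts is bounded by the size of the seed corpus in the first case and by the size of the curated key-point list in the second.

```python
from itertools import product


def instance_driven_prompts(seed_instances):
    """Build one prompt per seed instance (hypothetical template).

    The set of prompts can never be larger than the seed corpus itself.
    """
    return [
        f"Here is an example problem:\n{seed}\nCreate a new problem inspired by it."
        for seed in seed_instances
    ]


def key_point_driven_prompts(subjects, topics):
    """Build one prompt per (subject, topic) pair (hypothetical template).

    Coverage depends entirely on how comprehensive the curated lists are.
    """
    return [
        f"Create a problem about {topic} in {subject}."
        for subject, topic in product(subjects, topics)
    ]


# Toy inputs, assumed purely for illustration.
seeds = ["What is 2 + 3?", "Solve x^2 = 9."]
subjects = ["algebra", "geometry"]
topics = ["equations", "inequalities", "proofs"]

instance_prompts = instance_driven_prompts(seeds)
key_point_prompts = key_point_driven_prompts(subjects, topics)

print(len(instance_prompts))   # number of seed instances
print(len(key_point_prompts))  # number of subjects times number of topics
```

In both cases, growing the number of distinct prompts requires growing the underlying resource, either the seed corpus or the curated key-point list.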

To facilitate research in persona-driven data synthesis, we initially release 200,000 personas from Persona Hub, together with the following synthetic data samples that we created with various personas:

• 50,000 math problems

• 50,000 instructions

• 10,000 game NPCs

• 50,000 logical reasoning problems

• 10,000 knowledge-rich texts

• 5,000 tools (functions)

We are open to releasing more data when we can better assess the potential risks and concerns, which will be discussed in detail in Section 5.