Orchestrating Synthetic Data with Reasoning
Many AI applications of interest require specialized multi-modal models. Yet, relevant data for training these models is inherently scarce. Human annotation is prohibitively expensive, error-prone, and time-consuming. Meanwhile, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution — limiting scalability and control. In this paper, we introduce Simula: a novel, seedless framework that balances global and local reasoning to generate synthetic datasets. We utilize taxonomies to capture a global coverage space and use a series of agentic refinements to promote local diversity and complexity. Our approach allows users to define desired dataset characteristics through an explainable and controllable process, without relying on seed data. This unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.
Data availability and access have been central to advances in artificial intelligence research. Recently, the abundance of highly diverse internet data enabled the development of increasingly capable generalist models (Gemini et al., 2023; OpenAI et al., 2023; Anthropic, 2024; Touvron et al., 2023). Despite these models’ impressive versatility, widespread integration will require them to quickly specialize on novel, uncommon, and critical applications (e.g., medicine, finance, law). Unfortunately, specialized data in these areas is often scarce or inaccessible due to cost or privacy concerns. Creating such datasets manually is expensive, time-consuming, and error-prone (Chen et al., 2023; Gilardi et al., 2023). Synthetic data offers a promising, scalable alternative (Singh et al., 2024a; Abdin et al., 2024; Guo et al., 2025). Nevertheless, how to best balance its various desiderata is an open question.
To optimize generalist models for specific tasks, practitioners typically use techniques such as finetuning (Ziegler et al., 2019; Hu et al., 2022; Chung et al., 2024), distillation (Hinton et al., 2015), reinforcement learning (Christiano et al., 2017; Jaech et al., 2024; Guo et al., 2025), and few-shot prompting (Brown et al., 2020). Each of these approaches relies on the availability of relevant example data. Developing scalable methods that can reliably deliver specialized data on-demand is thus vital to accelerate broader AI adoption. Furthermore, synthetic data can increase control and source-attribution, enabling more targeted optimization (Ruis et al., 2024).
Yet, characterizing “good” synthetic data is intrinsically challenging. Generally, “good” is discussed in terms of quality, diversity, and complexity (Havrilla et al., 2024). However, the precise definitions of these terms are contentious. Rather than describing the usefulness of data (Swayamdipta et al., 2020; Marion et al., 2023a), “quality” commonly refers to how well data points fit specific requirements. For example, if the intention is to generate “an image of a red cat”, does the resulting image contain a cat, and is that cat indeed red? Meanwhile, “complexity” can refer to how confusing or elaborate a specific data point is (Ethayarajh et al., 2022; Shao et al., 2023), but is often equated with the relative concept of “difficulty”. In the case of our red-cat image, a complex example might be a partially obscured cat, or one lying in the shadows. Finally, “diversity” offers both a global and a local perspective: does the generated data globally cover the main factors of interest, and does it locally exhibit sufficient variety within specific factors?
Existing synthetic data generation methods generally optimize only a subset of the above desiderata (Havrilla et al., 2024). They often rely on elaborate custom prompts (Gupta et al., 2024; Xu et al., 2024; Yu et al., 2023), or stochastic, evolutionary algorithms (Mehrotra et al., 2024; Fernando et al., 2024). The former generalizes poorly, while the latter lacks explainability and control. Many approaches further require a large number of “seed examples” drawn from the target dataset, an assumption that is unrealistic in many real-world settings and can hurt global coverage.
In this work we propose Simula: a holistic approach to synthetic data generation that balances global and local reasoning. Given a target dataset description, Simula maps out a global coverage space using synthetic taxonomies. Then, it applies a series of agentic refinements to promote local diversity and complexity. Finally, it performs double-critic rejection sampling to optimize quality. Our approach is seedless and provides clear notions of explainability and control, essential for optimal data curation. We rigorously test the core reasoning assumptions underlying our approach and demonstrate its efficacy through a series of carefully designed experiments.
Imagine we are interested in creating a dataset with the description y := “A dataset of stories about cats”. Due to the under-specification of y, it is infeasible to exhaustively describe the space of all datasets Y that fit the description.
For example, a dataset that fits the description above might consist of data points considering, e.g., “cat type”, “story format”, and “intended audience”. In Simula, a multi-modal model (M3) is prompted to propose factors based on a set of human-provided instructions, e.g., a description like y, and/or a sample S of existing data. These factors can be accepted or rejected by a human (or M3).
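To make this step concrete, the following is a minimal sketch of factor proposal, assuming a placeholder model call m3_list and a simple interactive accept/reject loop. All names and prompt wording here are our own illustration, not Simula’s actual implementation.

```python
# Illustrative sketch of factor proposal: an M3 proposes factors from the
# dataset description y (and, optionally, a sample S of existing data), and
# a human (or a second M3 call) accepts or rejects each one. All names and
# prompts are assumptions for illustration.
def m3_list(prompt: str) -> list[str]:
    """Placeholder model call returning a list of strings."""
    raise NotImplementedError("plug in your model client here")

def propose_factors(y: str, seed_sample: list[str] | None = None) -> list[str]:
    prompt = (f"Dataset description: {y}\n"
              "Propose factors of variation for this dataset.")
    if seed_sample:  # seed data is optional; the framework is seedless by default
        prompt += f"\nExisting examples: {seed_sample}"
    proposed = m3_list(prompt)
    # Each factor can be accepted or rejected, here by a human in the loop.
    return [f for f in proposed if input(f"Keep factor '{f}'? [y/n] ") == "y"]

# E.g., for y = "A dataset of stories about cats", accepted factors might be
# "cat type", "story format", and "intended audience".
```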
Using taxonomies provides granular explainability and control of Y compared to random sampling (Figure 1.a). Intuitively, as we increase the number of factors and taxonomy depths, we sharpen our coverage control (Figure 1.b). However, this granular control comes at a potential cost: with every taxonomy expansion we risk “missing” nodes of interest, resulting in the progressive coverage loss depicted in Figure 1.c.
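To see why this loss compounds, consider a simplified back-of-the-envelope model (our illustration, not a quantity measured in the paper): if each expansion step recalls only a fraction r of the relevant children, then after d levels the retained coverage is

$$C_d = r^{\,d}, \qquad \text{e.g., } r = 0.95,\ d = 4 \ \Rightarrow\ C_4 = 0.95^4 \approx 0.81,$$

so even a small per-level miss rate erodes coverage progressively with depth.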
To mitigate potential coverage loss resulting from missing nodes, we generate factor taxonomies by alternating between three steps, as sketched below: (1) Given a node, its ancestors, and its siblings, an M3 is prompted N times to propose children nodes. This sampling strategy, inspired by the “Best-of-N” literature, broadens the proposal distribution and helps cover edge cases. (2) In a separate call, an M3 is prompted to locally critique the generated nodes, e.g., on completeness, soundness, and specificity, taking advantage of M3s’ observed generator-critic gap (Huang et al., 2024). Finally, (3) after generating all nodes of a specific level, an M3 is prompted to generate a “plan” for the next level. This last step enables consistent and fast parallel generation, e.g., by ensuring a similar degree of granularity at different node expansions on the same level.
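A minimal sketch of this alternating generate/critique/plan loop follows, under the assumption of a placeholder model call m3 that returns proposed labels; the node structure and prompt wording are illustrative, not the paper’s implementation.

```python
# Sketch of the three-step taxonomy expansion loop described above.
from dataclasses import dataclass, field

@dataclass
class TaxonomyNode:
    label: str
    parent: "TaxonomyNode | None" = None
    children: list["TaxonomyNode"] = field(default_factory=list)

    def ancestors(self) -> list[str]:
        node, path = self.parent, []
        while node is not None:
            path.append(node.label)
            node = node.parent
        return path[::-1]

def m3(prompt: str) -> list[str]:
    """Placeholder multi-modal model call returning proposed labels."""
    raise NotImplementedError("plug in your model client here")

def expand_level(level: list[TaxonomyNode], n_proposals: int = 4) -> list[TaxonomyNode]:
    next_level: list[TaxonomyNode] = []
    for node in level:
        siblings = [s.label for s in (node.parent.children if node.parent else [])
                    if s is not node]
        # Step 1: sample N proposals (Best-of-N style) to broaden the
        # proposal distribution and cover edge cases.
        proposals: set[str] = set()
        for _ in range(n_proposals):
            proposals.update(m3(
                f"Node: {node.label}\nAncestors: {node.ancestors()}\n"
                f"Siblings: {siblings}\nPropose child nodes."))
        # Step 2: a separate critique call filters for completeness,
        # soundness, and specificity.
        kept = m3(f"Critique and keep only sound, specific children of "
                  f"'{node.label}': {sorted(proposals)}")
        for label in kept:
            child = TaxonomyNode(label, parent=node)
            node.children.append(child)
            next_level.append(child)
    # Step 3: plan the next level so parallel expansions share a similar
    # granularity (in a full pipeline, this plan would condition the next
    # round of proposals).
    _plan = m3(f"Write a one-paragraph expansion plan for the next level of: "
               f"{[n.label for n in next_level]}")
    return next_level
```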
2.2 GENERATING CONTROLLABLE AND EXPLAINABLE SYNTHETIC DATA AT SCALE
To generate a synthetic dataset that fits our requirements, we distinguish between two phases: taxonomic sampling (Figure 2.c) and agentic refinement (Figure 2.d-e). Initially, an M3 formulates a plan composed of sampling strategies. A strategy defines which taxonomies can be combined, and with which weights. This is important, as not all sub-taxonomies make sense to combine (e.g., writing a horror novel about a troubled cat for toddlers seems ill-advised). A practical application of strategies could involve aiming for an equal split between kid and adult audiences, where the M3 might propose two strategies, filtering inappropriate formats like “horror” from the kids’ strategy. The generation pipeline then samples a strategy and nodes from the corresponding taxonomies Tj. These sampled “requirements”, along with the original dataset instructions y, guide an M3 to construct one or more “meta prompts”. For example, M3(y; house cat, poem, travel enthusiast) becomes “Compose an exciting haiku about a house cat who goes on an adventure”. Finally, these meta prompts direct an M3 to generate the data outputs; a simplified sketch of this sampling path follows.
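The sketch below illustrates one plausible encoding of strategies and the path from sampled nodes to a meta prompt. The strategy weights, node options, and the m3_text helper are assumptions for illustration, not the system’s actual data structures.

```python
# Illustrative sketch of taxonomic sampling: strategies weight which
# sub-taxonomies may be combined, then sampled nodes condition meta-prompt
# generation.
import random

def m3_text(prompt: str) -> str:
    """Placeholder single-string model call."""
    raise NotImplementedError

# A strategy pairs compatible taxonomy nodes with a sampling weight.
strategies = [
    {"weight": 0.5,  # aim for an equal kid/adult split
     "nodes": {"cat type": ["house cat", "stray"],
               "story format": ["poem", "picture book"],  # no "horror" here
               "intended audience": ["toddler", "child"]}},
    {"weight": 0.5,
     "nodes": {"cat type": ["house cat", "stray", "lion"],
               "story format": ["poem", "horror novel"],
               "intended audience": ["travel enthusiast", "adult"]}},
]

def sample_meta_prompt(dataset_description: str) -> str:
    strategy = random.choices(strategies, [s["weight"] for s in strategies])[0]
    requirements = {factor: random.choice(options)
                    for factor, options in strategy["nodes"].items()}
    # The sampled requirements plus the dataset description y guide the M3
    # to construct a concrete meta prompt.
    return m3_text(
        f"Dataset description: {dataset_description}\n"
        f"Requirements: {requirements}\n"
        "Write one concrete generation prompt satisfying all requirements.")
```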
Optimizing Local Diversity and Complexity. Imagine we want to construct a dataset of size N = 100, and our factor and strategy selection has yielded T = 200 unique node-pairs. Since N < T, our sampling budget allows for at most 100 unique node-pairs with a single meta prompt each, resulting in a global coverage rate of 100/200 = 0.5. Conversely, for N > T, e.g., N = 800, we can sample up to four meta prompts for each requirement set. As the number of meta prompts per node-pair grows, we increase local diversity. However, as N/T grows, independently generating meta prompts from fixed requirements can lead to mode collapse, i.e., meta prompts that are increasingly similar. We mitigate this by generating multiple meta prompts simultaneously, prompting for maximum sample diversity, and then sub-sampling the required fraction. We call this approach “semantic expansion”. Next, we expand the complexity of a fraction of the samples by prompting the M3 to increase the complexity of the generated meta prompts and outputs while maintaining our semantic requirements. We refer to this as “complexity expansion”. Optimizing local diversity and complexity this way works well for smaller sample sizes, but degrades as N/T grows very large. Instead, for large N/T, Simula can be configured to iteratively prompt for more diverse or complex meta prompts with previous attempts in context, allowing the M3 to reflect on previous generations. A sketch of both regimes follows.
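The following is a hedged sketch of both regimes, assuming a placeholder m3_list call: semantic expansion generates a diverse batch in one call and sub-samples it, while the iterative variant keeps previous attempts in context for large N/T.

```python
# Sketch of "semantic expansion" and the iterative fallback for large N/T.
import random

def m3_list(prompt: str) -> list[str]:
    """Placeholder model call returning a list of meta prompts."""
    raise NotImplementedError

def semantic_expansion(requirements: dict, k: int, batch: int = 8) -> list[str]:
    # One joint call, prompted for maximum mutual diversity, avoids the
    # mode collapse of k independent calls from fixed requirements.
    candidates = m3_list(
        f"Requirements: {requirements}\n"
        f"Write {batch} maximally diverse generation prompts.")
    return random.sample(candidates, min(k, len(candidates)))

def iterative_expansion(requirements: dict, k: int) -> list[str]:
    # For large N/T: keep previous attempts in context so the M3 can
    # reflect on, and diverge from, its earlier generations.
    prompts: list[str] = []
    while len(prompts) < k:
        prompts += m3_list(
            f"Requirements: {requirements}\n"
            f"Previous prompts: {prompts}\n"
            "Write one prompt more diverse or complex than the above.")
    return prompts[:k]
```

The batch size of 8 is an arbitrary choice for the sketch; in practice it would trade off against context length and the sub-sampling fraction.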
Enhancing Sample Quality with Critics. Next, the system performs a series of agentic refinement steps to optimize sample output quality. It starts with point-wise checks to ensure the generated samples satisfy the specified semantic and syntactic requirements. This involves prompting the M3 to “critique” the generated samples: the M3 is given the meta prompt used for generation and asked for an explanation and a binary verdict. For example, given the generated sample for the adventurous house-cat haiku above, the M3 checks if the cat in the story is indeed a house cat, if the output is a haiku, and if adventures were had. For tasks requiring outputs with a defined notion of correctness (e.g., classification or multiple-choice questions), the system employs an additional “double critic” step, which independently assesses correctness and incorrectness to mitigate sycophancy bias (Sharma et al., 2024). If the M3 responds with a negative verdict during these “critic refinement” steps, the system either rejects the sample or applies automated modifications based on the explanation and then repeats the critique step.
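A simplified sketch of this rejection-sampling loop, including the double critic, is shown below; the verdict format, prompt wording, and repair budget (max_rounds) are our assumptions for illustration.

```python
# Sketch of critic refinement with a "double critic" for tasks that have a
# defined notion of correctness.
def m3_text(prompt: str) -> str:
    """Placeholder single-string model call."""
    raise NotImplementedError

def ask(prompt: str) -> tuple[bool, str]:
    """One critique call: explanation followed by 'VERDICT: YES' or 'VERDICT: NO'."""
    reply = m3_text(prompt)
    explanation, _, verdict = reply.rpartition("VERDICT:")
    return "YES" in verdict.upper(), explanation.strip()

def refine(meta_prompt: str, sample: str, has_answer: bool,
           max_rounds: int = 3) -> str | None:
    for _ in range(max_rounds):
        # Point-wise check against the semantic/syntactic requirements
        # encoded in the meta prompt.
        ok, why = ask(f"Does this sample satisfy every requirement of the "
                      f"meta prompt?\nMeta prompt: {meta_prompt}\n"
                      f"Sample: {sample}\nExplain, then 'VERDICT: YES/NO'.")
        if ok and has_answer:
            # Double critic: independently assess correctness and
            # incorrectness to mitigate sycophancy bias.
            correct, _ = ask(f"Is the answer in this sample correct?\n"
                             f"{sample}\nExplain, then 'VERDICT: YES/NO'.")
            wrong, why = ask(f"Is the answer in this sample incorrect?\n"
                             f"{sample}\nExplain, then 'VERDICT: YES/NO'.")
            ok = correct and not wrong
        if ok:
            return sample
        # Negative verdict: apply an automated modification based on the
        # explanation, then repeat the critique step.
        sample = m3_text(f"Revise the sample to fix this issue:\n{why}\n"
                         f"Sample:\n{sample}")
    return None  # reject after exhausting the repair budget
```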