Reasoning-Driven Synthetic Data Generation and Evaluation

Paper · arXiv 2603.29791 · Published March 31, 2026

Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution – limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties.

Introduction. Data availability and access have been central to advances in artificial intelligence research. In recent years, the abundance of highly diverse internet data enabled the development of increasingly capable generalist models (Gemini et al., 2023; OpenAI et al., 2023; Anthropic, 2024; Touvron et al., 2023). Despite these models’ impressive versatility, widespread integration will require them to specialize in novel, uncommon, and privacy-sensitive applications. Unfortunately, specialized data in these areas is often intrinsically scarce or inaccessible, motivating enormous investments by frontier research labs (Paul & Tong, 2024; Wiggers, 2024; Cottier et al., 2025) and the rapid rise of dedicated “data foundries” (Liu, 2025; Vinn & Hu, 2025). However, creating specialized datasets manually is expensive, time-consuming, and error-prone (Chen et al., 2023; Gilardi et al., 2023; Hosking et al., 2024), leading many to consider synthetic data as a promising, scalable alternative (Singh et al., 2024a; Abdin et al., 2024; Guo et al., 2025).

Discussion / Conclusion. Synthetic Data Generation Has No Single Optimal Solution. “Data” is a frozen reflection of reality as it was or could be. No wonder, then, that there is no single “optimal” way to generate an infinite number of possibilities. Our extensive experiments show that the impact of different data properties, e.g., complexity and diversity, depends on the target domain, the model, use case, scale, and likely many other factors. Instead of looking for silver bullets, we should thus design our synthetic data systems to be as flexible as the worlds we intend to capture. With Simula we offer a system that maintains explainability and control at scale, giving practitioners the tools to customize synthetic data to fit their unique requirements.

Synthetic Data Evaluation is a Multi-faceted Challenge. The evaluation of synthetic data is fundamentally challenging due to the ambiguity of its core objectives, the coarse level of existing metrics, and its disconnect from practical context. Key properties describing “good data” are ambiguously defined and inherently entangled. For instance, one could argue that covering rare domain instances falls under “diversity.”
