Can we generate synthetic data without any seed examples?
Existing synthetic data methods rely on seed examples from the target distribution, which is impractical for novel domains. Can taxonomic decomposition eliminate this dependence while maintaining controllable coverage?
Existing synthetic data generation methods generally fall into two categories: prompt-engineered approaches, which generalize poorly because they require manual customization for each task, and stochastic evolutionary algorithms, which lack explainability and control. Both typically require seed examples drawn from the target distribution, which is unrealistic for genuinely novel domains and can hurt global coverage by anchoring generation to the existing examples.
Simula proposes a different decomposition: separate global coverage from local diversity, and address each through a different mechanism. For global coverage, the system constructs a synthetic taxonomy by alternating between three steps: Best-of-N proposal of child nodes given context, separate critique-only calls that exploit the generator-critic gap in LLMs, and level-completion planning to ensure consistent granularity across siblings. The resulting taxonomy provides granular, explainable control: every dataset characteristic maps to a tree node, so users can see and adjust what is covered.
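The three-step taxonomy-construction loop can be sketched as follows. This is a minimal illustration, not Simula's actual implementation: `propose_children`, `critique`, and `complete_level` are hypothetical stand-ins that would each be an LLM call in the real system, and the deterministic stubs here exist only to show the control flow (propose N candidate child sets, pick the best via a separate critique pass, then normalize the level before recursing).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def propose_children(parent, path, n=4):
    # Best-of-N proposal: return n candidate child-node sets.
    # Stub; in practice each candidate set comes from an LLM completion
    # conditioned on the path from the root to `parent`.
    return [[f"{parent.label}-{i}-{j}" for j in range(3)] for i in range(n)]

def critique(candidate_set, path):
    # Separate critique-only call: score a candidate set without generating,
    # exploiting the generator-critic gap. Stub scoring; an LLM critic would
    # judge coverage and granularity here.
    return -len(candidate_set[0])

def complete_level(children, path):
    # Level-completion planning: ensure consistent granularity across
    # siblings. Stub passes the set through unchanged.
    return children

def expand(node, path=(), depth=2):
    if depth == 0:
        return
    candidates = propose_children(node, path)
    best = max(candidates, key=lambda c: critique(c, path))
    for label in complete_level(best, path):
        child = Node(label)
        node.children.append(child)
        expand(child, path + (node.label,), depth - 1)

root = Node("root")
expand(root, depth=2)
```

Because every generated characteristic becomes a `Node`, the finished tree is itself the coverage report: pruning or adding a subtree directly edits what the dataset will contain.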
For local diversity and complexity, Simula uses agentic refinement after taxonomic sampling. Sampling strategies define which sub-taxonomies combine sensibly (a horror novel about a troubled cat, written for toddlers, should be filtered out), and "semantic expansion" generates multiple meta-prompts simultaneously to mitigate mode collapse when the requested sample count exceeds the number of unique node pairs. Quality control happens through pointwise critique with binary verdicts and, for tasks with defined correctness, double-critic rejection sampling, which mitigates sycophancy bias.
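The sampling-and-filtering stage can be sketched like this. All names here are hypothetical: the taxonomy leaves, the `compatible` rule, the prompt templates, and the two critic stubs are illustrative stand-ins (real critics would be independent LLM judges returning binary verdicts), chosen only to show how pair filtering, semantic expansion, and double-critic rejection compose.

```python
import itertools

# Hypothetical leaves from two sub-taxonomies.
GENRES = ["horror", "comedy", "picture-book"]
AUDIENCES = ["adults", "teens", "toddlers"]

def compatible(genre, audience):
    # Sampling strategy: filter node pairs that don't combine sensibly,
    # e.g. horror content aimed at toddlers.
    return not (genre == "horror" and audience == "toddlers")

def semantic_expansion(pair, k):
    # When the requested sample count exceeds the number of unique node
    # pairs, vary the meta-prompt per pair instead of re-sampling the same
    # pair verbatim, mitigating mode collapse.
    genre, audience = pair
    return [f"Write a {genre} story for {audience} (variant {i})"
            for i in range(k)]

def double_critic_accept(sample):
    # Double-critic rejection sampling: keep a sample only if two
    # independent binary critics both say yes (stubbed checks here).
    critic_a = "variant" in sample
    critic_b = len(sample) > 10
    return critic_a and critic_b

pairs = [p for p in itertools.product(GENRES, AUDIENCES) if compatible(*p)]
requested = 16
per_pair = -(-requested // len(pairs))  # ceiling division
prompts = [m for p in pairs for m in semantic_expansion(p, per_pair)]
accepted = [p for p in prompts[:requested] if double_critic_accept(p)]
```

The design point is that diversity is injected twice at different scales: once by enumerating compatible node pairs (coverage), and again by expanding each pair into distinct meta-prompts (local variation), with the critics acting only as a final gate.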
The architectural insight is that "good" synthetic data has irreducibly multiple desiderata (quality, diversity, complexity), and that previous approaches optimized only subsets because each relied on a single mechanism, exactly the problem that "How do quality, diversity, and complexity affect synthetic data differently?" diagnoses. Decomposing coverage (global) from variation (local) and using a different mechanism for each makes all three controllable simultaneously. This unlocks data generation for domains where seed data does not exist, which are precisely the domains where synthetic data is most needed (medicine, finance, law).
Source: Data
Related concepts in this collection
-
How do quality, diversity, and complexity affect synthetic data differently?
When training models on synthetic data, do quality, diversity, and complexity each play distinct roles in how well models generalize? Understanding their separate effects could explain why current optimization strategies fail.
extends: Simula is a concrete architecture for the QDC framework's prescriptions — different mechanisms per desideratum
-
Can synthetic data replace seed examples in task generation?
Can models generate high-quality synthetic data for novel tasks without relying on existing input-output exemplars? This matters because many specialized domains lack training examples to work from.
extends: TarGEN replaces input-output exemplars with instance seeds; Simula replaces instance seeds with taxonomies — three-step progression away from seed dependence
-
Can synthetic dialogues become realistic through layered diversity?
Explores whether combining persona variation, subtopic specificity, and contextual grounding can generate synthetic dialogues that match real conversational data quality and capture the full spectrum of dialogue diversity.
exemplifies: dialogue-domain instance of the same global-vs-local decomposition Simula generalizes
-
Should persona simulation prioritize coverage over statistical matching?
Explores whether stress-testing AI systems requires spanning rare user configurations rather than replicating aggregate population statistics. Critical for identifying edge-case failures.
complements: same coverage-vs-density distinction at the persona-generation level
-
Why do different LLMs generate nearly identical outputs?
Explores whether diversity in model architectures and training actually produces diverse ideas, or whether shared alignment procedures and training data cause convergence on similar responses.
tension: Simula's mode-collapse mitigation via semantic expansion targets exactly the hivemind tendency, but coverage-by-taxonomy still depends on the generator's own taxonomic intuitions
Original note title
seedless synthetic data generation through taxonomic decomposition replaces seed-data dependence with explainable global coverage control