Tags: LLM Reasoning and Architecture · Agentic and Multi-Agent Systems · Reinforcement Learning for LLMs

Can we generate synthetic data without any seed examples?

Existing synthetic data methods rely on seed examples from the target distribution, which is impractical for novel domains. Can taxonomic decomposition eliminate this dependence while maintaining controllable coverage?

Note · 2026-05-03 · sourced from Data

Existing synthetic data generation methods generally fall into two categories: prompt-engineered approaches, which generalize poorly because they require manual customization for each task, and stochastic evolutionary algorithms, which lack explainability and control. Both typically require seed examples drawn from the target distribution, which is unrealistic for genuinely novel domains and can hurt global coverage by anchoring generation to the existing examples.

Simula proposes a different decomposition: separate global coverage from local diversity, and address each through a different mechanism. For global coverage, the system constructs a synthetic taxonomy by alternating between three steps — Best-of-N proposal of child nodes given context, separate critique-only calls that exploit the generator-critic gap in LLMs, and level-completion planning to ensure consistent granularity across siblings. The resulting taxonomy provides granular, explainable control: every dataset characteristic maps to a tree node, so users can see and adjust what is covered.
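The three-step loop above can be sketched as follows. This is a minimal illustration, not Simula's implementation: `generate_children` and `critique_children` stand in for LLM calls (here deterministic stubs), and the Best-of-N selection keeps whichever candidate child set a separate critique call scores highest, mirroring the generator-critic split.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list["Node"] = field(default_factory=list)

# --- Stubbed LLM calls (a real system would prompt a model here) ---

def generate_children(path, n, rng):
    """Propose n child labels given the taxonomy path so far (stub)."""
    return [f"{path[-1]}/topic-{rng.randrange(100)}" for _ in range(n)]

def critique_children(path, labels):
    """Separate critique-only call scoring a candidate child set.
    Stubbed as label-set diversity; stands in for an LLM critic."""
    return len(set(labels))

def build_level(node, path, n=3, best_of=4, rng=None):
    """Expand one taxonomy node by one level."""
    rng = rng or random.Random(0)
    # Best-of-N proposal: sample several candidate child sets...
    candidates = [generate_children(path, n, rng) for _ in range(best_of)]
    # ...then a critique-only call picks the best set (generator-critic gap).
    best = max(candidates, key=lambda c: critique_children(path, c))
    # Level-completion planning: cap the set so siblings share granularity.
    node.children = [Node(label) for label in best[:n]]
    return node

root = build_level(Node("medicine"), ["medicine"])
```

Recursing `build_level` over each new child grows the full taxonomy; the tree structure is what makes coverage inspectable node by node.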

For local diversity and complexity, Simula uses agentic refinement after taxonomic sampling. Sampling strategies define which sub-taxonomies combine sensibly (a horror novel about a troubled cat for toddlers should be filtered out), and "semantic expansion" generates multiple meta-prompts per combination to mitigate mode collapse when the requested sample count exceeds the number of unique node pairs. Quality control happens through pointwise critique with binary verdicts, plus double-critic rejection sampling for tasks with defined correctness, mitigating sycophancy bias.
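A sketch of the sampling and filtering stage, under stated assumptions: `compatible` is a hypothetical compatibility filter, `semantic_expansion` fans one node pair out into several distinct meta-prompts when demand exceeds the pair count, and `double_critic_accept` keeps a sample only if two independent binary critics both accept it.

```python
import itertools

def compatible(a, b):
    """Compatibility filter: drop node pairs that don't combine
    sensibly (stub; e.g. 'horror' x 'toddlers' is rejected)."""
    return {a, b} != {"horror", "toddlers"}

def semantic_expansion(pair, n):
    """Expand one node pair into n distinct meta-prompts so repeated
    draws from the same pair don't collapse to one mode (stub)."""
    return [f"Write about {pair[0]} x {pair[1]} (angle {i})" for i in range(n)]

def double_critic_accept(sample, critics):
    """Double-critic rejection sampling: accept only if every
    independent binary critic says yes, mitigating sycophancy bias."""
    return all(critic(sample) for critic in critics)

def sample_prompts(nodes, n_samples):
    """Pair up taxonomy nodes, filter incompatible combinations,
    and expand pairs until n_samples meta-prompts exist."""
    pairs = [p for p in itertools.combinations(nodes, 2) if compatible(*p)]
    per_pair = -(-n_samples // len(pairs))  # ceiling division
    prompts = [m for p in pairs for m in semantic_expansion(p, per_pair)]
    return prompts[:n_samples]
```

With four nodes and one incompatible pair, `sample_prompts(nodes, 12)` draws from five valid pairs and expands each just enough to cover the request, illustrating how coverage (which pairs) stays separate from variation (how each pair is expanded).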

The architectural insight is that "good" synthetic data has irreducibly multiple desiderata — quality, diversity, complexity — and that previous approaches optimized only subsets because they used a single mechanism, exactly the problem "How do quality, diversity, and complexity affect synthetic data differently?" diagnoses. Decomposing coverage (global) from variation (local) and using a different mechanism for each makes all three controllable simultaneously. This unlocks data generation for domains where seed data does not exist, which are precisely the domains where synthetic data is most needed (medicine, finance, law).



seedless synthetic data generation through taxonomic decomposition replaces seed-data dependence with explainable global coverage control