Can synthetic data replace seed examples in task generation?
Can models generate high-quality synthetic data for novel tasks without relying on existing input-output exemplars? This matters because many specialized domains lack training examples to work from.
Most synthetic data generation methods require seed examples drawn from the target distribution — actual input-output pairs the model can mimic and extend. This requirement breaks for genuinely novel or highly domain-specific tasks, where no such instances are available. TarGEN proposes a four-step prompting strategy that is seedless in this sense: it does not require specific task instances, and it therefore broadens applicability to novel domains.
The core distinction is between seed examples (full input-output exemplars demonstrating the task) and instance seeds (atomic elements that form the unique basis of each generated instance). An instance seed can be a sentence, a passage, or a more atomic element — but crucially it is not an input exemplar. The generation process proceeds by initializing a set of contexts to inject semantic diversity, generating task-specific instance seeds, formulating per-seed label constraints, and producing a data instance attributable to the constrained label.
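The four-step process above can be sketched as a small pipeline. This is a minimal illustration, not TarGEN's actual implementation: `call_llm` is a stub standing in for any chat-completion API, and the prompts and function names are hypothetical.

```python
# Hypothetical sketch of the four-step seedless generation loop.
# call_llm is a placeholder for a real LLM endpoint; it is stubbed so the
# control flow runs end to end.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. a chat-completion API)."""
    return f"<llm output for: {prompt[:40]}>"

def generate_contexts(task_description: str, n: int) -> list[str]:
    """Step 1: initialize diverse contexts to inject semantic diversity."""
    return [call_llm(f"Give context #{i} for task: {task_description}")
            for i in range(n)]

def generate_instance_seed(context: str) -> str:
    """Step 2: produce a task-specific atom (sentence/passage), not an exemplar."""
    return call_llm(f"From this context, write one seed sentence: {context}")

def formulate_label_constraint(seed: str, label: str) -> str:
    """Step 3: build a per-seed constraint tying the seed to a target label."""
    return f"Seed: {seed}\nThe generated instance must warrant the label '{label}'."

def generate_instance(constraint: str, label: str) -> dict:
    """Step 4: produce an input whose gold output matches the constrained label."""
    text = call_llm(f"Write a task instance satisfying:\n{constraint}")
    return {"input": text, "label": label}

def targen_pipeline(task_description: str, labels: list[str], n_contexts: int = 2) -> list[dict]:
    instances = []
    for context in generate_contexts(task_description, n_contexts):
        for label in labels:
            seed = generate_instance_seed(context)
            constraint = formulate_label_constraint(seed, label)
            instances.append(generate_instance(constraint, label))
    return instances

data = targen_pipeline("textual entailment", ["entailment", "contradiction"])
print(len(data))  # 2 contexts x 2 labels = 4 instances
```

Note the design consequence: because the label is fixed before the input is written, label coverage is controllable by construction — every label appears for every context rather than emerging from whatever the model happens to generate.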
The clever move is the label constraint formulation. Rather than asking the model to produce input-output pairs from scratch (which requires examples to learn the distribution), TarGEN generates the input element first via the instance seed, then constrains the LLM to produce a corresponding output that matches a specified label. A self-correction module, in which the LLM rectifies inaccurately labeled instances during dataset creation, keeps labels reliable even without ground-truth data to validate against.
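The self-correction step can be illustrated as a re-check pass over each generated instance. This is a hedged sketch: the `verify` callable (an LLM prompt that re-derives the label from the input alone in TarGEN's setting) is replaced here by a toy rule so the logic is testable.

```python
def self_correct(instance: dict, verify) -> dict:
    """Re-check a generated instance's label and rectify it on mismatch.

    `verify` is any callable mapping an input text to a label; in the real
    pipeline this would be a second LLM prompt that ignores the original
    label constraint and judges the input on its own.
    """
    proposed = verify(instance["input"])
    if proposed != instance["label"]:
        # Keep the second-pass judgment rather than the constrained label.
        return {**instance, "label": proposed, "corrected": True}
    return {**instance, "corrected": False}

# Toy verifier for illustration: any input containing "not" is a contradiction.
toy_verify = lambda text: "contradiction" if "not" in text else "entailment"

fixed = self_correct({"input": "A cat is not a dog.", "label": "entailment"},
                     toy_verify)
print(fixed["label"], fixed["corrected"])  # contradiction True
```

The key property is that correction needs no gold labels: it only requires the generator's label and an independent second judgment to disagree.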
The empirical results show this is not just theoretically appealing. On eight SuperGLUE tasks, models trained on the synthetic version perform 1-3 points higher than those trained on the original datasets, and Llama2 (7B) pre-finetuned on synthetic SuperGLUE surpasses the Self-Instruct dataset baseline by 2.62 points on the OpenLLM leaderboard. The synthetic data shows comparable or higher complexity and diversity, with similar bias levels to original data.
The structural contribution is recognizing that "no seed data" conflates two distinct claims in prior work: no input-output exemplars (which TarGEN achieves) and no per-instance task material at all (which TarGEN does not claim, since instance seeds are still task-specific atoms). Distinguishing these clarifies what kind of data generation is possible without prior task examples and what still requires task-specific scaffolding — a distinction that "Can we generate synthetic data without any seed examples?" pushes further by replacing instance seeds with taxonomy nodes.
Source: Data
Related concepts in this collection
- Can we generate synthetic data without any seed examples?
  Existing synthetic data methods rely on seed examples from the target distribution, which is impractical for novel domains. Can taxonomic decomposition eliminate this dependence while maintaining controllable coverage?
  extends: companion piece — TarGEN replaces input-output exemplars; Simula replaces instance seeds with taxonomies — same direction, different granularity
- How do quality, diversity, and complexity affect synthetic data differently?
  When training models on synthetic data, do quality, diversity, and complexity each play distinct roles in how well models generalize? Understanding their separate effects could explain why current optimization strategies fail.
  complements: TarGEN reports comparable QDC to original data; this note tells you which dimensions to look at when comparing
- Can synthetic dialogues become realistic through layered diversity?
  Explores whether combining persona variation, subtopic specificity, and contextual grounding can generate synthetic dialogues that match real conversational data quality and capture the full spectrum of dialogue diversity.
  exemplifies: instance-seed-style decomposition applied to dialogue — atomic elements (persona × subtopic × context) drive diversity
- Can models trained on many imperfect experts outperform each one?
  Do generative models trained on diverse, imperfect human experts develop an implicit consensus that surpasses any individual contributor? This explores whether aggregating diverse perspectives at training time, rather than inference time, can denoise human biases.
  complements: synthetic data as denoising signal — TarGEN's self-correction module operates at a similar denoising layer
- Why do different LLMs generate nearly identical outputs?
  Explores whether diversity in model architectures and training actually produces diverse ideas, or whether shared alignment procedures and training data cause convergence on similar responses.
  tension: instance seeds inject atomic-level variation, but the generator's hivemind tendencies may collapse downstream diversity unless explicitly controlled
Original note title
instance seeds replace input exemplars in synthetic data generation — atomic elements like sentences or passages permit task replication without requiring existing data instances