
Can synthetic data replace seed examples in task generation?

Can models generate high-quality synthetic data for novel tasks without relying on existing input-output exemplars? This matters because many specialized domains lack training examples to work from.

Note · 2026-05-03 · sourced from Data

Most synthetic data generation methods require seed examples drawn from the target distribution: actual input-output pairs the model can mimic and extend. This requirement breaks down for genuinely novel or highly domain-specific tasks, where no such instances exist to draw from. TarGEN proposes a four-step prompting strategy that is seedless in this sense: it does not require specific task instances, which broadens its applicability to novel domains.

The core distinction is between seed examples (full input-output exemplars demonstrating the task) and instance seeds (atomic elements that form the unique basis of each generated instance). An instance seed can be a sentence, a passage, or a more atomic element, but crucially it is not an input exemplar. Generation then proceeds in four steps:

1. initialize a set of contexts to inject semantic diversity;
2. generate task-specific instance seeds;
3. formulate per-seed label constraints;
4. produce a data instance attributable to the constrained label.
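The four steps can be sketched as a small pipeline. Everything below is an illustrative assumption: the prompt wording, the function names, and the `llm` callable stand in for whatever model client and templates are actually used; TarGEN's real prompts differ.

```python
def generate_dataset(llm, task_description, contexts, labels, n_per_context=1):
    """Seedless four-step sketch: contexts -> instance seeds -> label
    constraints -> labeled instances. `llm` is any str -> str callable."""
    dataset = []
    for context in contexts:                    # step 1: contexts for semantic diversity
        for _ in range(n_per_context):
            seed = llm(                         # step 2: task-specific instance seed
                f"Within the domain '{context}', write one sentence or short "
                f"passage usable as raw material for this task: {task_description}"
            )
            for label in labels:                # step 3: per-seed label constraint
                constraint = (
                    f"The result must be a valid instance of label '{label}' "
                    f"for the task: {task_description}"
                )
                instance = llm(                 # step 4: constrained instance generation
                    f"Instance seed: {seed}\nConstraint: {constraint}\n"
                    "Produce the full input-output pair."
                )
                dataset.append({"seed": seed, "label": label, "instance": instance})
    return dataset
```

Note that no exemplar of the task ever enters a prompt: the only per-instance material is the seed the model itself produced in step 2.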

The clever move is the label constraint formulation. Rather than asking the model to produce input-output pairs from scratch (which requires examples to learn the distribution), TarGEN generates the input element first via the instance seed, and then constrains the LLM to produce a corresponding output that matches a specified label. Augmenting this with a self-correction module that lets the LLM rectify inaccurately labeled instances during dataset creation produces reliable labels even without ground-truth data to validate against.
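A minimal sketch of such a self-correction pass, assuming a hypothetical LLM-as-judge `classify` callable (not TarGEN's actual prompt): the model re-reads each generated instance and predicts its label, and disagreements with the constrained label are resolved in favor of the verification pass.

```python
def self_correct(classify, dataset):
    """Relabel instances whose LLM-verified label disagrees with the
    label they were generated under. `classify` maps instance -> label."""
    corrected = []
    for item in dataset:
        predicted = classify(item["instance"])   # LLM re-labels the instance
        if predicted == item["label"]:
            corrected.append(item)               # constraint held: keep as-is
        else:
            corrected.append(dict(item, label=predicted))  # trust the check
    return corrected
```

Dropping disagreeing instances instead of relabeling them is an equally plausible policy; either way, no ground-truth labels are needed, only a second pass of the same model.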

The empirical results show this is not just theoretically appealing. On eight SuperGLUE tasks, models trained on the synthetic version perform 1-3 points higher than those trained on the original datasets, and Llama2 (7B) pre-finetuned on synthetic SuperGLUE surpasses the Self-Instruct dataset baseline by 2.62 points on the OpenLLM leaderboard. The synthetic data shows comparable or higher complexity and diversity, with similar bias levels to original data.

The structural contribution is that "no seed data" is two distinct claims that prior work conflated: no input-output exemplars (which TarGEN achieves) and no per-instance task material at all (which TarGEN does not claim, since instance seeds are still task-specific atoms). Distinguishing these clarifies what kind of data generation is possible without prior task examples and what still requires task-specific scaffolding, a distinction the linked note "Can we generate synthetic data without any seed examples?" pushes further by replacing instance seeds with taxonomy nodes.

