Does training data format matter more than who generates it?

This note explores whether the *form* of training data (how it is structured and presented) shapes a model's behavior more than the *source* of that data (self vs. external teacher, human vs. synthetic).

The corpus suggests both axes matter, but in different ways. The surprising answer is that format effects can be enormous, while source effects turn out to be less about "who is stronger" and more about "who fits the learner."

The sharpest evidence for format comes from work showing that how data is presented shapes reasoning strategy roughly 7.5 times more than the domain it covers: multiple-choice formatting pushes models toward breadth-first exploration, while free-form data produces depth-first reasoning (Does training data format shape reasoning strategy more than domain?). Format isn't just a packaging choice; it gets baked into *how the model thinks*. Reinforcement learning then doubles down on this: within the first epoch, RL tends to collapse onto a single dominant format inherited from pretraining and suppress the alternatives, and the winner depends on model scale, not necessarily on performance (Does RL training collapse format diversity in pretrained models?). So format isn't just influential; it's a channel through which training silently narrows behavior.

But the "who generates it" question has its own twist, and it cuts against the intuition that a stronger source is always better. Models often learn *more* from data they generate themselves than from data produced by a stronger external model: SEAL lifts QA accuracy from 33.5% to 47.0%, which suggests that self-restructured data matches the learner's own representational needs (Does self-generated training data improve model learning?). The same logic explains why teacher-refined data can *hurt*: refinements that exceed the student's learning frontier degrade performance even when they're objectively higher quality, so students should filter for compatibility rather than absorb everything (Does teacher-refined data always improve student model performance?). In other words, source matters, but as a question of *fit to the learner*, not raw strength. The flip side appears at scale: with enough teacher-labeled data, a small BERT cross-encoder can actually surpass its LLM teacher, because broad input coverage smoothed by teacher predictions generalizes better (Can smaller models outperform their LLM teachers with enough data?).

So neither axis dominates cleanly; they interact. The generation *method* (a format-like property) often matters more than the generator's identity. Aligned models can self-synthesize human-quality instruction data from nothing but formatting tokens (Can aligned LLMs generate their own training data?), and synthetic generation succeeds or fails based on structural choices: seeding atomic task elements instead of full examples (Can synthetic data replace seed examples in task generation?), or sampling tools from relevance graphs with planned dialogue instead of random composition (Why does random tool sampling produce unrealistic synthetic training data?). Difficulty calibration, another format-adjacent property, can quietly corrupt capabilities when overly hard samples push models toward degenerate shortcuts (Do overly hard RLVR samples actually harm model capabilities?).
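The shortcut problem from overly hard samples can be made concrete with a little arithmetic. The sketch below assumes the common group-relative scheme in which each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. The function name, the binary rewards, and the group size of eight are illustrative assumptions, not details taken from the cited work.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group: (r - mean) / std.

    Hypothetical helper for illustration; eps only guards against a zero std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Assumed group of 8 rollouts on a nearly impossible problem:
# seven fail (reward 0) and one succeeds by accident (reward 1).
hard_group = [0, 0, 0, 0, 0, 0, 0, 1]

# Assumed group on a moderately hard problem where half the rollouts succeed.
medium_group = [0, 1, 0, 1, 1, 0, 1, 0]

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)

print(f"lucky success on hard problem: {max(hard_adv):.2f}")
print(f"success on medium problem:     {max(medium_adv):.2f}")
```

Under these assumptions the accidental success on the hard problem receives an advantage of about 2.65, against 1.00 for a success on the medium problem. Whatever the lucky trajectory happened to do, including answer repetition or skipped computation, is therefore reinforced more strongly than a typical correct solution.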

The one place where *source* clearly trumps format is at the extremes of data provenance. Recursive training on AI-generated content causes irreversible collapse of the distribution's tail no matter how well-formatted it is, making genuine human data increasingly precious (Does training on AI-generated content permanently degrade model quality?). And a deeper view reframes the whole debate: if language modeling is equivalent to lossless compression, then what training data really teaches is general structure, not domain content, which is why text-only models can out-compress dedicated image tools (Can text-trained models compress images better than specialized tools?). The takeaway: format isn't surface decoration and source isn't a quality ranking. Both are really proxies for *what structure the learner can absorb*, and that's the variable doing the real work.
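Why tail collapse is irreversible can be seen in a toy resampling loop. This is a minimal sketch, not the experimental setup of the cited work: the "model" is just the empirical token distribution of the previous generation, the loop is purely synthetic (a simplification of the mixed real and synthetic settings the source note describes), and the corpus, token names, seed, and generation count are all assumptions chosen for illustration.

```python
import random

def next_generation(corpus, rng):
    """Stand-in for train-then-sample: draw a same-sized corpus
    from the empirical distribution of the previous one."""
    return rng.choices(corpus, k=len(corpus))

rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability

# Hypothetical starting corpus: one very common token plus a tail of
# 50 rare tokens that each appear only twice.
corpus = ["common"] * 900 + [f"rare_{i}" for i in range(50) for _ in range(2)]

# Track how many distinct tokens survive in each generation.
support_sizes = [len(set(corpus))]
for _ in range(10):
    corpus = next_generation(corpus, rng)
    support_sizes.append(len(set(corpus)))

print(support_sizes)
```

Because each generation can only emit tokens that survived into the previous one, the count of distinct tokens never increases: a rare token that misses a single round of sampling cannot come back, however the data is formatted.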

Sources (11 notes)

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. The format effect dwarfs the domain effect, meaning presentation matters far more than content type.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does self-generated training data improve model learning?

SEAL demonstrates that models learn better from synthetic data they generate themselves than from data created by stronger external models. Self-generated data improved QA performance from 33.5% to 47.0%, suggesting that model-specific restructuring aligns with the learner's representational needs.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Can smaller models outperform their LLM teachers with enough data?

Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.

Can aligned LLMs generate their own training data?

MAGPIE shows that aligned models like Llama-3-Instruct auto-regressively generate diverse, high-quality instructions when given only pre-query formatting tokens, without prompt engineering. 4M generated pairs matched human-curated datasets in quality and outperformed external sources in downstream fine-tuning.

Can synthetic data replace seed examples in task generation?

TarGEN generates synthetic data using atomic task elements (instance seeds) instead of full input-output examples, achieving 1-3 point improvements on SuperGLUE tasks. The approach works by constraining label generation after seeding inputs, enabling data creation for domains with no prior examples.

Why does random tool sampling produce unrealistic synthetic training data?

Random tool sampling fails because unrelated tools cannot credibly compose, and Q&A framing ignores multi-turn dialogue coherence. ToolFlow shows that sampling tools from relevance graphs and generating with dialogue plans closes this gap.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does training on AI-generated content permanently degrade model quality?

Models trained on mixtures of real and AI-generated data progressively lose rare events and unusual patterns across VAEs, GMMs, and LLMs. Each generation compounds the loss, making genuine human data increasingly valuable.

Can text-trained models compress images better than specialized tools?

Chinchilla models trained exclusively on text achieve better compression rates on images and audio than FLAC and PNG by using their context window to adapt as task-specific compressors. This demonstrates that generalization operates through compression, not specialization.
