How does training data distribution determine what models can learn?

This explores how the makeup of training data — what's in it, what's missing, and how it overlaps with what a model already knows — sets the ceiling on what a model can actually learn, rather than just how much data it sees.

The corpus pushes back hard on the intuition that more data, or higher-quality data, automatically means more learning. The decisive variable turns out to be the *relationship* between the data and the model's current state.

The sharpest version of this is the finding that a sample's teaching value isn't fixed: it depends on the gap between the problem's difficulty and the model's present ability. Medium-difficulty problems sit in a productive band that drifts as the model improves, so the same dataset teaches different things at different moments (see "How does model ability change what samples teach?"). Push too far past that band and learning inverts: nearly-impossible problems don't stretch a model, they teach degenerate shortcuts such as answer repetition and skipped computation, which then contaminate skills the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). This is why curated *subsets* can beat full datasets. Selecting the 5% of instruction examples whose gradients align with a target capability outperforms training on everything, because mixed data contains examples that actively pull reasoning strategy in the wrong direction (see "Can we train better models on less data?"). Even objectively better data hurts when it lands beyond the student's learning frontier: quality is relative to the learner, not absolute (see "Does teacher-refined data always improve student model performance?").

Distribution also governs *what dimension* of the data gets absorbed. Chain-of-thought training is a striking case: models tolerate 50% corrupted numbers with only a 3.2% accuracy loss but lose 13.3% when reasoning steps are shuffled, meaning that what distills across examples is logical structure, the sequencing of steps, not factual content (see "What do models actually learn from chain-of-thought training?"). Similarly, the *style* embedded in training traces propagates: teachers conditioned on the right answer produce confident, terse demonstrations, and students inherit that confidence, gaining in-domain sharpness while losing the epistemic caution needed for out-of-distribution problems (see "Does richer teacher context hurt student generalization?"). The distribution doesn't just carry facts; it carries habits.

A second thread is how training *reshapes* the distribution the model already carries from pretraining, narrowing the space of what it can express. Reinforcement learning doesn't invent new formats: it amplifies one dominant format already latent in pretraining and suppresses the alternatives within the first epoch, with the winner depending on model scale and not necessarily on performance (see "Does RL training collapse format diversity in pretrained models?"). Outcome-based reward sharpens the policy globally, and the resulting loss of diversity bleeds from solved problems onto unsolved ones, shrinking the exploration a model needs to discover anything new (see "Does outcome-based RL diversity loss spread across unsolved problems?"). Staying close to the base distribution, that is, keeping KL drift low, preserves the plasticity to keep learning later tasks, while parameter-only methods that drift far stall out when the domain shifts (see "Does staying close to the base model preserve learning ability?").

The deepest cut is that what looks like *learning* may just be the training distribution showing through. LLMs perform better on datasets released before their training cutoff than on those released after it, and on genuinely unseen tasks they rarely beat simple baselines, suggesting the few-shot skill was partly an illusion of having already seen the test (see "How much of LLM few-shot ability comes from training data?"). Distribution fit can also flip the hierarchy entirely: a small BERT student trained on enough teacher-labeled data beats its LLM teacher, because the student's broader exposure to the input distribution generalizes better than the teacher did (see "Can smaller models outperform their LLM teachers with enough data?"). Most intriguingly, models learn better from data they generate themselves than from a stronger external model's output. Self-generated data is restructured to fit the learner's own representations, suggesting the question isn't just what's in the distribution, but whether the distribution speaks the learner's native language (see "Does self-generated training data improve model learning?").

Sources 12 notes

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.
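The drifting band can be illustrated with a small sketch. It assumes difficulty is measured as the model's current pass rate on each problem and that the productive band is the range 0.2 to 0.8; both the thresholds and the pass rates below are hypothetical values chosen for illustration, not figures from the note.

```python
def productive_band(pass_rates, low=0.2, high=0.8):
    """Return the ids of problems whose current pass rate lies in [low, high].

    The thresholds are illustrative assumptions, not values from the source.
    """
    return sorted(pid for pid, rate in pass_rates.items() if low <= rate <= high)


# Hypothetical pass rates for the same four problems at two checkpoints.
early = {"p1": 0.95, "p2": 0.50, "p3": 0.10, "p4": 0.00}
later = {"p1": 1.00, "p2": 0.90, "p3": 0.45, "p4": 0.05}

print(productive_band(early))  # ['p2']
print(productive_band(later))  # ['p3']
```

The dataset is identical at both checkpoints, yet the selected problem changes, which is why a difficulty estimate computed once would go stale as training proceeds.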

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
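A rough sketch of why a rare success gets amplified, assuming the common form of group-relative normalization in which each rollout's reward is standardized against the mean and standard deviation of its own group. The note does not spell out the exact formula, so this form, the group size of 16, and the binary rewards are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups of 16 rollouts with binary (0/1) outcome rewards.
hard = [1.0] + [0.0] * 15       # one accidental success on a near-impossible problem
medium = [1.0] * 8 + [0.0] * 8  # half the rollouts succeed

adv_hard = group_relative_advantages(hard)
adv_medium = group_relative_advantages(medium)

print(round(adv_hard[0], 2))    # 3.87
print(round(adv_medium[0], 2))  # 1.0
```

Under these assumptions the lone success on the hard problem receives almost four times the advantage of a success on the medium problem, so whatever that trajectory happened to do, including a lucky shortcut, is reinforced strongly.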

Can we train better models on less data?

LESS uses low-rank gradient features to select instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.
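The selection idea can be sketched as ranking examples by the similarity between their gradient features and a target feature vector, then keeping the top fraction. This is a simplified illustration and not the LESS implementation: the three-dimensional feature vectors, the example names, and the use of plain cosine similarity are all assumptions made for the sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_top_fraction(features, target, fraction=0.05):
    """Keep the examples whose features align best with the target direction."""
    k = max(1, int(len(features) * fraction))
    ranked = sorted(features, key=lambda name: cosine(features[name], target), reverse=True)
    return ranked[:k]


# Hypothetical low-dimensional gradient features for four training examples.
features = {
    "ex_a": [0.9, 0.1, 0.0],
    "ex_b": [-0.8, 0.2, 0.1],  # points away from the target direction
    "ex_c": [0.1, 0.9, 0.2],
    "ex_d": [0.7, 0.3, 0.1],
}
target = [1.0, 0.0, 0.0]  # hypothetical feature of the target capability

print(select_top_fraction(features, target, fraction=0.5))  # ['ex_a', 'ex_d']
```

In this toy setup `ex_b` has negative similarity to the target, which mirrors the note's point that some examples in a mixed dataset pull against the skill being trained.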

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

What do models actually learn from chain-of-thought training?

Controlled ablations show models tolerate 50% corrupted numbers (3.2% accuracy loss) but fail under step shuffling (13.3% loss). What distills across reasoning demonstrations is logical architecture—how steps sequence and connect—not factual accuracy.
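The two perturbations being compared can be sketched as follows. This is an illustrative reconstruction, not the ablation code from the study: the example trace, the 0 to 99 replacement range, and the regex-based notion of a "number" are assumptions.

```python
import random
import re


def corrupt_numbers(steps, rate, rng):
    """Replace each number in each step with a random one with probability `rate`.

    Note: this simple regex also touches step labels such as "Step 1";
    a real ablation would likely exclude them.
    """
    def swap(match):
        return str(rng.randint(0, 99)) if rng.random() < rate else match.group(0)
    return [re.sub(r"\d+", swap, step) for step in steps]


def shuffle_steps(steps, rng):
    """Return the same steps in a random order, leaving their text intact."""
    shuffled = list(steps)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical reasoning trace used only for illustration.
trace = [
    "Step 1: 12 apples plus 8 apples is 20 apples.",
    "Step 2: 20 apples split among 4 people is 5 each.",
    "Step 3: The answer is 5.",
]
rng = random.Random(0)
numbers_corrupted = corrupt_numbers(trace, rate=0.5, rng=rng)
order_corrupted = shuffle_steps(trace, rng)
```

The first perturbation damages factual content while keeping the order of steps; the second keeps every fact and breaks only the order, which is the one the reported results show to be more harmful.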

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
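Drift from the base distribution is commonly quantified with KL divergence. The sketch below assumes the measured quantity is KL(tuned || base) over a discrete next-token distribution; the direction of the divergence and the three-token distributions are assumptions for illustration, and the note does not say how its "70% closer" figure was computed.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a three-token vocabulary.
base = [0.50, 0.30, 0.20]
low_drift = [0.55, 0.27, 0.18]   # stays near the base model
high_drift = [0.97, 0.02, 0.01]  # has collapsed onto one token

print(kl_divergence(low_drift, base) < kl_divergence(high_drift, base))  # True
```

The high-drift distribution has nearly eliminated two of the three options, which gives a concrete picture of lost plasticity: there is little probability mass left to shift when a later task calls for a different answer.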

How much of LLM few-shot ability comes from training data?

LLMs perform better on datasets released before their training cutoff than after, confirmed by membership inference and data inspection. On truly uncontaminated tasks, LLMs rarely beat simple baselines, suggesting few-shot learning may be largely an illusion from having seen training examples during pretraining.
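The before/after comparison can be sketched as a split of evaluation results by dataset release date relative to the model's training cutoff. All dataset names, dates, and accuracies below are made-up placeholders that show the shape of the analysis, not results from the study.

```python
from datetime import date
from statistics import mean


def split_by_cutoff(results, cutoff):
    """Return mean accuracy on datasets released before and on/after the cutoff."""
    before = [r["accuracy"] for r in results if r["released"] < cutoff]
    after = [r["accuracy"] for r in results if r["released"] >= cutoff]
    return mean(before), mean(after)


# Placeholder evaluation records; every value here is hypothetical.
results = [
    {"name": "task_a", "released": date(2020, 3, 1), "accuracy": 0.80},
    {"name": "task_b", "released": date(2021, 1, 15), "accuracy": 0.70},
    {"name": "task_c", "released": date(2022, 6, 1), "accuracy": 0.55},
    {"name": "task_d", "released": date(2023, 2, 1), "accuracy": 0.45},
]
cutoff = date(2021, 9, 1)  # hypothetical training cutoff

pre, post = split_by_cutoff(results, cutoff)
print(pre > post)  # True
```

A gap like this is only suggestive on its own, since newer tasks may also be harder; the note reports that the effect was additionally confirmed by membership inference and data inspection.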

Can smaller models outperform their LLM teachers with enough data?

Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.

Does self-generated training data improve model learning?

SEAL demonstrates that models learn better from synthetic data they generate themselves than from data created by stronger external models. Self-generated data improved QA performance from 33.5% to 47.0%, suggesting that model-specific restructuring aligns with the learner's representational needs.
