Do sample-level similarities between pretraining and downstream tasks explain the frequency effect?

This explores whether downstream performance is really driven by how often related concepts and patterns showed up in pretraining — i.e., whether the 'frequency effect' is interpolation/memorization rather than genuine generalization.

The question is whether the boost a model gets on a downstream task is mostly explained by how often similar examples appeared during pretraining, rather than by any new capability. The corpus points strongly toward yes: frequency of exposure does much of the heavy lifting, and several notes converge on this from different angles.

The most direct evidence comes from multimodal models: across 34 models and 5 datasets, zero-shot performance tracks how often a test concept appeared in the pretraining data, and models need *exponentially* more data for each linear gain on downstream tasks (see "Does multimodal zero-shot performance actually generalize or interpolate?"). That's the signature of interpolation, not generalization: the model is good at what it has seen a lot of. A parallel finding at the token level: whether a keyword gets 'primed' after a gradient update is predictable from its probability *before* learning, with a sharp ~10^-3 threshold separating contexts where learning sticks from those where it doesn't (see "Can we predict keyword priming before learning happens?"). So the pre-existing statistical footprint of a concept governs whether new training even takes hold.
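The "exponentially more data for each linear gain" relationship is what a straight line in log-frequency looks like. The sketch below fits such a line and converts its slope into a data multiplier. The function names and the numbers are made up for illustration; they are not taken from the cited study.

```python
import math

def fit_log_linear(concept_counts, accuracies):
    """Least-squares fit of accuracy = slope * log10(count) + intercept."""
    xs = [math.log10(c) for c in concept_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def frequency_multiplier(slope, gain):
    """Factor by which concept frequency must grow to add `gain` accuracy."""
    return 10.0 ** (gain / slope)

# Synthetic data (an assumption for illustration): accuracy rises by a
# fixed step each time the concept count grows tenfold.
counts = [1e2, 1e3, 1e4, 1e5, 1e6]
accuracy = [0.10, 0.20, 0.30, 0.40, 0.50]

slope, intercept = fit_log_linear(counts, accuracy)
print("accuracy gained per tenfold increase in frequency:", round(slope, 3))
print("frequency multiplier for +0.2 accuracy:", round(frequency_multiplier(slope, 0.2)))
```

On this toy data the slope is roughly 0.1 accuracy per tenfold increase, so each further fixed gain costs another multiplicative jump in exposure. That is the pattern the note reads as interpolation.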

The more surprising thread is what's actually transferring when you adapt a model. Instruction tuning, it turns out, can be done with semantically *empty or wrong* instructions and still hit nearly identical performance; what transfers is familiarity with the output space, not task understanding (see "Does instruction tuning teach task understanding or output format?"). RL post-training shows the same shape from another direction: rather than inventing behavior, it amplifies one format distribution that already dominated pretraining while suppressing the alternatives (see "Does RL training collapse format diversity in pretrained models?"). In both cases downstream gains are largely a re-weighting of distributions the model already carried.

Why similarity matters as much as raw frequency shows up in the teacher-student work: refined training data degrades a student when it falls outside that student's existing 'learning frontier,' even when the data is objectively better (see "Does teacher-refined data always improve student model performance?"). The benefit lies not in the data's quality but in its proximity to what the model already represents. That's the sample-level similarity story made mechanical.
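One way to picture a 'learning frontier' filter is to score each teacher-refined example with the student's own average log-probability and keep only examples that do not fall far below what the student assigns to its own data. This is a minimal sketch under that assumption, not the cited note's procedure. The function name, the `k` cutoff, and all scores are hypothetical.

```python
from statistics import mean, stdev

def filter_by_frontier(candidates, baseline_scores, k=2.0):
    """Keep candidates the student already finds plausible enough.

    candidates: list of (text, student_avg_logprob) pairs for refined data.
    baseline_scores: student_avg_logprob values on the student's own data.
    k: hypothetical cutoff, in standard deviations below the baseline mean.
    """
    floor = mean(baseline_scores) - k * stdev(baseline_scores)
    return [text for text, score in candidates if score >= floor]

# Made-up scores for illustration only.
baseline = [-1.0, -1.2, -0.8, -1.1, -0.9]
refined = [
    ("refinement A", -1.1),  # close to what the student already models
    ("refinement B", -3.0),  # far outside the student's profile
    ("refinement C", -1.3),  # slightly harder, still within reach
]

kept = filter_by_frontier(refined, baseline)
print(kept)
```

The design choice here is that the threshold comes from the student's statistics rather than the teacher's judgment of quality, which mirrors the point that proximity matters more than objective quality.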

The useful caveat is that frequency isn't destiny. Where capability lives matters: pretraining scale drives factual knowledge in lower layers while fine-tuning scale shifts behavior in upper layers, so the two don't simply collapse into one frequency dial (see "Do pretraining and fine-tuning scale independently in language models?"). And methods like baking chain-of-thought reasoning into pretraining itself suggest some capabilities can be *planted* rather than merely surfaced from frequency (see "Can chain-of-thought reasoning be learned during pretraining itself?"). The honest reading: the frequency effect is largely a similarity effect, with downstream wins riding on overlap with pretraining, but architecture and where you intervene leave room for genuine new capability on top.
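The claim that pretraining scale and fine-tuning scale can be varied separately can be pictured at the level of next-token distributions: take a larger base model's log-probabilities and add the shift that fine-tuning produced in a smaller model. The sketch below does that on toy logits. Whether this matches the exact procedure behind the cited note is an assumption, and the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def combine_scales(base_large, base_small, tuned_small):
    """Large-base log-probs plus the small model's fine-tuning shift.

    All three arguments are next-token logits over the same vocabulary.
    """
    lp_large = log_softmax(base_large)
    lp_small = log_softmax(base_small)
    lp_tuned = log_softmax(tuned_small)
    shifted = [bl + (ft - bs) for bl, bs, ft in zip(lp_large, lp_small, lp_tuned)]
    return log_softmax(shifted)  # renormalize into a valid distribution

# Toy three-token vocabulary; all logits are made up for illustration.
base_large = [2.0, 1.0, 0.0]   # "knowledge" from the larger pretrained model
base_small = [0.5, 0.5, 0.5]   # small model before fine-tuning
tuned_small = [0.5, 0.5, 2.5]  # small model after fine-tuning favors token 2

combined = combine_scales(base_large, base_small, tuned_small)
print([round(math.exp(lp), 3) for lp in combined])
```

If the small model's fine-tuning changes nothing, the shift is zero and the result is just the large base distribution. The behavioral change and the knowledge prior enter as separate terms.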

Sources (7 notes)

Does multimodal zero-shot performance actually generalize or interpolate?

Across 34 models and 5 datasets, multimodal models require exponentially more pretraining data for linear performance gains on downstream tasks. Performance correlates with how often test concepts appeared during pretraining, not genuine generalization ability.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions, with instruction tuning reaching 43% against 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
