Does pretraining data size matter less than base model scale for finetuning?

This explores whether, when you fine-tune a model, you get more mileage out of starting from a bigger base model than from one that was pretrained on more data — and the corpus has a surprisingly direct answer.

The most direct evidence says yes. Systematic experiments on models from 1B to 16B parameters find that fine-tuning follows a power-based multiplicative scaling law, and that a larger base model improves fine-tuning results more than additional pretraining data does, while adding more parameter-efficient-tuning (PET) parameters barely helps at all (How should finetuning scale with model and data size?). So the headline answer leans toward base scale mattering more for the fine-tuning payoff.
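To make the multiplicative form concrete, here is a minimal sketch. It assumes the loss has the shape `a / (x**alpha * d_f**beta) + e`, where `x` is the factor being scaled (base model size or pretraining tokens) and `d_f` is the fine-tuning data size. That functional form, the function names, and every coefficient below are illustrative assumptions, not values reported in the notes.

```python
def finetune_loss(x, d_f, a, alpha, beta, e):
    """Hypothetical multiplicative power law: a / (x**alpha * d_f**beta) + e."""
    return a / (x ** alpha * d_f ** beta) + e

# Made-up exponents. A larger exponent for model size than for pretraining
# data encodes the claim that base scale helps fine-tuning more.
ALPHA_MODEL = 0.20
ALPHA_PRETRAIN_DATA = 0.05
BETA = 0.10          # exponent on fine-tuning data size (assumed)
A, E = 10.0, 1.0     # scale constant and irreducible loss (assumed)
D_F = 1e5            # fixed fine-tuning set size (assumed)

def remaining_reducible_loss(alpha, factor=2.0, x0=1.0):
    """Fraction of reducible loss left after scaling x by `factor` (lower is better)."""
    before = finetune_loss(x0, D_F, A, alpha, BETA, E) - E
    after = finetune_loss(x0 * factor, D_F, A, alpha, BETA, E) - E
    return after / before

gain_model = remaining_reducible_loss(ALPHA_MODEL)
gain_data = remaining_reducible_loss(ALPHA_PRETRAIN_DATA)
print(f"after doubling model size:       {gain_model:.3f} of reducible loss remains")
print(f"after doubling pretraining data: {gain_data:.3f} of reducible loss remains")
```

Because the terms multiply, the fraction of reducible loss removed by scaling `x` does not depend on `d_f`; only the exponent on `x` matters, which is why comparing exponents is a natural way to compare the two levers under this assumed form.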

But the more interesting story is *why*, and here the corpus separates the two into different jobs. Pretraining and fine-tuning don't scale along the same axis: scaling pretraining buys factual knowledge, while scaling fine-tuning buys behavioral helpfulness. The split also has a physical home in the network, with pretraining enriching knowledge in the lower layers and fine-tuning reshaping behavior in the upper ones (Do pretraining and fine-tuning scale independently in language models?). That reframes the question. Fine-tuning isn't really *adding* capability; it's activating and steering what pretraining already laid down. LIMA makes this vivid: just 1,000 carefully curated examples on a strong base model are competitive with models tuned on orders of magnitude more alignment data, because post-training surfaces existing capabilities rather than building new ones (Can careful curation replace massive alignment datasets?).

If the base model is doing the heavy lifting, then *which* base you pick matters more than how much you pretrain or fine-tune. There's a clue about the mechanism: larger models learn rare tasks not because they can represent things smaller ones can't, but because their spare capacity weakens the gradients from common tasks that would otherwise overwrite slowly accumulated rare-task features. The advantage is less interference, not more expressivity (Why do larger models learn rare tasks better?). That's a capacity story, and it plausibly explains why a bigger base comes through fine-tuning with its knowledge more intact.

This also reframes the *risk* of fine-tuning. If knowledge lives in the lower layers, direct weight updates can corrupt it. That is why decoding-time proxy-tuning preserves pretrained knowledge better, closing 88-91% of the alignment gap while leaving base weights untouched (Can decoding-time tuning preserve knowledge better than weight fine-tuning?), and it fits with the finding that representation fine-tuning (ReFT), which intervenes on frozen hidden states, beats weight-update methods like LoRA with far fewer parameters (Can editing hidden representations beat weight updates for finetuning?). The trend across all of these: protect what the base knows, and steer lightly.
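As a rough illustration of what "decoding-time" steering means, the sketch below follows the commonly described proxy-tuning recipe: take the large base model's next-token logits and add the difference between a small tuned model's logits and the same small model's untuned logits. The function names and the toy four-token logits are hypothetical, and the notes themselves do not spell out this formula, so treat it as an assumption about the method rather than a reference implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def proxy_tuned_distribution(base_logits, tuned_small_logits, untuned_small_logits):
    """Shift the large base model's logits by the (tuned - untuned) small-model offset.

    No weights of the large base model are modified; only its output
    distribution is adjusted at decoding time.
    """
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)

# Toy next-token logits over a 4-token vocabulary (made-up numbers).
base = [2.0, 1.0, 0.5, 0.0]            # large untuned model prefers token 0
tuned_small = [0.0, 1.5, 0.0, 0.0]     # small tuned model pushes toward token 1
untuned_small = [0.0, 0.0, 0.0, 0.0]   # small untuned model is indifferent

probs = proxy_tuned_distribution(base, tuned_small, untuned_small)
best_token = max(range(len(probs)), key=lambda i: probs[i])
print("steered distribution:", [round(p, 3) for p in probs])
print("most likely token index:", best_token)
```

When the tuned and untuned small models agree, the offset is zero and the base distribution is returned unchanged, which is the sense in which this kind of steering leaves what the base knows alone.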

A few caveats are worth carrying away. Bigger-base-is-better assumes the data you tune on is *compatible* with the model being tuned: teacher-refined data that overshoots a student's learning frontier actively degrades it, even when it's objectively higher quality (Does teacher-refined data always improve student model performance?). Method can also beat raw scale on narrow skills, as when small models trained with DPO on a teacher's correct and incorrect examples reach high accuracy on function calling (Can small models match large models on function calling?). So the honest synthesis isn't "data size never matters." It's that for the *fine-tuning return on investment*, base model scale and data *quality* dominate, and raw pretraining data volume is the weakest of the three levers.
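For readers unfamiliar with how DPO uses right-and-wrong pairs, here is a minimal sketch of the standard per-pair DPO objective: the negative log-sigmoid of a scaled difference between how much the policy, relative to a frozen reference model, favors the chosen answer over the rejected one. The function name, the `beta` value, and the log-probabilities are illustrative assumptions; the notes do not give the training details of the function-calling setup.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin)))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals log(1 + exp(-z)); computed in a stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Made-up sequence log-probabilities for one (correct call, incorrect call) pair.
loss_untrained = dpo_loss(-12.0, -12.0, -12.0, -12.0)   # policy equals reference
loss_improved = dpo_loss(-8.0, -15.0, -12.0, -12.0)     # policy now favors the correct call
loss_worse = dpo_loss(-15.0, -8.0, -12.0, -12.0)        # policy favors the incorrect call

print(f"policy == reference:  {loss_untrained:.4f}")
print(f"prefers correct call: {loss_improved:.4f}")
print(f"prefers wrong call:   {loss_worse:.4f}")
```

The rejected example enters the loss directly, so the model is penalized for assigning probability to the malformed call, not just rewarded for the correct one.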

Sources 8 notes

How should finetuning scale with model and data size?

Systematic experiments across 1B–16B models reveal that finetuning follows a power-based multiplicative scaling law. Larger base models improve finetuning more than additional pretraining data does, while increasing the number of parameter-efficient tuning (PET) parameters provides minimal benefit.

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Can careful curation replace massive alignment datasets?

LIMA demonstrates that 1,000 carefully curated examples, used to fine-tune a strong pretrained model, achieve alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Why do larger models learn rare tasks better?

Larger models succeed at rare tasks not because they can represent solutions smaller models cannot, but because abundant capacity weakens gradients on common tasks, preventing them from overwriting slowly-accumulating rare-task features. Data-mixture design may be cheaper than scaling.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
