SYNTHESIS NOTE

How should finetuning scale with model and data size?

What scaling laws govern finetuning performance across model size, pretraining data, and finetuning data? Understanding these relationships could guide resource allocation in real-world tuning scenarios.

Synthesis note · 2026-06-03 · sourced from Training Fine Tuning

The inductive biases and scaling properties of finetuning methods are far less understood than those of pretraining. The study summarized here addresses that gap systematically across four factors: model size, pretraining-data size, finetuning-parameter size, and finetuning-data size. It compares full-model tuning (FMT) with parameter-efficient tuning (PET: prompt tuning and LoRA) in the data-limited regime, where model size dwarfs finetuning-data size. Three findings: (1) finetuning performance follows a power-based multiplicative joint scaling law between finetuning-data size and each other factor; (2) finetuning benefits more from scaling the LLM than from scaling pretraining data, while scaling PET parameters is generally ineffective; and (3) the optimal finetuning method is highly task- and data-dependent, with no universal winner.
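To make finding (1) concrete, here is a minimal sketch of one functional form that matches the description "power-based multiplicative joint scaling law": loss = A / (X^alpha * D_f^beta) + E. The exact parameterization, the names `A`, `alpha`, `beta`, `E`, and every numeric constant below are assumptions for illustration, not values taken from the study.

```python
def reducible_loss(x, d_f, A, alpha, beta):
    """Reducible part of an assumed joint power law: A / (x**alpha * d_f**beta).

    x   : the other scaling factor (model size, pretraining-data size,
          or PET-parameter size)
    d_f : finetuning-data size
    """
    return A / (x ** alpha * d_f ** beta)


def joint_loss(x, d_f, A, alpha, beta, E):
    """Assumed total loss: reducible part plus an irreducible floor E."""
    return reducible_loss(x, d_f, A, alpha, beta) + E


# Placeholder constants, chosen only to illustrate the shape of the law.
A, ALPHA, BETA, E = 100.0, 0.2, 0.1, 1.0

# "Multiplicative" means the two factors do not interact: doubling the
# finetuning data shrinks the reducible loss by the same ratio, 2**-BETA,
# whether the base model is small or large.
ratio_small_model = (reducible_loss(1e9, 2e4, A, ALPHA, BETA)
                     / reducible_loss(1e9, 1e4, A, ALPHA, BETA))
ratio_large_model = (reducible_loss(1.6e10, 2e4, A, ALPHA, BETA)
                     / reducible_loss(1.6e10, 1e4, A, ALPHA, BETA))
```

Under this assumed form, `ratio_small_model` and `ratio_large_model` are equal, which is the property that distinguishes a multiplicative law from an additive one, where the benefit of more finetuning data would depend on the size of the other factor.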

The practical takeaway is counterintuitive: under a fixed finetuning budget, a larger base model helps more than a base model pretrained on more data, and increasing the number of PET parameters (higher LoRA rank, longer prompts) buys little. The levers that matter are base-model scale and finetuning-data size, not adapter size.
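One way to read this advice: under a power law, doubling a factor with exponent a multiplies the reducible loss by 2^(-a), so the size of the exponent tells you how much each lever is worth. The sketch below ranks levers on that basis. The exponent values are hypothetical placeholders chosen only to mirror the ordering described in this note; they are not fitted or reported numbers.

```python
def doubling_gain(exponent):
    """Fraction of reducible loss removed by doubling a factor
    that enters the loss as factor**(-exponent)."""
    return 1.0 - 2.0 ** (-exponent)


# Hypothetical exponents (assumptions, not measurements). Their ordering
# reflects the note's qualitative claim: model scale beats pretraining-data
# scale, and PET-parameter scale contributes very little.
exponents = {
    "base-model size": 0.20,
    "pretraining-data size": 0.10,
    "PET-parameter size": 0.01,
}

# Rank levers from most to least valuable per doubling.
ranked = sorted(exponents, key=lambda name: doubling_gain(exponents[name]),
                reverse=True)

for name in ranked:
    print(f"{name}: {doubling_gain(exponents[name]):.1%} of reducible loss per doubling")
```

In practice the exponents would come from fitting the scaling law to your own task and data, since the note also stresses that the best method is task- and data-dependent.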

This note connects the vault's finetuning thread. It sits beside "Do pretraining and fine-tuning scale independently in language models?", which covers which capability each stage improves, and adds the scaling-law form and the finding that PET-parameter scaling is ineffective. That finding is relevant when choosing between the approach in "Can editing hidden representations beat weight updates for finetuning?" and weight-based PEFT.

Related concepts in this collection (2)

Do pretraining and fine-tuning scale independently in language models?
Can we decouple how model scale affects different training stages to independently improve factuality versus helpfulness? This matters for understanding whether these capabilities compete or can be optimized separately.
That note says which capability each stage improves; this one gives the scaling-law form.

Can editing hidden representations beat weight updates for finetuning?
Does intervening directly on a frozen model's representations offer a better path to parameter-efficient adaptation than current weight-based methods? This challenges the dominant PEFT paradigm by treating representations as the semantic lever instead.
PET-parameter scaling being ineffective motivates more parameter-efficient alternatives like ReFT.
