SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

How should finetuning scale with model and data size?

What scaling laws govern finetuning performance across model size, pretraining data, and finetuning data? Understanding these relationships could guide resource allocation in real-world tuning scenarios.

Synthesis note · 2026-06-03 · sourced from Training Fine Tuning

The inductive biases and scaling properties of finetuning methods are far less well understood than those of pretraining. The study summarized here addresses that gap systematically, varying model size, pretraining-data size, finetuning-parameter size, and finetuning-data size, and comparing full-model tuning (FMT) with parameter-efficient tuning (PET: prompt tuning and LoRA) in the data-limited regime, where model size dwarfs the amount of finetuning data. Three findings: (1) finetuning follows a power-based multiplicative joint scaling law between finetuning-data size and each of the other factors; (2) finetuning benefits more from scaling the model than from scaling the pretraining data, while scaling PET parameters is generally ineffective; and (3) the optimal finetuning method is highly task- and data-dependent, with no universal winner.
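
The note does not spell out the formula. As a sketch, a multiplicative joint power law of the kind described in finding (1) can be written as below. The symbols are assumptions chosen for illustration: X stands for whichever other factor is being scaled (model size, pretraining-data size, or PET-parameter size), D_f is the finetuning-data size, A, alpha, and beta are fitted constants, and E is an irreducible loss term.

```latex
\hat{L}(X, D_f) = A \cdot \frac{1}{X^{\alpha}} \cdot \frac{1}{D_f^{\beta}} + E
```

Under this form the two factors multiply rather than add, so the reducible part of the loss, \hat{L} - E, shrinks by the same ratio when X grows regardless of the value of D_f, and vice versa.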

The takeaway for practitioners is counterintuitive: under a fixed finetuning budget, a bigger base model helps more than a model pretrained on more data, and growing the number of PET parameters (higher LoRA rank, longer prompts) buys little. The levers are base-model scale and finetuning-data size, not adapter size.
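
To make the comparison between levers concrete, here is a minimal Python sketch that assumes a multiplicative power law of the form loss = a / (x^alpha * d_f^beta) + e. The function names and every coefficient are hypothetical placeholders, not values from the study. The sketch only shows how a larger exponent on one factor translates into a larger reduction when that factor is doubled.

```python
def joint_scaling_loss(x, d_f, a, alpha, beta, e):
    """Assumed multiplicative joint power law: a / (x**alpha * d_f**beta) + e.

    x     : the scaled factor (e.g. model size or pretraining-data size)
    d_f   : finetuning-data size
    a, e  : scale constant and irreducible loss (hypothetical fitted values)
    alpha : exponent for x; beta : exponent for d_f
    """
    return a / (x ** alpha * d_f ** beta) + e


def reducible_loss_ratio(scale_factor, exponent):
    """Ratio by which the reducible loss (loss - e) shrinks when one
    factor is multiplied by scale_factor, all else held fixed."""
    return scale_factor ** (-exponent)


# Placeholder exponents for illustration only (NOT results from the study).
ALPHA_MODEL_SIZE = 0.15     # hypothetical exponent for model size
ALPHA_PRETRAIN_DATA = 0.05  # hypothetical exponent for pretraining-data size

if __name__ == "__main__":
    for name, alpha in [("model size", ALPHA_MODEL_SIZE),
                        ("pretraining data", ALPHA_PRETRAIN_DATA)]:
        ratio = reducible_loss_ratio(2.0, alpha)
        print(f"doubling {name}: reducible loss multiplied by {ratio:.3f}")
```

With these placeholder exponents, doubling the factor with the larger exponent shrinks the reducible loss more. Which factor has the larger exponent in practice has to come from fitting the law to measured finetuning runs.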

This connects the vault's finetuning thread. It sits beside "Do pretraining and fine-tuning scale independently in language models?" (which covers the capability each stage improves) and adds two things: the form of the scaling law and the ineffectiveness of PET-parameter scaling. Both are relevant when choosing between the approach in "Can editing hidden representations beat weight updates for finetuning?" and weight-based PEFT.

Original note title

finetuning follows a multiplicative joint scaling law and benefits more from model scaling than pretraining-data scaling