Can we decouple what pretraining and fine-tuning each improve?
Does scaling at different training stages produce distinct capability improvements? This matters because it could reveal whether knowledge and behavioral alignment are truly separate properties we can optimize independently.
Emulated Fine-Tuning (EFT) provides a principled method for sampling from a distribution that approximates combining pretraining at one scale with fine-tuning at another. This decoupling reveals a clean split: scaling up pretraining tends to improve factuality, while scaling up fine-tuning tends to improve helpfulness.
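Concretely, the EFT paper interprets fine-tuning as learning a reward on top of the base model, which makes the two scales composable. A sketch of the factorization (notation adapted here; $N$ is the fine-tuning scale, $M$ the pretraining scale):

$$
\pi_{\text{eft}}^{N,M}(y \mid x) \;\propto\; \pi_{\text{base}}^{M}(y \mid x)\,\frac{\pi_{\text{ft}}^{N}(y \mid x)}{\pi_{\text{base}}^{N}(y \mid x)}
$$

The ratio is the behavioral delta that fine-tuning learned at scale $N$; applying it on top of the scale-$M$ base model emulates what fine-tuning at scale $M$ would produce.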
The mechanism: pretraining builds knowledge (factual storage across the parameter space), while fine-tuning shapes behavior (how that knowledge is surfaced in response to queries). These operate on different aspects of the model. As the related note "Why does reasoning training help math but hurt medical tasks?" explores, the decoupling has an architectural basis — pretraining enriches lower-layer knowledge, while fine-tuning modifies upper-layer behavior.
A special case, LM up-scaling, avoids resource-intensive fine-tuning of large pretrained models by ensembling them with small fine-tuned models — essentially emulating the result of fine-tuning the large model. This consistently improves helpfulness and factuality across the Llama, Llama-2, and Falcon families without additional training. The practical implication: you can get much of the benefit of fine-tuning a 70B model by fine-tuning a 7B model and combining the two models' next-token distributions at decoding time, as sketched below.
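A minimal sketch of the per-token combination, assuming all three models share a tokenizer and vocabulary (as the Llama sizes do). The function name is illustrative, not from any library:

```python
import torch

def eft_upscale_step(logp_base_large: torch.Tensor,
                     logp_ft_small: torch.Tensor,
                     logp_base_small: torch.Tensor) -> torch.Tensor:
    """One decoding step of EFT up-scaling.

    Each input is a [vocab_size] tensor of next-token log-probabilities,
    given the same prefix, from: the large base model, the small
    fine-tuned model, and the small base model.
    """
    # Large-scale knowledge plus the small-scale behavioral delta.
    combined = logp_base_large + (logp_ft_small - logp_base_small)
    # Renormalize so the result is a proper next-token distribution.
    return combined - torch.logsumexp(combined, dim=-1)

# Sampling loop sketch: at each step, run all three models on the same
# prefix, combine their next-token log-probs as above, sample, append,
# and repeat until an end-of-sequence token.
```

Because the combination happens per token at decoding time, no gradient updates ever touch the large model; the cost is three forward passes per step instead of one.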
EFT also enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training. This is relevant to the related note "Does preference optimization damage conversational grounding in large language models?": if helpfulness and harmlessness are adjustable at test time, the fixed trade-off imposed by RLHF may be unnecessary.
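In the same spirit, competing behavioral deltas can be reweighted at decoding time. A hedged sketch, assuming two small models fine-tuned separately for helpfulness and harmlessness (the weight scheme and names are illustrative):

```python
import torch

def eft_mix_traits_step(logp_base_large: torch.Tensor,
                        logp_help_small: torch.Tensor,
                        logp_harmless_small: torch.Tensor,
                        logp_base_small: torch.Tensor,
                        w_help: float = 0.5) -> torch.Tensor:
    """Interpolate helpfulness and harmlessness deltas at test time."""
    delta_help = logp_help_small - logp_base_small
    delta_harmless = logp_harmless_small - logp_base_small
    # w_help slides the trade-off continuously without retraining anything.
    combined = (logp_base_large
                + w_help * delta_help
                + (1.0 - w_help) * delta_harmless)
    return combined - torch.logsumexp(combined, dim=-1)
```

Sweeping `w_help` from 0 to 1 traces out the helpfulness/harmlessness frontier for a single deployed model, rather than committing to one point at training time.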
The decomposition challenges the assumption that a model's capabilities are monolithic. Factual knowledge and behavioral alignment are not only distinct — they scale differently and can be independently manipulated. This has implications for deployment: rather than training one large, fully-tuned model, a pipeline of specialized components (large pretrained for knowledge + small tuned for behavior) may be more efficient and more controllable.
Source: Training Fine Tuning
Related concepts in this collection
- Why does reasoning training help math but hurt medical tasks? Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains. Relation: provides the architectural basis for the decoupling, with knowledge in lower layers (pretraining) and behavior in upper layers (fine-tuning).
- Can decoding-time tuning preserve knowledge better than weight fine-tuning? Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals. Relation: shares the decomposition philosophy of separating knowledge from behavioral adaptation.
- Does preference optimization damage conversational grounding in large language models? Explores whether RLHF and preference optimization actively reduce the communicative acts (clarifications, acknowledgments, confirmations) that build shared understanding in dialogue, which matters for high-stakes applications like medical and emotional support. Relation: if behavioral traits are adjustable at test time, fixed alignment trade-offs may be avoidable.
- Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pretraining as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning. Relation: consistent with the view that fine-tuning surfaces existing capabilities rather than creating new ones.
- Can architecture choices improve inference efficiency without sacrificing accuracy? Standard scaling laws optimize training efficiency but ignore inference cost; this explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off accuracy under fixed training budgets. Relation: shared methodology of decomposing scaling into independent dimensions. EFT decouples pretraining scale (factuality) from fine-tuning scale (helpfulness), while conditional scaling laws decouple architecture from training compute; both reveal that treating model performance as a single scalar hides independently optimizable axes.
Original note title: scaling fine-tuning improves helpfulness while scaling pretraining improves factuality — these are decoupled training-stage effects