How much performance is lost when converting pretrained checkpoints versus training from scratch?

This reads the question as: when you adapt a pretrained model rather than build one fresh, how much of the original model's capability gets damaged in the process — and the corpus answers less by comparing against from-scratch training than by exposing the hidden costs of touching pretrained weights at all.

This explores what gets lost when you take a pretrained checkpoint and convert it (by fine-tuning, RL, or instruction-tuning) into a task-specialized model, rather than training that capability from the ground up. The corpus doesn't offer a clean from-scratch benchmark, but it tells a sharper story: the loss isn't captured by a single accuracy number; it shows up as silent corruption of knowledge the base model already had. Direct fine-tuning corrupts the knowledge stored in the model's lower layers. Decoding-time proxy-tuning never updates the base weights, and it closes 88-91% of the alignment gap while beating direct fine-tuning on knowledge tasks (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). The cost of conversion, in other words, is partly self-inflicted by the conversion method.
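A minimal sketch of why decoding-time tuning can leave the base weights untouched, assuming the usual description of proxy-tuning: the large base model's next-token logits are shifted by the difference between a small tuned model (the "expert") and the same small model before tuning (the "anti-expert"). The function names, the three-token vocabulary, and every logit value here are invented for illustration and do not come from the corpus.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, expert_logits, anti_expert_logits):
    """Shift the base model's logits by (small tuned - small untuned).

    Only logits are combined at decoding time; no model weights change.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, anti_expert_logits)
    ]
    return softmax(shifted)


# Toy 3-token vocabulary; all numbers are hypothetical.
base = [2.0, 1.0, 0.0]         # large pretrained model, never updated
expert = [0.0, 1.5, 0.0]       # small model after tuning
anti_expert = [0.0, 0.5, 0.0]  # same small model before tuning

probs = proxy_tuned_probs(base, expert, anti_expert)
unchanged = proxy_tuned_probs(base, anti_expert, anti_expert)
```

In this toy case the tuning signal raises the second token's logit by 1.0, so it ties with the first. When the expert and anti-expert agree, the base distribution is returned unchanged, which is the sense in which the base model's knowledge is left alone.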

That theme repeats across very different adaptation techniques. Representation fine-tuning makes the same bet from another angle: freeze the model's weights and learn interventions on its hidden representations instead, and you get 10-50x better parameter efficiency than LoRA while doing better on reasoning and instruction-following (Can editing hidden representations beat weight updates for finetuning?). The recurring lesson is that the pretrained checkpoint is a fragile asset, and the more invasively you overwrite it, the more of it you lose. The methods that win are the ones that leave the original intact and steer it from the outside.
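To make "intervening on representations" concrete, here is a small sketch of a low-rank intervention in the style of LoReFT, assuming the form h + Rᵀ(Wh + b − Rh), where R has orthonormal rows and only R, W, and b are trained. The helper names and all numeric values are illustrative assumptions, not the corpus's or any library's API.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def loreft(h, R, W, b):
    """Edit h only inside the subspace spanned by the rows of R.

    Computes h + R^T (W h + b - R h). The base model's weights are
    never touched; R, W and b are the only trainable parameters.
    """
    projected = matvec(R, h)                                 # R h
    target = [w + bi for w, bi in zip(matvec(W, h), b)]      # W h + b
    delta = [t - p for t, p in zip(target, projected)]
    update = matvec(transpose(R), delta)                     # back to model dim
    return [hi + ui for hi, ui in zip(h, update)]


# Hypothetical 4-dim hidden state with a rank-2 intervention.
h = [1.0, 2.0, 3.0, 4.0]
R = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]   # orthonormal rows
W = [[0.0] * 4, [0.0] * 4]
b = [10.0, 20.0]

edited = loreft(h, R, W, b)

# Trainable parameters: R and W are r x d each, plus b of size r.
num_params = 2 * len(R) * len(h) + len(b)
```

With these values the first two coordinates are overwritten by b and the last two pass through unchanged, which shows how small the edited subspace can be relative to the full hidden state.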

Reinforcement learning shows the steepest hidden losses. RL post-training doesn't broadly improve a model so much as collapse it onto a single dominant format inherited from pretraining, suppressing the other formats the base model could produce; which format wins depends on model scale, not necessarily on performance (Does RL training collapse format diversity in pretrained models?). Push RL on problems that are too hard and it's worse than wasted effort: the model learns degenerate shortcuts that contaminate capabilities it already had, so you come out behind where you started (Do overly hard RLVR samples actually harm model capabilities?). Even the reward shape matters: binary correctness rewards degrade calibration because they don't penalize confident wrong answers, which teaches the model to guess with high confidence (Does binary reward training hurt model calibration?).
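The calibration point can be checked with a few lines of arithmetic. This sketch assumes the Brier-augmented reward takes the form correctness − (confidence − correctness)², which is one natural reading of "adding the Brier score as a second reward term"; the function names and the 30% success rate are hypothetical.

```python
def expected_binary_reward(p_correct, confidence):
    """Expected reward when only correctness is scored.

    The stated confidence never enters, so overconfidence costs nothing.
    """
    return p_correct


def expected_brier_reward(p_correct, confidence):
    """Expected value of correctness - (confidence - correctness)^2."""
    brier = (p_correct * (1.0 - confidence) ** 2
             + (1.0 - p_correct) * confidence ** 2)
    return p_correct - brier


p = 0.3  # hypothetical: the model is right 30% of the time
grid = [i / 100 for i in range(101)]

# Under the binary reward, every stated confidence scores the same.
binary_spread = {expected_binary_reward(p, q) for q in grid}

# Under the Brier-augmented reward, one confidence is strictly best.
best_confidence = max(grid, key=lambda q: expected_brier_reward(p, q))
```

Under the binary reward, claiming 99% confidence earns exactly as much as claiming 30%. Once the squared-error term is added, the expected reward peaks when the stated confidence equals the true success rate.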

There's also a surprising flip side: sometimes converting a checkpoint changes far less than you'd assume. Instruction tuning, it turns out, mostly teaches a model the shape of the output space, not the task. Models trained on semantically empty or deliberately wrong instructions perform comparably to those trained on correct ones, with a reported 43% against a random baseline's 42.6% (Does instruction tuning teach task understanding or output format?). So part of what 'conversion' adds is cosmetic formatting riding on capabilities that were already latent in the pretrained weights, which is exactly why the destructive methods feel like such a bad trade.

The thing you didn't know you wanted to know: the most interesting answer to 'how much do you lose by converting?' is that the smartest practitioners route around the question entirely. Branch-Train-MiX trains domain experts in parallel and then merges their feed-forward layers into a mixture-of-experts with learned routing, getting better accuracy-efficiency tradeoffs than synchronized training (Can asynchronous expert training beat synchronized distributed LLM training?). This preserves each specialist instead of overwriting one model repeatedly. Pretrained checkpoints are valuable precisely because their internals are hard-won and easy to break, and the whole frontier of adaptation research is a search for ways to add capability without paying the conversion tax.
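A toy sketch of the merge step described above, assuming each domain expert contributes one feed-forward layer and a learned router picks the top-scoring expert per token. The two experts, the router weights, and the top-1 choice are all hypothetical stand-ins, not Branch-Train-MiX's actual configuration.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def make_ffn(W):
    """Stand-in for one expert's feed-forward layer (linear + ReLU)."""
    return lambda x: [max(0.0, y) for y in matvec(W, x)]


def moe_forward(x, experts, router_W, top_k=1):
    """Route one token to its top-k experts and mix their outputs."""
    logits = matvec(router_W, x)
    order = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)
    chosen = order[:top_k]
    # Softmax over the chosen experts only.
    m = max(logits[i] for i in chosen)
    gates = {i: math.exp(logits[i] - m) for i in chosen}
    total = sum(gates.values())
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)
        out = [o + (gates[i] / total) * yi for o, yi in zip(out, y)]
    return out, chosen


# Two separately trained experts (weights invented for illustration).
math_expert = make_ffn([[2.0, 0.0], [0.0, 2.0]])
code_expert = make_ffn([[0.0, 1.0], [1.0, 0.0]])
router_W = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical learned router

out_a, chosen_a = moe_forward([3.0, 1.0], [math_expert, code_expert], router_W)
out_b, chosen_b = moe_forward([1.0, 3.0], [math_expert, code_expert], router_W)
```

Each expert's weights stay exactly as they were trained; only the router decides which one a token sees, which is the sense in which the specialists are preserved.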

Sources (7 notes)

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on the hidden representations of a frozen model rather than updating its weights, with LoReFT (the low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions, with 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can asynchronous expert training beat synchronized distributed LLM training?

Branch-Train-MiX trains domain experts in parallel without synchronization overhead, merges their feed-forward parameters as MoE experts, and learns token-level routing, achieving better accuracy-efficiency tradeoffs than synchronized training or routing-free merging.
