Why does parameter-efficient tuning scaling fail to improve finetuning performance?

This explores why adding more trainable parameters to parameter-efficient tuning (PET) methods like LoRA doesn't reliably buy you better finetuning — and what the corpus suggests is actually doing the work instead.

The most direct answer comes from the scaling-law work: finetuning follows a *multiplicative* power law in which scaling the base model helps more than scaling pretraining data, and pushing up the number of PET parameters yields minimal returns (How should finetuning scale with model and data size?). In other words, PET capacity isn't the bottleneck; the knowledge already sitting in the pretrained weights is. You can't add adapter parameters your way into capability the base model never had.
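To make the multiplicative form concrete, here is a minimal sketch. It assumes the law has the shape L = A / (X^alpha * D_f^beta) + E, where X is the factor being scaled (base model size, pretraining data, or PET parameter count), D_f is the amount of finetuning data, and E is an irreducible loss. That functional form, the function names, and every exponent below are assumptions chosen for illustration, not values reported in the notes.

```python
def finetuning_loss(x, d_f, A, alpha, beta, E):
    """Assumed multiplicative power law: L = A / (x**alpha * d_f**beta) + E."""
    return A / (x ** alpha * d_f ** beta) + E


def reducible_shrink(alpha, factor):
    """Factor by which the reducible loss term is multiplied when the
    scaled quantity x grows by `factor` (finetuning data held fixed)."""
    return factor ** (-alpha)


# Hypothetical exponents, not fitted values.
ALPHA_MODEL = 0.2    # assumed exponent when x is base-model size
ALPHA_PET = 0.005    # assumed exponent when x is PET parameter count

if __name__ == "__main__":
    for name, alpha in [("base model", ALPHA_MODEL), ("PET params", ALPHA_PET)]:
        shrink = reducible_shrink(alpha, 16)
        print(f"16x more {name}: reducible loss multiplied by {shrink:.4f}")
```

Under these assumed exponents, a 16x increase cuts the reducible loss by roughly 43% when the exponent is 0.2 but by only about 1.4% when it is 0.005. That is the sense in which a near-zero exponent makes extra PET parameters a poor investment.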

Why would extra parameters be wasted? Because finetuning and pretraining touch different parts of the model. One line of work shows the two scale almost independently: pretraining enriches factual knowledge stored in lower layers, while finetuning mostly modifies *behavior expression* in upper layers, such as helpfulness, style, and format (Do pretraining and fine-tuning scale independently in language models?). So a bigger adapter is just a bigger knob on a layer that was only ever going to adjust surface behavior. This explains a recurring observation that finetuning makes outputs *look* right without making them right: supervised finetuning on optimization problems produces clean JSON and valid structure while the underlying solutions remain physically infeasible. The model learns the surface features of good answers, not the reasoning to construct them (Does supervised fine-tuning actually improve reasoning on optimization problems?). More parameters sharpen the costume, not the competence.

The more interesting twist is that *where* you intervene beats *how much* you tune. Representation finetuning (ReFT) edits frozen hidden representations rather than updating weights, and its low-rank variant beats LoRA across reasoning and instruction benchmarks while using 10–50× fewer parameters (Can editing hidden representations beat weight updates for finetuning?). That's the inverse of the scaling intuition: a smaller, better-placed intervention wins. The same theme runs through decoding-time methods. Proxy-tuning leaves base weights untouched and closes 88-91% of the alignment gap while actually *outperforming* direct finetuning on knowledge tasks, because direct weight updates corrupt the knowledge stored in lower layers (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). SoftCoT makes the same bet from another angle: freeze the backbone, delegate the new reasoning to a tiny auxiliary model, and you avoid catastrophic forgetting entirely (Can continuous reasoning avoid forgetting in instruction-tuned models?).
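Both weight-free interventions above reduce to small pieces of arithmetic, sketched here in plain Python. The sketch assumes the low-rank ReFT edit has the form h + R^T (W h + b - R h), with R a low-rank projection, and that proxy-tuning shifts the base model's logits by the difference between a small tuned expert and its untuned counterpart. Both forms are stated as assumptions about the cited methods, and all function and argument names are hypothetical.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def loreft_intervention(h, R, W, b):
    """Assumed low-rank representation edit: h + R^T (W h + b - R h).

    h is a frozen hidden state of length d; R and W are r x d matrices
    and b has length r, so only about 2*r*d + r values are trained.
    """
    Rh = matvec(R, h)
    Wh = matvec(W, h)
    delta = [wh + bi - rh for wh, bi, rh in zip(Wh, b, Rh)]
    # Project the r-dimensional correction back into the hidden space.
    return [
        h[j] + sum(R[i][j] * delta[i] for i in range(len(R)))
        for j in range(len(h))
    ]


def proxy_tuned_logits(base, expert, anti_expert):
    """Assumed decoding-time shift: base + (expert - anti_expert).

    The base model's weights are never updated; only its next-token
    logits are offset before sampling.
    """
    return [s + (e - a) for s, e, a in zip(base, expert, anti_expert)]


if __name__ == "__main__":
    h = [1.0, 2.0, 3.0]
    R = [[1.0, 0.0, 0.0]]   # rank-1 subspace: the first coordinate
    W = [[0.0, 0.0, 0.0]]
    b = [5.0]
    print(loreft_intervention(h, R, W, b))
    print(proxy_tuned_logits([1.0, 2.0], [0.5, 1.5], [0.0, 2.0]))
```

In the toy call, the edit overwrites only the coordinate spanned by R and leaves the rest of the hidden state alone, which illustrates how an intervention can be well placed without being large.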

There's also a sign that the *signal*, not the parameter count, decides whether finetuning sticks. DPO with explicit negative examples beats plain SFT for small models on function-calling precisely because it targets the rigid format failures SFT can't fix (Can small models match large models on function calling?). When multiple tasks are involved, structurally isolating each task's core parameters beats throwing everything into one undifferentiated finetune (Can isolating task-specific parameters prevent multi-task fine-tuning interference?). The consistent picture across all of these: finetuning performance is gated by base-model knowledge, by which layers you touch, and by the quality of your training signal, none of which a larger adapter addresses. Scaling PET parameters fails because it's optimizing the one dimension that turns out not to matter.
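The role of explicit negatives shows up directly in the DPO objective. The following is a minimal sketch of the standard per-pair DPO loss, assuming summed sequence log-probabilities are already available for the chosen and rejected outputs under both the policy and a frozen reference model. The function and argument names are hypothetical, and the numbers in the usage lines are made up.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin).

    Each argument is a sequence log-probability. The margin measures how
    much more the policy prefers the chosen output over the rejected one,
    relative to the reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(z)) equals log(1 + exp(-z)).
    return math.log1p(math.exp(-beta * margin))


if __name__ == "__main__":
    # Policy identical to the reference: no preference learned yet.
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy raises the chosen output and lowers the rejected one.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Unlike an SFT loss, which only rewards the correct output, the rejected term pushes probability away from a specific bad output, such as a malformed function call.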

Sources (8 notes)

How should finetuning scale with model and data size?

Systematic experiments across 1B–16B models reveal that finetuning follows a multiplicative power-law scaling relationship. Scaling the base model improves finetuning more than scaling pretraining data, while increasing PET parameters provides minimal benefit.

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.
