When should full-parameter post-training be used instead of LoRA adaptation?

This explores when full-parameter post-training (updating all of a model's weights) earns its much higher cost over LoRA-style lightweight adaptation (training a small add-on adapter), and the corpus has a surprisingly opinionated answer. The short version it keeps circling back to: less often than you'd think. Much of what people reach for full fine-tuning to teach turns out to be format and behavior, not new knowledge, and those are exactly what small adapters handle well.
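To make the cost gap concrete, here is a minimal arithmetic sketch comparing the trainable weights of one dense layer under full fine-tuning with those of a rank-r LoRA update, using the standard low-rank factorization where the update is the product of a d_out x r matrix and an r x d_in matrix. The layer shape and rank are hypothetical values picked for illustration, not figures from the corpus.

```python
def full_trainable_params(d_out: int, d_in: int) -> int:
    """Weights updated when a dense d_out x d_in matrix is fully fine-tuned."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights in a rank-r LoRA update B @ A, with B: d_out x r and A: r x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer shape and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 16
full = full_trainable_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")
# prints: full: 16,777,216  lora: 131,072  ratio: 128x
```

The ratio covers trainable weights for a single layer only; optimizer state and activation memory shift the real-world cost further, and this sketch does not model them.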

The sharpest data point is that a 1.5B model trained with LoRA alone matched larger full-parameter RL models on reasoning tasks (see: Can small models reason well by just learning output format?). The interpretation there is that reinforcement learning mostly teaches a model how to *organize* its output, not new facts, which suggests that reasoning and knowledge storage are separable and that the reasoning half is cheap to adapt. This lines up with a structural finding from the other direction: even when you do run full RL, it ends up modifying only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are nearly identical across random seeds (see: Does reinforcement learning update only a small fraction of parameters?). So 'full-parameter' training is, in practice, already doing something closer to targeted surgery, which weakens the case that you needed all the parameters unfrozen to begin with.
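One way to check the sparsity claim on your own checkpoints is to count how many scalar weights actually moved between the base and the trained model. The sketch below is a simplified illustration, not the measurement procedure from the cited note: it assumes weights are available as flat lists keyed by parameter name, and the `update_fraction` helper, the `tol` threshold, and the toy numbers are all hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def update_fraction(base: Weights, tuned: Weights, tol: float = 0.0) -> float:
    """Fraction of scalar parameters whose value moved by more than tol."""
    changed = 0
    total = 0
    for name, base_vals in base.items():
        tuned_vals = tuned[name]
        if len(base_vals) != len(tuned_vals):
            raise ValueError(f"shape mismatch for {name}")
        total += len(base_vals)
        changed += sum(1 for b, t in zip(base_vals, tuned_vals) if abs(t - b) > tol)
    return changed / total if total else 0.0


# Toy checkpoints with made-up numbers: 2 of 8 values differ after "training".
base = {
    "layer0.weight": [0.10, -0.20, 0.30, 0.40],
    "layer1.weight": [0.50, 0.60, -0.70, 0.80],
}
tuned = {
    "layer0.weight": [0.10, -0.25, 0.30, 0.40],
    "layer1.weight": [0.50, 0.60, -0.70, 0.90],
}
fraction = update_fraction(base, tuned)
print(f"{fraction:.0%} of parameters changed")  # prints: 25% of parameters changed
```

A count like this says nothing about rank. The note's point is that the changed entries are sparse yet nearly full-rank, so a low-rank adapter and a sparse full-parameter update are not the same object even when both touch few weights.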

The real argument *against* full fine-tuning, though, is damage. Direct fine-tuning of all weights corrupts knowledge stored in a model's lower layers. Decoding-time proxy-tuning closes 88–91% of the alignment gap while *beating* direct fine-tuning on knowledge tasks, precisely because it leaves the base weights untouched (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). Catastrophic forgetting, on this view, isn't an inherent cost of adaptation but a misallocation: route task-specific lessons into optimized prompts and keep parameter updates minimal, and you reach equivalent performance 1.4–3x faster with substantially less forgetting (see: Can splitting adaptation into two channels reduce forgetting?). There's even a method that beats LoRA itself, with fewer parameters, by tuning only the singular values of weight matrices into composable expert vectors (see: Can models dynamically activate expert skills at inference time?), pushing the frontier toward *less* invasive, not more.
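The decoding-time idea can be sketched in a few lines. This assumes the usual proxy-tuning formulation, in which the large base model's next-token logits are shifted by the difference between a small tuned model (the expert) and the same small model before tuning (the anti-expert); the note itself only describes this as a distributional shift, so treat the exact arithmetic as an assumption. The function names and logit values are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def proxy_tuned_probs(
    base: List[float], expert: List[float], anti_expert: List[float]
) -> List[float]:
    """Shift base logits by (expert - anti_expert); base weights are never modified."""
    shifted = [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]
    return softmax(shifted)


# Made-up logits over a 3-token vocabulary.
base_logits = [2.0, 1.0, 0.0]         # large untuned model
expert_logits = [0.0, 2.0, 0.0]       # small tuned model
anti_expert_logits = [0.0, 0.0, 0.0]  # same small model before tuning
probs = proxy_tuned_probs(base_logits, expert_logits, anti_expert_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(best)  # prints: 1 (the base model alone would have picked token 0)
```

Because the adjustment lives entirely in the logits at decoding time, whatever the base model stores in its weights is left as it was, which is the property the knowledge-task result depends on.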

So when *would* you go full-parameter? The corpus implies the honest answer is: when you genuinely need to reshape the model's internal knowledge or its base distribution, not just its output style. There's a quiet warning here too: full RL post-training collapses the diversity of formats a pretrained model can produce, locking onto a single dominant one within the first epoch whether or not it is the best-performing one (see: Does RL training collapse format diversity in pretrained models?). That homogenizing pressure is a cost you pay with deep training and largely avoid with isolated adapters. And if your real problem is multi-task interference, the fix isn't training harder but identifying each task's core parameters, freezing them, and merging the rest (see: Can isolating task-specific parameters prevent multi-task fine-tuning interference?).

The thing you didn't know you wanted to know: the field is steadily reframing 'full vs. LoRA' not as a power-vs-efficiency tradeoff but as a *which capability am I actually changing* question. If you're adapting behavior, reasoning format, or style, the lightweight methods now win on both cost and knowledge preservation — and full-parameter training increasingly looks like the option you choose only when you've confirmed the cheaper, less destructive routes can't reach the knowledge you need to move.

Sources (7 notes)

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88–91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.
