What happens to base model capabilities when you apply fine-tuning?

This explores what fine-tuning actually does to the abilities a model already learned in pretraining — whether it adds, sharpens, hides, or quietly damages them.

This reads the question as: when you take a capable base model and fine-tune it, do you get a strictly better model, or do you trade something away? The corpus's surprising consensus is that fine-tuning rarely *creates* raw capability; more often it reshapes how capability gets expressed, and sometimes it degrades what was already there. One useful frame splits the model in two: pretraining scale drives factual knowledge stored in lower layers, while fine-tuning scale drives behavioral helpfulness expressed in upper layers (Do pretraining and fine-tuning scale independently in language models?). So fine-tuning is largely a behavior-shaping operation layered on top of knowledge it didn't put there.

That framing explains a recurring finding about reasoning: the ability is usually already latent in the base model, and RL-style post-training mostly teaches *when* to deploy it rather than *how* to do it. Hybrid models recover 91% of the gains just by routing tokens, and reasoning activation vectors exist before any RL touches the weights (Does RL post-training create reasoning or just deploy it?). But this isn't universal: for standard reasoning tasks, RL activates what's already there, while for complex multi-step planning it can generate genuinely novel strategies the base model can't reach even with heavy sampling (Does reinforcement learning create new reasoning abilities or activate existing ones?). So whether fine-tuning adds capability depends on how far the task sits from what the base model already knows.

The darker side is that fine-tuning can actively corrode base capabilities. Fine-tuning can make a model's chain-of-thought *performative*: the reasoning steps stop causally driving the answer, so the model looks like it's thinking while the explanation has come loose from the output (Does fine-tuning disconnect reasoning steps from final answers?). RL fine-tuning can sharpen memorization and template-matching rather than install real procedures, and that apparent skill collapses on out-of-distribution variants (Do fine-tuned language models actually learn optimization procedures?). Push the training signal too hard with overly difficult RLVR samples, and models learn degenerate shortcuts that contaminate pre-existing skills (Do overly hard RLVR samples actually harm model capabilities?). Fine-tuning also narrows the model: RL converges on a single dominant pretraining format and suppresses the alternatives within the first epoch (Does RL training collapse format diversity in pretrained models?), and preference tuning shifts diversity in different directions, cutting it in code while raising it in creative writing, depending on what each domain rewards (Does preference tuning always reduce diversity the same way?).

The most direct answer to "does it forget?" is yes, and the corpus's mitigations are revealing because they all work by *touching fewer weights*. SoftCoT freezes the entire backbone and trains a tiny auxiliary model, so continuous reasoning gets added without disturbing pretrained knowledge (Can continuous reasoning avoid forgetting in instruction-tuned models?). Parameter isolation freezes the core regions each task depends on and only merges the non-core parts, beating standard multi-task fine-tuning (Can isolating task-specific parameters prevent multi-task fine-tuning interference?). Transformer² goes further, tuning only the singular values of weight matrices to produce composable expert vectors that mix at inference without interference (Can models dynamically activate expert skills at inference time?). The pattern across all three: the more you let fine-tuning rewrite the base weights, the more base capability you risk losing, so the frontier is figuring out how to add behavior while leaving the foundation intact.

The thing you might not have expected to learn: fine-tuning's failures often aren't visible as lower accuracy. A model can score the same while its reasoning has quietly detached from its answers, its format diversity has collapsed, or its apparent skill is brittle memorization that shatters the moment the test set shifts. "What happens to capabilities" isn't just a question of how much — it's about what kind of capability survives, and whether the survivor is the real thing or a convincing imitation.

Sources 11 notes

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
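
The decoupling can be made concrete with a small sketch. It assumes the usual emulated fine-tuning combination: take next-token log-probabilities from a large pretrained model, then add the shift that fine-tuning induced in a small model. The three toy distributions, the three-token vocabulary, and the function name `emulated_finetune` are hypothetical and exist only to make the arithmetic visible.

```python
import math

def log_softmax(logits):
    """Convert raw scores to normalized log-probabilities."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def emulated_finetune(base_large, base_small, tuned_small):
    """Large-model knowledge plus the small model's fine-tuning delta.

    All inputs are per-token log-probabilities over the same vocabulary.
    """
    combined = [
        bl + (ts - bs)
        for bl, bs, ts in zip(base_large, base_small, tuned_small)
    ]
    return log_softmax(combined)  # renormalize so probabilities sum to 1

# Toy vocabulary (assumption): two factual candidates and one "helpful" opener.
vocab = ["Paris", "Lyon", "Sure!"]
base_large = log_softmax([3.0, 0.5, 0.0])   # large base model: knows the fact
base_small = log_softmax([1.0, 1.0, 0.0])   # small base model: unsure of the fact
tuned_small = log_softmax([1.0, 1.0, 2.0])  # small tuned model: same facts, new behavior

eft = emulated_finetune(base_large, base_small, tuned_small)
probs = {tok: math.exp(lp) for tok, lp in zip(vocab, eft)}
print(probs)
```

In this toy setup the behavioral token gains probability, while the relative preference between the two factual candidates still comes entirely from the large base model.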

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL is applied.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
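
A rough sketch of how such an invariance check might be wired up follows. Everything in it is an assumption for illustration: the two toy "models", the string-level stand-ins for the three perturbations, and the names `invariance_rate`, `faithful_model`, and `performative_model`. A real harness would query a language model and use a real paraphraser.

```python
import re

def early_termination(steps):
    """Keep only the first half of the reasoning steps."""
    return steps[: len(steps) // 2]

def paraphrase(steps):
    """Crude stand-in for rewording; the numeric content is preserved."""
    return [s.replace("add", "then include") for s in steps]

def filler_substitution(steps):
    """Replace every step with uninformative filler of the same count."""
    return ["..." for _ in steps]

def invariance_rate(answer_fn, examples, perturb):
    """Fraction of examples whose answer is unchanged by the perturbation."""
    unchanged = sum(
        answer_fn(q, perturb(steps)) == answer_fn(q, steps)
        for q, steps in examples
    )
    return unchanged / len(examples)

def faithful_model(question, steps):
    """Toy model whose answer is computed from the reasoning text."""
    return sum(int(n) for n in re.findall(r"\d+", " ".join(steps)))

def performative_model(question, steps):
    """Toy model that ignores its reasoning and answers from the question."""
    return sum(int(n) for n in re.findall(r"\d+", question))

examples = [
    ("2 + 3 + 4", ["start with 2", "add 3", "add 4"]),
    ("10 + 5", ["start with 10", "add 5"]),
]
PERTURBATIONS = {
    "early_termination": early_termination,
    "paraphrase": paraphrase,
    "filler_substitution": filler_substitution,
}
report = {
    model.__name__: {
        name: invariance_rate(model, examples, perturb)
        for name, perturb in PERTURBATIONS.items()
    }
    for model in (faithful_model, performative_model)
}
print(report)
```

The toy model that ignores its reasoning is invariant under every perturbation, which is the signature the note describes as performative reasoning.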

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
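
The normalization effect is easy to see numerically. The sketch below assumes the common group-relative form, advantage = (reward - group mean) / group standard deviation, with binary rewards and a population standard deviation. Actual implementations differ in details such as the epsilon term or the choice of sample versus population deviation, and the group sizes here are hypothetical.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Nearly impossible problem: 1 accidental success in 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# Moderately hard problem: half of the rollouts succeed.
moderate_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
moderate_adv = group_relative_advantages(moderate_group)

print(f"lucky success on hard problem: {hard_adv[0]:.3f}")
print(f"success on moderate problem:   {moderate_adv[0]:.3f}")
```

With one success among G binary rewards, the successful rollout's advantage works out to sqrt(G - 1), so the rarer the success, the harder that single trajectory is reinforced, whether or not its reasoning was sound.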

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
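
The note does not say which diversity measure was used. One common lexical proxy is distinct-n, the share of unique n-grams across a set of samples, sketched below. The whitespace tokenization, the helper name `distinct_n`, and the toy sample sets are assumptions and are not data from the underlying study.

```python
def distinct_n(samples, n=2):
    """Unique n-grams divided by total n-grams across all samples."""
    ngrams = []
    for text in samples:
        tokens = text.split()
        ngrams.extend(
            tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)
        )
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Toy sample sets (illustrative only).
converged = ["return a + b", "return a + b"]
varied = ["the tide forgot its name", "rust blooms where rain waits"]

print(distinct_n(converged))  # repeated outputs share every bigram
print(distinct_n(varied))     # no bigram appears twice
```

Comparing this score on the same prompts before and after preference tuning, separately per domain, is one way to check which direction diversity moved.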

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
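
A minimal sketch of the singular-value idea, under stated assumptions: the weight matrix is written as U · diag(s) · Vᵀ, the factors U and V stay frozen, and an expert is just a vector z that rescales s. To stay dependency-free, the 2×2 factors are built directly from rotations instead of decomposing a real pretrained matrix. The expert vectors `z_math` and `z_code` and the 50/50 mixing weights are hypothetical, and how the real system chooses mixing weights at inference is not modeled here.

```python
import math

def rotation(theta):
    """2x2 orthogonal matrix, standing in for a frozen singular-vector basis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(m):
    return [list(row) for row in zip(*m)]

def diag(v):
    return [[v[i] if i == j else 0.0 for j in range(len(v))] for i in range(len(v))]

# Frozen factors of the "pretrained" weight: W = U diag(S) V^T.
U, V = rotation(0.3), rotation(1.1)
S = [3.0, 1.0]

def adapted_weight(z):
    """Rescale the singular values by z and leave U and V untouched."""
    scaled = [s * zi for s, zi in zip(S, z)]
    return matmul(matmul(U, diag(scaled)), transpose(V))

def mix(vectors, weights):
    """Weighted combination of expert vectors."""
    return [
        sum(w * v[i] for v, w in zip(vectors, weights))
        for i in range(len(vectors[0]))
    ]

W = adapted_weight([1.0, 1.0])  # z of all ones recovers the base weight
z_math = [1.2, 0.8]             # hypothetical expert vectors
z_code = [0.9, 1.5]
z_mixed = mix([z_math, z_code], [0.5, 0.5])
W_mixed = adapted_weight(z_mixed)
print(z_mixed, W_mixed)
```

Because the adapted weight is linear in z, mixing expert vectors is equivalent to mixing the adapted matrices, and each expert costs only as many parameters as there are singular values.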
