Does finetuning facts into weights overwrite existing model capabilities?
This explores whether writing new facts directly into a model's weights through fine-tuning damages or erases what the model already knew — and what the corpus suggests about where that damage lives and how to avoid it.
The short answer the corpus keeps circling back to: yes, in-weight fine-tuning genuinely does overwrite prior knowledge. The more interesting finding is *why*, and that fixing the symptom (cramming facts in) is often the wrong move entirely. One paper offers a formal proof that a model's capacity to memorize facts is bounded by its size, so every new fact you force into the weights competes for finite room, and the authors show directly that in-weight fine-tuning degrades general capability by overwriting what was there before (see: Can models store unlimited facts without growing larger?). Their alternative is telling: don't store the fact at all, let the model reach for a tool, and factual recall becomes unbounded without touching the weights.
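To make "competes for finite room" concrete, here is a toy sketch. It is not the paper's proof or construction. It assumes a classic linear associative memory, where every fact is written into one fixed-size matrix as an outer product, and all names in it (`store`, `recall`, `first_fact_error`) are hypothetical. The point is only that recall of an early fact gets noisier as more facts share the same parameters.

```python
import random


def make_unit_vector(dim, rng):
    """Random direction of length 1, standing in for a fact's key or value."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def store(facts, dim):
    """Write every (key, value) pair into ONE dim x dim matrix via outer products."""
    W = [[0.0] * dim for _ in range(dim)]
    for key, value in facts:
        for i in range(dim):
            for j in range(dim):
                W[i][j] += value[i] * key[j]
    return W


def recall(W, key):
    """Read a value back out by multiplying the matrix with the key."""
    return [sum(w * k for w, k in zip(row, key)) for row in W]


def first_fact_error(num_facts, dim=32, seed=0):
    """Distance between the first stored value and what recall returns for it."""
    rng = random.Random(seed)
    facts = [(make_unit_vector(dim, rng), make_unit_vector(dim, rng))
             for _ in range(num_facts)]
    W = store(facts, dim)
    key, value = facts[0]
    got = recall(W, key)
    return sum((g - v) ** 2 for g, v in zip(got, value)) ** 0.5


if __name__ == "__main__":
    # Same first fact each time; only the number of facts sharing the matrix changes.
    for n in (1, 4, 64, 256):
        print(n, round(first_fact_error(n), 3))
```

With a single fact the recall error is essentially zero. As more facts are written into the same 32 x 32 matrix, the error on that first fact tends to grow, because every later write adds interference. An external lookup table has no such coupling, which is the intuition behind reaching for a tool instead.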
The damage isn't evenly spread across the model. It's localized, and that's the key to avoiding it. Knowledge appears to live in the lower layers while behavior and style live in the upper layers, so when you fine-tune the whole stack to teach helpfulness or a new fact, you're collaterally rewriting the layers that hold what the model knows (see: Do pretraining and fine-tuning scale independently in language models?). This is why methods that *leave the weights alone* keep winning on knowledge tasks: proxy-tuning shifts the output distribution at decoding time and beats direct fine-tuning on knowledge benchmarks precisely because it never corrupts those lower-layer stores (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?), and representation fine-tuning intervenes on frozen hidden states rather than updating parameters, matching or beating LoRA with a fraction of the footprint (see: Can editing hidden representations beat weight updates for finetuning?).
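The decoding-time shift can be sketched in a few lines. This is a minimal illustration, assuming the logit-arithmetic form usually associated with proxy-tuning: the large base model's next-token logits are offset by the difference between a small tuned "expert" and its untuned "anti-expert" counterpart. The function names and the three-token vocabulary are hypothetical, and real use would take the logits from actual models at each decoding step.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """Shift the base model's logits by (expert - anti-expert), then normalize.

    No weights are updated anywhere; the adjustment exists only at decoding time.
    """
    shifted = [b + (e - a)
               for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)]
    return softmax(shifted)


if __name__ == "__main__":
    base = [2.0, 1.0, 0.0]        # large untuned model prefers token 0
    expert = [0.0, 3.0, 0.0]      # small tuned model (toy numbers)
    antiexpert = [0.0, 0.0, 0.0]  # the same small model before tuning (toy numbers)
    probs = proxy_tuned_probs(base, expert, antiexpert)
    print(max(range(len(probs)), key=probs.__getitem__))
```

When the expert and anti-expert agree, the offset is zero and the base distribution comes back unchanged. In this sketch the base model's stored knowledge is only ever read, never rewritten.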
There's a deeper reframe lurking here that you might not have gone looking for. A whole cluster of the corpus argues that post-training mostly doesn't *create* capability. It surfaces or selects what pretraining already laid down. Base models already carry latent reasoning that minimal training merely elicits (see: Do base models already contain hidden reasoning ability?), and RL post-training teaches a model *when* to deploy reasoning rather than *how* to do it (see: Does RL post-training create reasoning or just deploy it?). If capability is largely pre-existing and stored, then aggressive fine-tuning isn't adding much. It's mostly risking what's already there.
And the overwriting can get actively destructive, not just lossy. Train on the wrong signal and the new behavior doesn't sit politely beside the old. It contaminates it: overly hard RL samples teach degenerate shortcuts that spread into pre-existing capabilities (see: Do overly hard RLVR samples actually harm model capabilities?), and even ordinary RL post-training quietly collapses the format diversity a model inherited from pretraining, amplifying one dominant pattern while suppressing alternatives (see: Does RL training collapse format diversity in pretrained models?). So the worry isn't only "will my new fact crowd out an old one". It's that the *process* of updating weights can degrade capabilities that have nothing to do with what you were trying to teach.
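The source note on overly hard samples points at group-relative normalization as the mechanism: a rare accidental success ends up with an outsized advantage. A small worked example shows the arithmetic. It assumes the common form of the normalization, where each reward has the group mean subtracted and is divided by the group standard deviation. The population standard deviation and the `eps` constant are assumptions of this sketch, not details taken from the note.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumes advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Nearly impossible problem: one lucky success among eight attempts.
    hard = [1, 0, 0, 0, 0, 0, 0, 0]
    # Moderately hard problem: half of the attempts succeed.
    balanced = [1, 1, 1, 1, 0, 0, 0, 0]
    print(round(group_relative_advantages(hard)[0], 3))
    print(round(group_relative_advantages(balanced)[0], 3))
```

Under these assumptions the lone success in the hard group gets an advantage of roughly 2.6, against roughly 1.0 for each success in the balanced group. Whatever that one trajectory happened to do, including repeating an answer or skipping computation, is reinforced more than twice as strongly.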
The throughline: if you want a model to know something new, the corpus's bet is to keep it *out* of the weights — reach for tools, intervene at decoding or representation level, or accept that what you're teaching is usually deployment, not knowledge. The weights are where the model's existing competence is stored, and overwriting them is exactly the cost you're trying not to pay.
Sources (8 notes)
A formal proof and experiments show in-weight memorization is bounded by model size, while tool-use enables unbounded factual recall through a simple circuit. In-weight finetuning also degrades general capability by overwriting prior knowledge.
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) matching or outperforming LoRA on most reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.