Can we unlearn memorized text by finetuning only high-gradient weights?
This explores whether targeted unlearning is feasible — if memorized text lives in a specific, locatable set of weights, can we erase it by fine-tuning only those high-gradient parameters rather than retraining the whole model?
The corpus says the first half of the premise holds up surprisingly well. When a model memorizes a paragraph verbatim, it leaves a distinctive fingerprint: larger gradients concentrated in lower layers, plus a specific low-layer attention head that attends mainly to rare tokens (see "Where does a model store memorized paragraphs?"). That is exactly the localization an unlearning method would want: memorization isn't smeared evenly across the network but pools in a few identifiable places, which makes it targetable.
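As a rough illustration of what "high-gradient targeting" could mean in practice, the sketch below ranks the weights of a toy linear model by gradient magnitude on a single example and keeps the top fraction as an edit mask. The toy model, the squared-error loss, the numbers, and the helper names `grad_squared_error` and `top_k_mask` are all assumptions made for illustration; this is not the procedure used in the cited GPT-Neo analysis.

```python
def grad_squared_error(weights, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to each weight."""
    pred = sum(w * xi for w, xi in zip(weights, x))
    err = pred - y
    return [err * xi for xi in x]


def top_k_mask(grads, fraction):
    """Return a 0/1 mask selecting the weights with the largest |gradient|."""
    k = max(1, int(round(fraction * len(grads))))
    ranked = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(grads))]


# Hypothetical toy setup: 6 weights, and an input that mostly
# activates features 1 and 4.
weights = [0.2, 1.5, -0.3, 0.1, 2.0, 0.0]
example_x = [0.0, 1.0, 0.1, 0.0, 1.0, 0.05]
example_y = 3.0

grads = grad_squared_error(weights, example_x, example_y)
mask = top_k_mask(grads, fraction=0.33)
print(mask)  # weights 1 and 4 are selected
```

In a real model the same idea would apply per parameter tensor or per layer rather than per scalar weight, and the choice of `fraction` is a tuning decision the sources here do not specify.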
The tension is that lower layers are also where general knowledge is stored, and that is where things get risky. Work on proxy-tuning found that direct fine-tuning corrupts knowledge storage in lower layers specifically, whereas leaving the base weights frozen and steering only at decoding time preserves that knowledge far better (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). So the very region you'd surgically edit to remove a memorized passage is the region most prone to collateral damage. Aggressively fine-tuning high-gradient low-layer weights to forget one paragraph could quietly degrade unrelated capabilities.
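The collateral-damage worry can be made concrete with a toy case. In the sketch below, one weight serves both the passage to forget and an unrelated "retain" example; fine-tuning only that weight toward a neutral target suppresses the forget output but also raises the retain loss. The linear model, the neutral-target objective, the fixed mask, and every name here are hypothetical choices for illustration, not a method taken from the cited work.

```python
def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def loss(weights, x, y):
    return 0.5 * (predict(weights, x) - y) ** 2


def masked_descent_step(weights, x, target, mask, lr):
    """One gradient step toward `target`, updating only weights where mask == 1."""
    err = predict(weights, x) - target
    return [w - lr * err * xi * m for w, xi, m in zip(weights, x, mask)]


# Hypothetical toy model: weight 1 is shared by the passage to forget
# and by an unrelated "retain" example.
weights = [1.0, 2.0, 0.5]
forget_x, neutral_target = [0.0, 1.0, 0.5], 0.0  # model currently outputs 2.25
retain_x, retain_y = [1.0, 1.0, 0.0], 3.0        # model currently outputs 3.0
mask = [0, 1, 0]  # assumed: weight 1 has the largest gradient on forget_x

retain_loss_before = loss(weights, retain_x, retain_y)
edited = weights
for _ in range(5):
    edited = masked_descent_step(edited, forget_x, neutral_target, mask, lr=0.5)
retain_loss_after = loss(edited, retain_x, retain_y)

print(predict(edited, forget_x))              # forget output shrinks toward 0
print(retain_loss_before, retain_loss_after)  # retain loss rises from 0
```

Tracking a retain-set loss alongside the forget objective, as done here, is one simple way to notice this kind of overlap before it shows up as degraded capabilities.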
This is why the most promising unlearning approaches may not touch weights at all. Representation fine-tuning (ReFT) intervenes on frozen hidden representations instead of updating parameters, matching or beating weight-based methods like LoRA with 10–50x fewer parameters (see "Can editing hidden representations beat weight updates for finetuning?"). The same logic shows up in research on why models ignore their context: textual prompting alone can't override a strong learned association, and only causal intervention in the representations does the job (see "Why do language models ignore information in their context?"). If suppressing a strong prior requires representation-level surgery rather than re-weighting, the same may be true for erasing one.
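To show what "intervening on a frozen representation" can look like, here is a minimal sketch of a low-rank intervention of the form h + Rᵀ(Wh + b − Rh), which is how the LoReFT variant is usually written; treat the exact parameterization as an assumption and check it against the ReFT paper. The matrices below are hand-picked rather than learned, and `loreft_intervene` is a hypothetical helper, not the API of any ReFT library.

```python
def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def loreft_intervene(h, R, W, b):
    """Return h + R^T (W h + b - R h) for a single hidden vector h."""
    Rh = matvec(R, h)
    Wh = matvec(W, h)
    delta = [wh + bi - rh for wh, bi, rh in zip(Wh, b, Rh)]
    # Map the r-dimensional correction back into the d-dimensional space.
    correction = [
        sum(R[k][j] * delta[k] for k in range(len(R))) for j in range(len(h))
    ]
    return [hj + cj for hj, cj in zip(h, correction)]


# Hand-picked (not learned) parameters: a rank-1 subspace along the first
# coordinate, with W = 0 and b = 0, so that component is overwritten with 0.
R = [[1.0, 0.0, 0.0]]
W = [[0.0, 0.0, 0.0]]
b = [0.0]

h = [2.0, -1.0, 0.5]
steered = loreft_intervene(h, R, W, b)
print(steered)  # [0.0, -1.0, 0.5]
```

The base model's weights never change in this scheme; only the small set of intervention parameters (R, W, b) would be trained, which is why the approach avoids editing the low-layer weights directly.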
The deeper catch is whether "the memorized text" is even confined to the weights you'd edit. Models can reconstruct censored or never-stated information by piecing together implicit hints scattered across training data (see "Can LLMs reconstruct censored knowledge from scattered training hints?"). So even if you cleanly zero out the high-gradient weights holding a verbatim passage, the model might re-derive its content from distributed traces elsewhere. Training dynamics are also stranger than monotonic forgetting suggests: networks trained on cyclic data show anticipatory recovery, restoring "forgotten" documents before re-encountering them (see "Do networks recover from forgetting before re-encountering documents?"), a hint that forgetting in these systems is not a stable one-way street.
The honest answer: yes, memorization is localizable enough to make high-gradient targeting a real strategy, and that is the genuinely encouraging finding here. But "finetuning only high-gradient weights" inherits two problems the corpus flags clearly: those weights overlap with general knowledge storage, and the memorized content may not be fully contained in them. The frontier is shifting toward representation-level intervention precisely because weight editing is blunter than the localization picture first makes it look.
Sources (6 notes)
Memorized paragraphs leave a distinctive fingerprint in GPT-Neo: larger gradients in lower layers, concentration in a specific low-layer attention head attending to rare tokens, and dependence on a few early-prefix tokens. This localization makes memorization targetable for unlearning.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Language models perform out-of-context reasoning across the full training distribution, reconstructing information never explicitly stated in any single document. Experiments show models can infer city identities from scattered distance relationships and apply them downstream without in-context learning.
Language models finetuned on cyclically repeated documents exhibit anticipatory recovery—restoring performance on a document before encountering it again—a phenomenon that emerges and strengthens with model scale, contradicting monotonic catastrophic interference.