How do layer-wise versus parameter-wise merging strategies affect information retention?

This reads the question as asking where in a model knowledge actually lives, and whether editing weights directly (parameter-wise) preserves less than working at the level of specific layers or separate channels. The corpus has no papers on model merging per se, but it does have sharp findings on where knowledge is stored and which kinds of edits corrupt it.

This explores whether *where* you intervene in a model — at the level of raw parameters versus specific layers or separate adaptation channels — determines how much knowledge survives the edit. Worth saying up front: the collection doesn't contain papers on model-merging strategies in the weight-averaging sense the term usually implies. But it has something more useful for the underlying question — direct evidence that knowledge is *not* stored uniformly across a model, which is exactly why the layer-versus-parameter distinction matters at all.

The clearest signal comes from proxy-tuning, which found that direct weight fine-tuning corrupts knowledge storage specifically in a model's *lower* layers, while a decoding-time approach that leaves base weights untouched preserves that knowledge and shifts mainly reasoning and style (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). That's the layer-wise insight in disguise: factual knowledge concentrates in particular layers, so any merging or tuning that overwrites those layers wholesale will bleed information that a more surgical, channel-separated intervention keeps intact. Retention isn't a property of *how much* you change; it's a property of *what you change*.
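The decoding-time idea can be made concrete with a minimal sketch. It assumes the commonly described form of proxy-tuning, in which the base model's next-token logits are shifted by the difference between a small tuned "expert" and its untuned "anti-expert" counterpart. The function names and toy logit values below are hypothetical, chosen only for illustration; nothing here is taken from the notes beyond the fact that base weights stay untouched.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_distribution(base_logits, expert_logits, anti_expert_logits):
    """Shift base logits by (expert - anti-expert) at decoding time.

    The base model's weights are never modified; only its output
    distribution for this step is adjusted.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, anti_expert_logits)
    ]
    return softmax(shifted)

# Toy three-token vocabulary (hypothetical values).
base = [2.0, 1.0, 0.5]          # large base model
expert = [0.5, 1.5, 0.0]        # small tuned model
anti_expert = [0.5, 0.5, 0.0]   # small untuned model

probs = proxy_tuned_distribution(base, expert, anti_expert)
```

In this toy case the tuning signal raises only the second token's logit, so it ends up tied with the first, while the base logits themselves remain exactly as they were.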

That reframing, forgetting as misallocation rather than inevitable cost, is made explicit by the slow-weights/fast-context split, which routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster and with substantially less catastrophic forgetting (see "Can splitting adaptation into two channels reduce forgetting?"). The same logic runs through Wide & Deep, where memorization and generalization live in separate components that are trained together, so each can specialize without trampling the other (see "Can one model memorize and generalize better than two?"). In both cases the lesson carries over to the merging question: keep distinct kinds of information in distinct places, and combining or updating one doesn't destroy the other.
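As a rough illustration of the Wide & Deep separation, the sketch below sums a logit from a wide component (sparse cross-product features) and a logit from a deep component (dense embedding features) before a single sigmoid. This is a simplification under stated assumptions: the deep part is reduced to one linear layer over a fixed embedding, and all feature names and weights are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def wide_logit(active_crosses, wide_weights):
    """Memorization path: one weight per active cross-product feature."""
    return sum(wide_weights.get(name, 0.0) for name in active_crosses)

def deep_logit(embedding, deep_weights):
    """Generalization path: linear readout of a dense embedding
    (a stand-in for the final layer of a deeper network)."""
    return sum(e * w for e, w in zip(embedding, deep_weights))

def wide_and_deep_predict(active_crosses, embedding,
                          wide_weights, deep_weights, bias=0.0):
    """Both paths contribute to one logit, so they can be trained jointly."""
    z = wide_logit(active_crosses, wide_weights)
    z += deep_logit(embedding, deep_weights)
    return sigmoid(z + bias)

# Hypothetical example: one memorized rare cross feature, neutral deep path.
wide_weights = {"query=fried_chicken AND item=chicken_and_waffles": 2.0}
deep_weights = [0.0, 0.0]
embedding = [0.3, -0.1]

p_seen = wide_and_deep_predict(
    {"query=fried_chicken AND item=chicken_and_waffles"},
    embedding, wide_weights, deep_weights)
p_unseen = wide_and_deep_predict(
    {"query=tea AND item=scone"}, embedding, wide_weights, deep_weights)
```

Because each path owns its own parameters, updating the wide weights for a rare item leaves the deep weights unchanged, which is the separation the paragraph above describes.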

The collection also warns about the failure mode you'd expect from naive blending. Continuous reprocessing in COMEDY-style single-model memory compression follows an inverted-U curve: past a point, consolidating everything into one representation degrades performance *below* having no memory at all, through misgrouping, context loss, and overfitting (see "Can a single model replace retrieval for long-term conversation memory?"). Autonomous memory folding avoids that fate by consolidating into *structured, separated* schemas rather than one undifferentiated blob (see "Can agents compress their own memory without losing critical details?"). Crude fusion loses information; structure-preserving fusion doesn't.

So the thread the corpus actually offers: information retention tracks *structural respect* far more than the parameter-versus-layer label itself. Interventions that honor where knowledge already lives — specific layers, separate channels, distinct schemas — retain it; interventions that flatten everything into one parameter space tend to corrupt the lower-layer factual store first. If you came wanting weight-merging benchmarks, the collection won't give them — but it will tell you why the distinction you're asking about exists in the first place.
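To make the distinction in the question concrete, here is a minimal sketch contrasting a uniform (parameter-wise) interpolation with a layer-wise one that protects a lower layer. It is an illustration only, not a method drawn from the notes: models are assumed to be plain dicts mapping hypothetical layer names to flat weight lists, and the mixing coefficients are arbitrary.

```python
def interpolate(base_w, tuned_w, alpha):
    """Linear interpolation: alpha=0 keeps base, alpha=1 takes tuned."""
    return [(1 - alpha) * b + alpha * t for b, t in zip(base_w, tuned_w)]

def merge_uniform(base, tuned, alpha):
    """Parameter-wise: every weight gets the same mixing coefficient."""
    return {name: interpolate(base[name], tuned[name], alpha)
            for name in base}

def merge_layerwise(base, tuned, alphas, default_alpha=0.5):
    """Layer-wise: each layer gets its own coefficient, so layers
    assumed to hold factual knowledge can be left untouched."""
    return {name: interpolate(base[name], tuned[name],
                              alphas.get(name, default_alpha))
            for name in base}

# Hypothetical two-layer models.
base = {"layer0": [1.0, 2.0], "layer1": [3.0, 4.0]}
tuned = {"layer0": [3.0, 0.0], "layer1": [5.0, 8.0]}

uniform = merge_uniform(base, tuned, alpha=0.5)
layerwise = merge_layerwise(base, tuned, {"layer0": 0.0, "layer1": 1.0})
```

Here the uniform merge moves `layer0` halfway toward the tuned weights, while the layer-wise merge keeps `layer0` identical to the base and takes `layer1` entirely from the tuned model. Whether protecting lower layers actually preserves knowledge in a given model is an empirical question this sketch does not answer.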

Sources 5 notes

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can one model memorize and generalize better than two?

Wide & Deep models train memorization (cross-product features) and generalization (embeddings) together, allowing each component to specialize: the wide part becomes small because deep handles common cases, and deep doesn't overfit rare items because wide captures them. Ensembling requires both halves full-size.

Can a single model replace retrieval for long-term conversation memory?

COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows continuous reprocessing follows an inverted-U curve, degrading below no-memory baseline due to misgrouping, context loss, and overfitting.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
