Do networks recover from forgetting before re-encountering documents?

When language models train cyclically on repeated documents, do they anticipate upcoming material and recover from forgetting in advance? If they do, that challenges the standard catastrophic-interference narrative about sequential training.

Synthesis note · 2026-06-03 · sourced from Knowledge Graphs

The default story of sequential training is catastrophic interference: forgetting of earlier documents grows monotonically as a network trains on a sequence of different documents. This paper studies a structured non-IID setting, in which documents are presented cyclically in a fixed, repeated order, and finds the opposite phenomenon: anticipatory recovery. Networks recover from forgetting a document before they encounter it again in the cycle, as if pre-positioning themselves for what is coming. The effect emerges and becomes more robust as the parameter count grows, and it appears only when each document is well fitted before training moves on to the next. Visualizations of weights, activations, and gradients show clear temporal structure.
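To make the setup concrete, here is a minimal sketch of the fixed-order cyclic schedule and one possible way to quantify recovery from a logged loss trace. The names `cyclic_schedule` and `recovery_score` and the scoring formula are illustrative assumptions, not the paper's code or metric. The sketch also assumes that the loss on a probe document has already been recorded after every training step between two consecutive visits to that document.

```python
from typing import List, Sequence


def cyclic_schedule(num_docs: int, num_cycles: int) -> List[int]:
    """Fixed-order cyclic schedule: 0, 1, ..., num_docs - 1, repeated."""
    return [doc for _ in range(num_cycles) for doc in range(num_docs)]


def recovery_score(losses_between_visits: Sequence[float]) -> float:
    """Fraction of forgetting undone before the probe document is seen again.

    Hypothetical metric (an assumption, not taken from the paper).
    losses_between_visits[0] is the probe loss right after training on the
    probe document; the last entry is the probe loss just before the next
    visit to it.
    """
    start = losses_between_visits[0]
    peak = max(losses_between_visits)
    end = losses_between_visits[-1]
    forgotten = peak - start
    if forgotten <= 0:
        # The loss never rose above its starting value: nothing to recover.
        return 0.0
    return (peak - end) / forgotten


# Monotone forgetting (the catastrophic-interference picture) scores 0.0.
monotone = recovery_score([1.0, 2.0, 3.0, 3.0])

# A trace that peaks mid-cycle and then falls before the next visit scores > 0.
anticipatory = recovery_score([1.0, 3.0, 2.0])
```

Under this assumed metric, a score near 0 matches monotone forgetting, while a score near 1 means the loss has returned almost to its post-training level before the document reappears.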

The takeaway is that over-parameterized networks in structured, repeating environments do not follow the catastrophic-interference picture: they exploit the temporal regularity of the training schedule to organize their weights in anticipation of upcoming documents. This setting is closer to how humans learn from structured, repeating material than the random-sampling default of LLM pretraining is.

This adds a training-dynamics surprise to the vault. It connects to the broader theme that structure in the learning process matters, alongside "Does teaching question patterns before document training improve knowledge access?" (order of encoding shapes outcomes) and "Is LLM forgetting really knowledge loss or alignment loss?" (forgetting is often recoverable rather than destructive). Both complicate the simple catastrophic-forgetting narrative.

Related concepts in this collection 2

Does teaching question patterns before document training improve knowledge access? Standard LLM training encodes documents first, then teaches QA patterns. But does this order matter? Exploring whether reversing the sequence—teaching how knowledge gets queried before encoding it—could unlock better factual recall.
both show training *structure/order* shapes what the network learns and retains
Is LLM forgetting really knowledge loss or alignment loss? When language models appear to forget old knowledge after learning new tasks, is the underlying knowledge actually gone, or has the model simply lost the ability to activate it? This distinction matters for understanding how fragile safety training really is.
both complicate the catastrophic-forgetting narrative; forgetting is structured and often recoverable
