Do networks recover from forgetting before re-encountering documents?
When language models train cyclically on repeated documents, do they anticipate upcoming material and recover from forgetting in advance? This challenges the standard catastrophic-interference narrative about sequential training.
The default story of sequential training is catastrophic interference: forgetting increases monotonically as a network trains on a sequence of different documents. This paper studies a structured non-IID setting, in which documents are presented cyclically in a fixed, repeated order, and finds the opposite pattern: anticipatory recovery. Networks recover from forgetting a document before they encounter it again in the cycle, as if pre-positioning themselves for what is coming. The effect emerges and becomes more robust as the parameter count scales up, and it appears only when each document is well fitted before training moves on to the next. Visualizations of weights, activations, and gradients show clear temporal structure.
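The cyclic setup and the idea of recovering before re-encounter can be made concrete with a small measurement sketch. Everything named here is an illustrative assumption rather than the paper's code or metric: `cyclic_schedule`, `probe_trace`, and `recovery_score` are hypothetical helpers, and `train_step` and `eval_loss` stand in for whatever fitting and evaluation routines a real experiment would supply.

```python
from typing import Callable, List, Sequence


def cyclic_schedule(num_docs: int, num_epochs: int) -> List[int]:
    """Fixed-order, repeated presentation: 0, 1, ..., n-1, 0, 1, ..."""
    return [doc for _ in range(num_epochs) for doc in range(num_docs)]


def probe_trace(
    num_docs: int,
    probe: int,
    train_step: Callable[[int], None],   # hypothetical: fits the model to one document
    eval_loss: Callable[[int], float],   # hypothetical: returns loss on one document
) -> List[float]:
    """Run one full cycle starting at the probe document.

    The probe loss is recorded after every training step, so the first
    entry is taken right after fitting the probe and the last entry is
    taken right before the probe would be trained on again.
    """
    order = [(probe + i) % num_docs for i in range(num_docs)]
    trace = []
    for doc in order:
        train_step(doc)
        trace.append(eval_loss(probe))
    return trace


def recovery_score(probe_losses: Sequence[float]) -> float:
    """Fraction of peak forgetting undone before the re-encounter.

    Illustrative definition (an assumption, not taken from the paper):
    0.0 means the loss was still at its peak at the end of the cycle,
    as monotonic forgetting predicts; values above 0.0 mean the loss
    fell back toward its starting level before the document returned.
    """
    if len(probe_losses) < 2:
        raise ValueError("need at least two loss measurements")
    start, peak, end = probe_losses[0], max(probe_losses), probe_losses[-1]
    forgotten = peak - start
    if forgotten <= 0:
        return 0.0  # nothing was forgotten, so there is nothing to recover
    return (peak - end) / forgotten
```

Under this definition, a monotonically rising trace such as `[0.5, 1.0, 2.0, 2.5]` scores 0.0, while a trace that peaks mid-cycle and falls again, such as `[0.5, 1.5, 2.5, 1.5]`, scores 0.5. The sketch only describes how such a trace could be summarized; it does not by itself reproduce the effect, which the note says depends on model scale and on fitting each document well.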
The takeaway is that over-parameterized networks in structured, repeating environments do not follow the catastrophic-interference picture: they exploit the temporal regularity of the training schedule to organize their weights in anticipation of upcoming documents. This is closer to how humans learn from structured, repeating material than to the random-sampling default of LLM pretraining.
This adds a training-dynamics surprise to the vault. It connects to the broader theme that structure in the learning process matters, alongside "Does teaching question patterns before document training improve knowledge access?" (order of encoding shapes outcomes) and "Is LLM forgetting really knowledge loss or alignment loss?" (forgetting is often recoverable rather than destructive). Both complicate the simple catastrophic-forgetting narrative.
Inquiring lines that use this note as a source (7)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What makes knowledge seeding equivalent to hippocampal replay in the brain?
- How does training order affect knowledge acquisition in language models?
- Is forgetting in language models reversible or permanent knowledge loss?
- How do weight visualizations reveal temporal structure in cyclic training?
- Can training order and structure shape what networks retain and learn?
- Can we unlearn memorized text by finetuning only high-gradient weights?
- Can document repetition accidentally memorize sensitive information instead of learning?
Related concepts in this collection (2)
- Does teaching question patterns before document training improve knowledge access?
  Standard LLM training encodes documents first, then teaches QA patterns. But does this order matter? Exploring whether reversing the sequence (teaching how knowledge gets queried before encoding it) could unlock better factual recall.
  both show training *structure/order* shapes what the network learns and retains
- Is LLM forgetting really knowledge loss or alignment loss?
  When language models appear to forget old knowledge after learning new tasks, is the underlying knowledge actually gone, or has the model simply lost the ability to activate it? This distinction matters for understanding how fragile safety training really is.
  both complicate the catastrophic-forgetting narrative; forgetting is structured and often recoverable
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training
- Spurious Forgetting in Continual Learning of Language Models
- Schema-learning and rebinding as mechanisms of in-context learning and emergence
- Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
- Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories
- A Mechanistic Analysis of Looped Reasoning Language Models
- How new data permeates LLM knowledge and how to dilute it
- Using Computational Models to Test Syntactic Learnability
Original note title
networks trained on cyclically repeated documents anticipate and recover from forgetting before re-encountering them and this emerges with scale