How do newly learned facts become accessible after gradient updates?
This explores what actually happens inside a model when fine-tuning writes a new fact in — where it lands, whether it can be recalled cleanly, and why the alternatives (editing activations, decoding-time tricks, external tools) often beat gradient updates at making knowledge usable.
The corpus's most surprising answer is that gradient updates may be the wrong place to look for stored facts at all. One line of work argues that transformers don't keep knowledge in tidy retrievable slots; the residual stream *transmits* knowledge as a flow of activations during generation rather than warehousing it [Do transformer models store knowledge or generate it continuously?]. If that's right, a newly learned fact becomes 'accessible' only when it can re-enter that flow at the right moment, which is exactly why edited facts are notoriously brittle and context-dependent.
That reframing explains a cluster of findings about the *cost* of writing facts in by weight update. In-weight memorization is provably bounded by model size, and pushing facts in through fine-tuning overwrites prior knowledge and degrades general capability, so a new fact often becomes accessible only by quietly displacing old ones [Can models store unlimited facts without growing larger?]. Decoding-time work sharpens the point: direct fine-tuning corrupts knowledge storage in the *lower* layers, whereas proxy-tuning, which leaves base weights untouched, preserves factual recall while still shifting reasoning and style [Can decoding-time tuning preserve knowledge better than weight fine-tuning?]. The lesson is that the layers where facts live and the layers where gradient updates do their damage overlap badly.
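To make the decoding-time idea concrete, here is a minimal sketch of the logit-offset arithmetic usually associated with proxy-tuning. It assumes three models sharing one vocabulary: a large untouched base model, a small tuned "expert", and the same small model untuned (the "anti-expert"). The function names and the toy three-token vocabulary are illustrative assumptions, not taken from the notes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_distribution(base_logits, expert_logits, antiexpert_logits):
    """Shift the base model's logits by (tuned small model - untuned small model).

    The base model's weights are never modified; only its next-token
    distribution is adjusted at decoding time.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)

# Toy example (hypothetical numbers): the small tuned model prefers token 1
# more than its untuned counterpart does, so token 1 gains probability.
base = [2.0, 1.0, 0.0]
expert = [0.0, 1.5, 0.0]
anti = [0.0, 0.5, 0.0]
probs = proxy_tuned_distribution(base, expert, anti)
```

When the expert and anti-expert agree, the offset is zero and the base distribution comes back unchanged, which is one way to see why factual recall stored in the base weights is left alone.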
So a lot of the corpus is about making facts accessible *without* paying that price. Representation fine-tuning leaves weights frozen and instead learns a small intervention on hidden activations, steering the flow rather than rewriting the store, and gets 10–50× better parameter efficiency than LoRA [Can editing hidden representations beat weight updates for finetuning?]. Tool use goes further, decoupling factual recall from parameters entirely so new facts live in an external store and are retrieved through a simple learned circuit [Can models store unlimited facts without growing larger?]. Bidirectional RAG makes this dynamic: a generated answer becomes a newly accessible 'fact' only after it passes entailment, attribution, and novelty checks before being written back to the corpus, so accessibility is gated by verification rather than by gradient descent [Can RAG systems safely learn from their own generated answers?].
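The representation-editing idea can be sketched in a few lines. This assumes the low-rank form commonly given for LoReFT, h + Rᵀ(Wh + b − Rh), where R is an r×d projection with orthonormal rows and W, b are learned; the helper names and toy values are hypothetical, and training of R, W, and b is omitted.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def loreft_intervention(h, R, W, b):
    """Edit hidden state h inside the r-dimensional subspace spanned by R's rows.

    R: r x d projection (rows assumed orthonormal)
    W: r x d learned linear map, b: length-r learned bias
    Returns h + R^T (W h + b - R h); the model's own weights stay frozen.
    """
    target = [wh + bi for wh, bi in zip(matvec(W, h), b)]  # desired subspace value
    current = matvec(R, h)                                  # present subspace value
    delta = [t - c for t, c in zip(target, current)]
    # Map the r-dimensional correction back to d dimensions via R^T.
    update = [
        sum(R[k][j] * delta[k] for k in range(len(R)))
        for j in range(len(h))
    ]
    return [hj + uj for hj, uj in zip(h, update)]

# Toy example: d = 3, r = 1. The subspace is the first coordinate, and the
# intervention sets it to the bias value 5.0 while leaving the rest alone.
h = [1.0, 2.0, 3.0]
R = [[1.0, 0.0, 0.0]]
W = [[0.0, 0.0, 0.0]]
b = [5.0]
edited = loreft_intervention(h, R, W, b)
```

Only R, W, and b carry learned parameters here, which is where the parameter-efficiency comparison with LoRA comes from.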
When you *do* update weights, the corpus offers a mechanistic picture of how the change is shaped. RL-style updates touch only 5–30% of parameters, and those updates are sparse but nearly full-rank and remarkably consistent across random seeds, meaning the model has structural 'places' it puts new learning rather than scattering it arbitrarily [Does reinforcement learning update only a small fraction of parameters?]. Whether that newly written capability stays *usable* afterward depends on drift: staying close to the base distribution (low KL drift) preserves the plasticity needed to keep learning, while parameter-only updates stall when the domain shifts [Does staying close to the base model preserve learning ability?]. And the data you train on matters as much as the mechanism: gradient-similarity selection shows that a well-chosen 5% slice of examples installs a target skill better than the full set, because some training data actively pushes the model's reasoning strategy away from what the target task requires [Can we train better models on less data?].
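The sparsity and seed-consistency claims can be checked with simple bookkeeping over flattened parameter vectors. The sketch below is an illustration under stated assumptions: the tolerance `tol` and the use of Jaccard overlap as the consistency measure are choices made here, not necessarily the metrics used in the cited work, and the ten-parameter "model" is a toy.

```python
def changed_indices(before, after, tol=1e-8):
    """Indices of parameters whose value moved by more than tol."""
    return {
        i for i, (b, a) in enumerate(zip(before, after))
        if abs(a - b) > tol
    }

def update_fraction(before, after, tol=1e-8):
    """Fraction of parameters touched by the update."""
    return len(changed_indices(before, after, tol)) / len(before)

def jaccard_overlap(mask_a, mask_b):
    """Intersection over union of two sets of changed indices."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 1.0

# Toy example: ten parameters, two runs with different (hypothetical) seeds.
base = [0.0] * 10
seed_1 = [0.3, -0.2] + [0.0] * 8
seed_2 = [0.1, 0.4, -0.5] + [0.0] * 7

fraction_1 = update_fraction(base, seed_1)
overlap = jaccard_overlap(
    changed_indices(base, seed_1),
    changed_indices(base, seed_2),
)
```

A high overlap between runs is what would support the reading that the updated parameters are chosen structurally rather than at random; checking that the update is nearly full-rank would need a separate matrix-rank computation per weight matrix, not shown here.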
The thing you didn't know you wanted to know: there's a quieter route to making facts accessible that bypasses persistent updates entirely. Context-engineering work treats the prompt itself as an evolving 'playbook', in which newly learned material is curated incrementally so it stays retrievable across iterations without the brevity bias and collapse that compression causes [Can context playbooks prevent knowledge loss during iteration?]. Read together, the corpus suggests 'accessible after a gradient update' is the hard case, not the default one: facts become reliably usable more often by intervening on activations, externalizing to tools, or curating context than by writing them into the weights.
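The incremental-curation idea can be illustrated with a toy merge step that appends new playbook entries instead of rewriting the whole context. This is only a sketch of the append-and-deduplicate pattern; the function name, the case-insensitive duplicate check, and the example entries are assumptions made here and do not reproduce the ACE framework's actual generation, reflection, and curation components.

```python
def curate(playbook, new_items):
    """Append genuinely new entries; never rewrite or compress existing ones."""
    seen = {item.strip().lower() for item in playbook}
    merged = list(playbook)
    for item in new_items:
        key = item.strip().lower()
        if key and key not in seen:
            merged.append(item.strip())
            seen.add(key)
    return merged

# Toy example: one duplicate (differing only in case and whitespace) is
# skipped, one new lesson is added, and the original entry is kept verbatim.
playbook = ["Use tool X for dates"]
updated = curate(playbook, ["use tool x for dates ", "Check units"])
```

Because existing entries are never summarized or overwritten, detail accumulated in earlier iterations is not eroded by later ones.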
Sources (9 notes)
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
A formal proof and experiments show in-weight memorization is bounded by model size, while tool-use enables unbounded factual recall through a simple circuit. In-weight finetuning also degrades general capability by overwriting prior knowledge.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.
Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.
Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, an intrinsic sparsity that arises without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
LESS uses low-rank gradient features to select instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.
The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.