Knowledge Retrieval and RAG · LLM Reasoning and Architecture

Can retrieval knowledge fit into a small trained model?

Explores whether the information stored in large non-parametric retrieval datastores can be compressed into a compact parametric decoder that preserves long-tail knowledge while avoiding retrieval's inference-time search cost.

Note · 2026-04-18 · sourced from Memory
How should retrieval and reasoning integrate in RAG systems? · How do language models learn to think like humans? · How should researchers navigate LLM reasoning research?

Memory Decoder (2508.09874) addresses a fundamental tension in domain adaptation: RAG provides flexibility but adds inference latency through nearest-neighbor search; domain-adaptive pretraining embeds knowledge in weights but requires costly full-parameter training and risks catastrophic forgetting. Memory Decoder proposes a third path — compress the knowledge stored in large non-parametric datastores into a compact parametric model.
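A minimal sketch of what "compressing the retriever" can mean, assuming a standard kNN-LM setup: retrieved neighbors are softmaxed by distance into a next-token distribution, and a small decoder is trained to match that distribution under a KL objective. All names here (knn_distribution, imitation_loss, temperature) are illustrative, not from the paper's code:

```python
import torch
import torch.nn.functional as F

def knn_distribution(knn_dists, knn_values, vocab_size, temperature=1.0):
    """Turn retrieved neighbors into a next-token distribution (kNN-LM style).

    knn_dists:  (batch, k) distances from the query context embedding
                to each retrieved datastore key.
    knn_values: (batch, k) token ids stored as the neighbors' values.
    """
    weights = F.softmax(-knn_dists / temperature, dim=-1)  # closer => heavier
    probs = torch.zeros(knn_dists.size(0), vocab_size)
    probs.scatter_add_(1, knn_values, weights)             # aggregate mass per token id
    return probs

def imitation_loss(mem_logits, knn_probs):
    """KL divergence pushing the small decoder toward the retriever's distribution."""
    log_q = F.log_softmax(mem_logits, dim=-1)
    return F.kl_div(log_q, knn_probs, reduction="batchmean")

# Toy usage: 2 contexts, k=3 neighbors each, vocabulary of 10 tokens.
dists = torch.tensor([[0.1, 0.5, 2.0], [0.3, 0.3, 1.0]])
values = torch.tensor([[4, 4, 7], [2, 9, 2]])
target = knn_distribution(dists, values, vocab_size=10)
loss = imitation_loss(torch.randn(2, 10), target)
```

Once trained this way, the datastore and the nearest-neighbor index are no longer needed at inference; only the small decoder's forward pass remains.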

The approach pretrains a small transformer decoder to imitate the output distributions of a kNN-LM retriever. Once trained, it plugs into any language model sharing the same tokenizer via simple output interpolation — no model-specific modifications needed. The pretrained LM and Memory Decoder process the same input context in parallel, and their distributions are interpolated at output time.
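A hedged sketch of the output-time interpolation, assuming the standard convex mixture used in kNN-LM-style systems; the mixing weight lam is a tunable hyperparameter, and 0.3 is purely illustrative:

```python
import torch
import torch.nn.functional as F

def interpolate_next_token(base_logits, mem_logits, lam=0.3):
    """Convex mix of two next-token distributions; lam weights the Memory Decoder.

    base_logits, mem_logits: (batch, vocab) last-position logits from two models
    sharing one tokenizer. A convex combination of valid distributions is itself
    a valid distribution, so no renormalization is needed.
    """
    p_base = F.softmax(base_logits, dim=-1)
    p_mem = F.softmax(mem_logits, dim=-1)
    return lam * p_mem + (1.0 - lam) * p_base

# At inference, both models read the same context in parallel, e.g. with
# HuggingFace-style models (hypothetical names):
#   base_logits = base_lm(input_ids).logits[:, -1, :]
#   mem_logits  = mem_decoder(input_ids).logits[:, -1, :]
#   probs = interpolate_next_token(base_logits, mem_logits)
```

Because the combination happens purely at the output distributions, any base model with the same tokenizer can be augmented without touching its weights.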

Two capabilities validate the compression hypothesis: (1) Long-tail knowledge — for long-tail factual tokens such as "Jacobi" and "1906", Memory Decoder assigns far higher probabilities than the base model (68.94% vs. 0.12% in the paper's example), successfully capturing the memorization benefits of non-parametric methods. (2) Semantic coherence — for function words and logical continuations, Memory Decoder stays close to the base model's probabilities rather than following kNN-LM's distortions, preserving the coherent language modeling that pure retrieval sacrifices.

This bridges a gap in "How do knowledge injection methods trade off flexibility and cost?": Memory Decoder is a modular adapter that inherits retrieval's long-tail strength without retrieval's inference cost. It demonstrates that the information content of a large datastore can be compressed into orders-of-magnitude fewer parameters — suggesting retrieval-augmented knowledge may be more redundant than its datastore size implies.

The plug-and-play capability also connects to "Can neural memory modules scale language models beyond attention limits?" — both approaches add external memory as a parallel module rather than modifying the base model, but Memory Decoder targets domain knowledge while Titans targets sequence length.


Source: Memory

Original note title

compressing retrieval into a small parametric decoder eliminates datastore search at inference while preserving long-tail knowledge — a third path between RAG and fine-tuning