Knowledge Retrieval and RAG · LLM Reasoning and Architecture

Can retrieval knowledge fit into a small trained model?

Explores whether the information stored in large non-parametric retrieval datastores can be compressed into a compact parametric decoder that preserves long-tail knowledge while avoiding retrieval's inference-time search cost.

Note · 2026-04-18 · sourced from Memory
How should retrieval and reasoning integrate in RAG systems? · How do language models learn to think like humans? · How should researchers navigate LLM reasoning research?

Memory Decoder (2508.09874) addresses a fundamental tension in domain adaptation: RAG provides flexibility but adds inference latency through nearest-neighbor search; domain-adaptive pretraining embeds knowledge in weights but requires costly full-parameter training and risks catastrophic forgetting. Memory Decoder proposes a third path — compress the knowledge stored in large non-parametric datastores into a compact parametric model.
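A minimal sketch of what "compressing the retriever" can mean, assuming a standard kNN-LM setup: retrieved neighbors are softmaxed by distance into a next-token distribution, and a small decoder is trained to match that distribution under a KL objective. All names here (knn_distribution, imitation_loss, temperature) are illustrative, not from the paper's code:

```python
import torch
import torch.nn.functional as F

def knn_distribution(knn_dists, knn_values, vocab_size, temperature=1.0):
    """Turn retrieved neighbors into a next-token distribution (kNN-LM style).

    knn_dists:  (batch, k) distances from the query context embedding
                to each retrieved datastore key.
    knn_values: (batch, k) token ids stored as the neighbors' values.
    """
    weights = F.softmax(-knn_dists / temperature, dim=-1)  # closer => heavier
    probs = torch.zeros(knn_dists.size(0), vocab_size)
    probs.scatter_add_(1, knn_values, weights)             # aggregate mass per token id
    return probs

def imitation_loss(mem_logits, knn_probs):
    """KL divergence pushing the small decoder toward the retriever's distribution."""
    log_q = F.log_softmax(mem_logits, dim=-1)
    return F.kl_div(log_q, knn_probs, reduction="batchmean")

# Toy usage: 2 contexts, k=3 neighbors each, vocabulary of 10 tokens.
dists = torch.tensor([[0.1, 0.5, 2.0], [0.3, 0.3, 1.0]])
values = torch.tensor([[4, 4, 7], [2, 9, 2]])
target = knn_distribution(dists, values, vocab_size=10)
loss = imitation_loss(torch.randn(2, 10), target)
```

Once trained this way, the datastore and the nearest-neighbor index are no longer needed at inference; only the small decoder's forward pass remains.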

The approach pretrains a small transformer decoder to imitate the output distributions of a kNN-LM retriever. Once trained, it plugs into any language model sharing the same tokenizer via simple output interpolation — no model-specific modifications needed. The pretrained LM and Memory Decoder process the same input context in parallel, and their distributions are interpolated at output time.
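A hedged sketch of the output-time interpolation, assuming the standard convex mixture used in kNN-LM-style systems; the mixing weight lam is a tunable hyperparameter, and 0.3 is purely illustrative:

```python
import torch
import torch.nn.functional as F

def interpolate_next_token(base_logits, mem_logits, lam=0.3):
    """Convex mix of two next-token distributions; lam weights the Memory Decoder.

    base_logits, mem_logits: (batch, vocab) last-position logits from two models
    sharing one tokenizer. A convex combination of valid distributions is itself
    a valid distribution, so no renormalization is needed.
    """
    p_base = F.softmax(base_logits, dim=-1)
    p_mem = F.softmax(mem_logits, dim=-1)
    return lam * p_mem + (1.0 - lam) * p_base

# At inference, both models read the same context in parallel, e.g. with
# HuggingFace-style models (hypothetical names):
#   base_logits = base_lm(input_ids).logits[:, -1, :]
#   mem_logits  = mem_decoder(input_ids).logits[:, -1, :]
#   probs = interpolate_next_token(base_logits, mem_logits)
```

Because the combination happens purely at the output distributions, any base model with the same tokenizer can be augmented without touching its weights.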

Two capabilities validate the compression hypothesis: (1) Long-tail knowledge — for long-tail factual tokens such as "Jacobi" and "1906", Memory Decoder assigns far higher probabilities than the base model (68.94% vs. 0.12% in the paper's example), successfully capturing the memorization benefits of non-parametric methods. (2) Semantic coherence — for function words and logical continuations, Memory Decoder stays close to the base model's probabilities rather than following kNN-LM's distortions, preserving the coherent language modeling that pure retrieval sacrifices.

This bridges a gap in "How do knowledge injection methods trade off flexibility and cost?": Memory Decoder is a modular adapter that inherits retrieval's long-tail strength without retrieval's inference cost. It demonstrates that the information content of a large datastore can be compressed into orders-of-magnitude fewer parameters — suggesting retrieval-augmented knowledge may be more redundant than its datastore size implies.

The plug-and-play capability also connects to "Can neural memory modules scale language models beyond attention limits?" — both approaches add external memory as a parallel module rather than modifying the base model, but Memory Decoder targets domain knowledge while Titans targets sequence length.


Source: Memory

Original note title

compressing retrieval into a small parametric decoder eliminates datastore search at inference while preserving long-tail knowledge — a third path between RAG and fine-tuning