What mechanism enables models to retrieve from long context?
Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?
Across 4 model families, 6 model scales, and 3 types of finetuning, a specific type of attention head — retrieval heads — is largely responsible for retrieving relevant information from arbitrary locations in long context. Five key properties (a detection sketch follows the list):
- Universal: All explored models with long-context capability have retrieval heads.
- Sparse: Fewer than 5% of attention heads are retrieval heads.
- Intrinsic: They already exist in models pretrained on short context. Continual pretraining to 32-128K context lengths keeps relying on the same set of heads — no new retrieval mechanisms emerge.
- Dynamically activated: In Llama-2 7B, 12 retrieval heads attend to the required information no matter how the context changes; the remaining retrieval heads activate only for certain contexts.
- Causal: Completely pruning retrieval heads causes hallucination; pruning random non-retrieval heads has no effect on retrieval ability.
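To make detection concrete, here is a minimal sketch of the retrieval-score computation in the spirit of the paper: during needle-in-a-haystack decoding, a head scores a "copy" whenever its strongest attention weight lands on the needle token currently being generated. The data structures `step_attentions` and `copy_targets` and the ~0.1 threshold are illustrative assumptions, not the paper's exact code.

```python
import torch

def retrieval_scores(step_attentions, copy_targets, n_layers, n_heads):
    """Fraction of copied needle tokens on which each head's top
    attention weight points at the token being copied.

    step_attentions: list over decoding steps; step_attentions[t] is a
        [n_layers, n_heads, context_len] tensor holding each head's
        attention row for the newest query position.
    copy_targets: copy_targets[t] is the context position of the needle
        token the model copies at step t, or None when the generated
        token is not part of the needle.
    """
    hits = torch.zeros(n_layers, n_heads)
    copied = 0
    for attn, target in zip(step_attentions, copy_targets):
        if target is None:
            continue
        copied += 1
        top = attn.argmax(dim=-1)          # [n_layers, n_heads]
        hits += (top == target).float()
    return hits / max(copied, 1)           # score per head, in [0, 1]

# Heads whose score stays above ~0.1 across many needle positions and
# haystack lengths are labeled retrieval heads; under 5% typically qualify.
```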
The CoT connection: retrieval heads strongly influence chain-of-thought reasoning, where the model must repeatedly refer back to the question and to previously generated context. Tasks where the model generates directly from intrinsic (parametric) knowledge are less affected by retrieval head pruning.
This connects the factuality problem to the reasoning architecture: "Why does reasoning training help math but hurt medical tasks?" describes layer-level separation, while retrieval heads describe head-level specialization within that architecture — a sparse subset of the attention mechanism bridges information held in context to ongoing generation.
The practical implication for RAG systems: retrieval heads explain why models can struggle with long-context retrieval despite having the information in context. If retrieval heads are partially activated or not activated for a given needle, the model hallucinates. This is a mechanistic explanation for the Needle-in-a-Haystack failure mode.
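A hedged sketch of the causal test: zero out the identified heads with hooks and rerun the needle test. This assumes a Llama-style HuggingFace checkpoint where `model.model.layers[i].self_attn.o_proj` projects the concatenated head outputs and `head_dim` gives the per-head width; attribute names differ across model families, and the example head indices are hypothetical.

```python
import torch

def prune_heads(model, heads_to_prune):
    """Zero the pre-projection output of selected attention heads via
    forward pre-hooks on o_proj (Llama-style layout assumed; head h
    occupies the slice [h * head_dim, (h + 1) * head_dim))."""
    handles = []
    for layer_idx, head_idx in heads_to_prune:
        attn = model.model.layers[layer_idx].self_attn
        d = attn.head_dim

        def zero_head(module, args, h=head_idx, d=d):
            hidden = args[0].clone()       # [batch, seq, n_heads * d]
            hidden[..., h * d:(h + 1) * d] = 0
            return (hidden,) + args[1:]

        handles.append(attn.o_proj.register_forward_pre_hook(zero_head))
    return handles  # call h.remove() on each handle to restore the model

# Usage sketch: prune the top-scoring retrieval heads, rerun a
# needle-in-a-haystack prompt, and compare recall against pruning an
# equal number of random non-retrieval heads.
# handles = prune_heads(model, [(14, 24), (20, 7)])  # hypothetical heads
```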
Source: MechInterp
Related concepts in this collection
- Why does reasoning training help math but hurt medical tasks? Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains. Connection: layer-level separation; retrieval heads add head-level specialization within this architecture.
- Do language models actually use their encoded knowledge? Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing. Connection: retrieval heads are the mechanism that bridges encoding to generation for in-context information; their failure is one cause of the encoding≠generation gap.
- Which sentences actually steer a reasoning trace? Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking. Connection: retrieval heads are the mechanistic substrate that enables attending to thought anchors during CoT.
- Do transformers hide reasoning before producing filler tokens? Explores whether language models compute correct answers in early layers but then deliberately overwrite them with filler tokens in later layers, suggesting reasoning and output formatting are separable processes. Connection: explains why retrieval heads are necessary; if intermediate reasoning representations are overwritten in later layers, the model must retrieve from earlier positions via these sparse attention heads.
- Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations. Connection: retrieval heads provide a mechanistic lens on CoT faithfulness: if retrieval heads fail to attend to a reasoning step, that step cannot causally influence subsequent generation regardless of its logical validity; CoT faithfulness requires not just generating correct steps but having retrieval heads bridge them into downstream computation.
Original note title: retrieval heads are a universal sparse intrinsic mechanism for long-context factuality — pruning them causes hallucination