How do model priors enable targeted context queries without full attention?
This explores how a model's pre-trained knowledge (its priors) lets it find the right needle in a long context using only a sparse slice of its machinery — rather than every attention head scanning everything — and the corpus suggests the story is one of sparse, intrinsic mechanisms in productive tension with the priors themselves.
The corpus has a surprisingly concrete answer to that question, hiding under different vocabulary. The cleanest piece is the discovery that retrieval isn't spread across the whole network: fewer than 5% of attention heads do the actual fact-fetching, and these 'retrieval heads' are universal across model families, present even in short-context models, and switch on dynamically depending on what the context asks for (What mechanism enables models to retrieve from long context?). They're causally necessary: prune them and the model hallucinates even though the answer is sitting right there in the prompt. That's the mechanism your question is circling: targeted querying is already sparse and prior-shaped, not full attention.
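To make 'a small set of heads does the fetching' concrete, here is a minimal sketch of how you might score heads for copy-from-context behavior. Everything in it is an assumption for illustration: the scoring rule (a head gets credit when its most-attended position sits inside the needle and holds the token being generated), the 0.5 cutoff, the `(layer, head)` identifiers, and the hand-written attention positions. It is not the procedure from the cited note, which only reports the finding.

```python
from typing import Dict, List, Sequence, Tuple

Head = Tuple[int, int]  # hypothetical (layer, head) identifier


def retrieval_score(
    argmax_positions: Sequence[int],
    generated: Sequence[str],
    context: Sequence[str],
    needle_span: Tuple[int, int],
) -> float:
    """Fraction of decoding steps where the head's most-attended context
    position lies inside the needle AND holds the token being generated."""
    start, end = needle_span  # half-open span [start, end)
    if not generated:
        return 0.0
    hits = sum(
        1
        for pos, tok in zip(argmax_positions, generated)
        if start <= pos < end and context[pos] == tok
    )
    return hits / len(generated)


def select_retrieval_heads(
    scores: Dict[Head, float], threshold: float = 0.5
) -> List[Head]:
    """Keep heads whose score clears the (assumed) threshold."""
    return sorted(h for h, s in scores.items() if s >= threshold)


# Toy data: the needle is "the code is 7421" at positions 4..7.
context = ["the", "sky", "is", "blue", "the", "code", "is", "7421", "ok"]
needle_span = (4, 8)
generated = ["the", "code", "is", "7421"]

# Hand-written stand-in for per-step argmax attention positions.
argmax_by_head: Dict[Head, List[int]] = {
    (0, 0): [4, 5, 6, 7],  # tracks the needle at every step
    (0, 1): [0, 1, 2, 3],  # attends to the distractor sentence
    (1, 0): [4, 5, 2, 7],  # slips outside the needle once
}

scores = {
    head: retrieval_score(positions, generated, context, needle_span)
    for head, positions in argmax_by_head.items()
}
print(select_retrieval_heads(scores))
```

In a real model the argmax positions would come from the attention weights at each decoding step; the point of the sketch is only that 'retrieval head' can be an operational, per-head measurement rather than a metaphor.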
But 'enabled by priors' cuts both ways, and the interesting tension is that the same priors that let a model know what to look for can also stop it from looking. When a model's parametric knowledge is strong, it overrides what's actually in the context, and no amount of clever prompting fixes this, because text alone can't beat a confident prior; you need to intervene in the representations themselves (Why do language models ignore information in their context?). So priors are both the targeting system and the failure mode: they tell the retrieval heads where to aim, but if they're too loud the model answers from memory instead of from the page.
There's a deeper claim worth knowing here: prompting and context queries only ever reorganize what the model already knows; they can't inject anything new. Prompt optimization works entirely inside the training distribution, creating a hard ceiling no query strategy can cross (Can prompt optimization teach models knowledge they lack?). This reframes 'targeted context query' as activation rather than retrieval: you're not pulling information in so much as triggering knowledge that's already latent. The priming research makes this almost quantitative: whether a context cue successfully activates a piece of knowledge is predictable in advance from the keyword's pre-existing probability, with a sharp threshold around 10^-3 separating cues that fire from those that don't (Can we predict keyword priming before learning happens?). Priors don't just enable targeted queries; they decide which queries can land at all.
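The threshold claim is simple enough to sketch. The snippet below only partitions cues by which side of ~10^-3 their pre-existing keyword probability falls on; the notes here don't say which side is the one that fires, so the code deliberately doesn't either. The cue names and probabilities are made up for illustration, and `partition_by_prior` is a hypothetical helper, not anything from the cited work.

```python
from typing import Dict, List, Tuple

PRIMING_THRESHOLD = 1e-3  # the ~10^-3 figure quoted above


def partition_by_prior(
    cue_probs: Dict[str, float], threshold: float = PRIMING_THRESHOLD
) -> Tuple[List[str], List[str]]:
    """Split cues into (below threshold, at or above threshold) based on
    the keyword's probability before any new learning takes place."""
    below = sorted(c for c, p in cue_probs.items() if p < threshold)
    at_or_above = sorted(c for c, p in cue_probs.items() if p >= threshold)
    return below, at_or_above


# Invented probabilities, purely for illustration.
cue_probs = {"vermilion": 2e-5, "red": 4e-2, "scarlet": 9e-4, "crimson": 1e-3}

below, at_or_above = partition_by_prior(cue_probs)
print("below threshold:", below)
print("at or above threshold:", at_or_above)
```

The useful part is that the split can be computed before you run the experiment: it needs only the model's existing probabilities, which is what makes the priming effect predictable in advance.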
If you want the architectural alternative (what happens when you stop relying on attention to do the fetching), the Titans line splits the problem in two: keep attention for short-range work and offload long-range recall to a separate neural memory that selectively stores 'surprising' tokens, scaling past 2M tokens without attention's quadratic cost (Can neural memory modules scale language models beyond attention limits?). And a complementary reframing argues the real long-context bottleneck was never memory capacity but the compute needed to fold evicted context into the model's fast weights, essentially turning context into prior (Is long-context bottleneck really about memory or compute?). Read together, these say the field is actively trying to replace 'full attention over everything' with sparse, prior-mediated targeting, which is exactly the move the retrieval-heads finding shows the model already half-discovered on its own.
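The 'store only what's surprising' idea can be illustrated with a toy gate. This is a stand-in and not the Titans update rule: it assumes surprise is the negative log-probability the model assigned to each token, and it keeps a plain list instead of a learned neural memory. The 3-nat threshold, the `select_for_memory` helper, and the token probabilities are all invented for the example.

```python
import math
from typing import List, Sequence, Tuple


def surprise(prob: float) -> float:
    """Surprisal in nats: rarer predictions score higher."""
    return -math.log(prob)


def select_for_memory(
    tokens_with_probs: Sequence[Tuple[str, float]],
    threshold_nats: float = 3.0,  # assumed cutoff, not from the source
) -> List[str]:
    """Keep only tokens whose surprisal exceeds the threshold."""
    return [
        tok for tok, p in tokens_with_probs if surprise(p) > threshold_nats
    ]


# Each pair is (token, probability the model gave that token in context).
stream = [("the", 0.6), ("cat", 0.2), ("zyx", 0.001), ("sat", 0.3)]

memory = select_for_memory(stream)
print(memory)
```

Even in this toy form the shape of the trade is visible: predictable tokens are left to short-range attention, and only the tokens the prior failed to anticipate earn a slot in long-term storage.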
The thing you didn't know you wanted to know: the boundary between 'context' and 'prior' is far blurrier than the framing implies. Targeted querying isn't a model reaching out to grab external facts — it's a sparse set of inherited circuits deciding which of its own latent associations to wake up, gated by probabilities set long before your prompt arrived.
Sources (6 notes)
Fewer than 5% of attention heads, across all model families, function as retrieval heads. These heads are already present in short-context models, activate dynamically depending on the context, and are causally necessary for factuality. Pruning them causes hallucination even when the information is present in context.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.