What is the cost difference between filtering context and attending to everything?
This explores the economics of two ways a model can handle a big pile of context — deciding what's worth attending to (sparse/selective) versus running full attention over everything (dense) — and what each actually costs.
The surprising answer from the corpus is that filtering is not just cheaper but often *better*, which upends the usual intuition that sparse attention trades quality for speed.
The cleanest reframe comes from the Sparse Frontier work (Does sparse attention trade off quality for speed?): at the same compute budget, a larger model that attends sparsely beats a smaller model that attends densely on long-context tasks. So sparsity doesn't move you *along* a cost-quality curve; it pushes the whole curve outward. Part of the reason is mechanical: only a tiny slice of attention heads (under 5%) actually do the long-range fact-fetching (What mechanism enables models to retrieve from long context?). Most of the dense attention you'd pay for does nothing for factual recall, while pruning the few heads that matter causes hallucination even when the answer is sitting right there in context. Attending to everything is, in large part, paying for compute you don't use.
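A rough back-of-the-envelope sketch of the compute argument above. It counts multiply-adds for attention scores and value mixing only, and assumes each query in the sparse case attends to a fixed number of keys. The function names, the 32 heads, and the head dimension of 128 are illustrative assumptions, not figures from the Sparse Frontier work, and the count ignores the cost of choosing which keys to keep.

```python
def dense_attention_flops(n_tokens: int, head_dim: int, n_heads: int) -> int:
    """Approximate multiply-adds for one layer of dense attention."""
    # Every query scores against all n keys, then mixes all n values.
    return 2 * n_heads * n_tokens * n_tokens * head_dim


def sparse_attention_flops(
    n_tokens: int, head_dim: int, n_heads: int, keys_per_query: int
) -> int:
    """Same count when each query attends to at most `keys_per_query` keys."""
    k = min(keys_per_query, n_tokens)  # cannot attend to more keys than exist
    return 2 * n_heads * n_tokens * k * head_dim


def cost_ratio(n_tokens: int, keys_per_query: int) -> float:
    """How many times more attention compute the dense variant spends."""
    # Hypothetical model shape: 32 heads of dimension 128.
    dense = dense_attention_flops(n_tokens, 128, 32)
    sparse = sparse_attention_flops(n_tokens, 128, 32, keys_per_query)
    return dense / sparse
```

Under these assumptions, `cost_ratio(131072, 4096)` evaluates to 32.0, and the ratio keeps growing with context length because the dense term is quadratic in `n_tokens` while the sparse term is linear. That freed-up budget is what a larger sparse model can spend elsewhere.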
But the corpus also reframes where the real cost lives. The long-context bottleneck isn't memory; it's the *compute* needed to consolidate evicted context into the model's internal state (its fast weights), and that cost scales with how hard the reasoning is (Is long-context bottleneck really about memory or compute?). That's the hidden price of "attend to everything": you're not just storing tokens, you're re-deriving meaning from them every pass. Filtering approaches sidestep this. Atom of Thoughts goes furthest: it is a Markov-style reasoning process in which each step depends only on the current problem, not the accumulated history, so it drops the baggage entirely without losing the answer (Can reasoning systems forget history without losing coherence?). DeepRAG makes the filter a learned decision: it treats "retrieve vs. rely on what I already know" as a step-by-step choice, and accuracy improves by about 22%, partly from better-targeted retrieval and partly from *removing* the noise of context that wasn't needed (When should language models retrieve external knowledge versus use internal knowledge?).
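To make the "filter as a per-step decision" idea concrete, here is a minimal sketch. It is not DeepRAG's method: DeepRAG learns its retrieve-or-not policy, whereas this stand-in uses a fixed confidence threshold. The `Step` class, the `confidence` field, the threshold, and the tokens-per-retrieval figure are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    question: str      # one sub-question in a multi-step answer
    confidence: float  # hypothetical self-estimate in [0, 1] of knowing the answer


def plan_retrieval(steps: List[Step], threshold: float = 0.7) -> List[str]:
    """Return the sub-questions worth retrieving for; the rest use internal knowledge."""
    return [s.question for s in steps if s.confidence < threshold]


def context_tokens(steps: List[Step], threshold: float, tokens_per_retrieval: int) -> int:
    """Tokens of retrieved context added under a given threshold."""
    return len(plan_retrieval(steps, threshold)) * tokens_per_retrieval


steps = [
    Step("Who directed the film?", 0.9),
    Step("What was its 2003 box office?", 0.4),
    Step("Which is larger?", 0.95),
]
selective = context_tokens(steps, threshold=0.7, tokens_per_retrieval=500)
always = context_tokens(steps, threshold=1.01, tokens_per_retrieval=500)
```

With these made-up numbers, the selective plan retrieves for one step (500 tokens) while always-retrieve adds 1500. The point of the sketch is only that skipping a retrieval removes both its cost and whatever noise it would have injected.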
There's a deeper principle underneath all of this: spending should track difficulty, not be uniform. Compute-optimal scaling shows that handing every prompt the same budget wastes effort on easy ones and starves hard ones, while adaptive allocation wins with the same total spend (Can we allocate inference compute based on prompt difficulty?; How should we allocate compute budget at inference time?). "Attend to everything" is the uniform-budget mistake applied to context: it treats every token as equally worth the cost. Filtering is the adaptive version. Confidence-aware step filtering even lets you stop early when a trace goes bad, matching majority-vote accuracy with far fewer traces (Does step-level confidence outperform global averaging for trace filtering?).
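The uniform-versus-adaptive contrast can be sketched as a simple allocation rule. This is an illustration under stated assumptions, not the procedure from the cited notes: difficulty is a hypothetical integer score per prompt, the budget is a number of samples, and the proportional split is just one possible adaptive rule.

```python
from typing import List


def uniform_budget(difficulties: List[int], total: int) -> List[int]:
    """Every prompt gets the same share, regardless of difficulty."""
    n = len(difficulties)
    return [total // n] * n


def adaptive_budget(difficulties: List[int], total: int, floor: int = 1) -> List[int]:
    """Give each prompt `floor` samples, then split the rest in proportion to difficulty."""
    n = len(difficulties)
    spare = total - floor * n
    if spare < 0:
        raise ValueError("total budget is too small for the per-prompt floor")
    weight = sum(difficulties)
    if weight == 0:
        return uniform_budget(difficulties, total)
    # Integer division keeps the shares exact; rounding down may leave a remainder.
    shares = [floor + spare * d // weight for d in difficulties]
    leftover = total - sum(shares)
    # Hand the remainder to the hardest prompts first.
    hardest_first = sorted(range(n), key=lambda i: difficulties[i], reverse=True)
    for i in hardest_first[:leftover]:
        shares[i] += 1
    return shares


difficulties = [1, 1, 8]  # hypothetical scores: two easy prompts, one hard
uniform = uniform_budget(difficulties, total=30)
adaptive = adaptive_budget(difficulties, total=30)
```

Both plans spend the same 30 samples. The uniform plan gives 10 to each prompt, while the adaptive plan moves most of the budget to the hard prompt and leaves the easy ones with a handful each.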
The thing you didn't know you wanted to know: the cost question flips once context *persists*. In a 115-day agent study, 82.9% of tokens were cache reads, meaning context that was filtered, kept, and reused so many times that "cost per token" stopped being the right unit at all; the meaningful denominator became completed work, not tokens processed (Do persistent agents really cost less per token?). So the real difference isn't filtering vs. attending. It's whether you pay to re-attend to the same context over and over, or pay once to decide what's worth keeping.
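A small worked calculation shows why the unit changes when most tokens are cache reads. Only the 82.9% cache-read share comes from the study cited above; the prices, token volume, and artifact count are hypothetical placeholders, and the function names are illustrative.

```python
def blended_price_per_token(
    cache_read_share: float, fresh_price: float, cache_read_price: float
) -> float:
    """Average price per token when a share of tokens are cheaper cache reads."""
    return cache_read_share * cache_read_price + (1 - cache_read_share) * fresh_price


def cost_per_artifact(
    total_tokens: int,
    artifacts: int,
    cache_read_share: float,
    fresh_price: float,
    cache_read_price: float,
) -> float:
    """Total spend divided by completed pieces of work, not by tokens."""
    price = blended_price_per_token(cache_read_share, fresh_price, cache_read_price)
    return total_tokens * price / artifacts


# Hypothetical units: fresh tokens cost 1.0, cache reads cost 0.1.
blended = blended_price_per_token(0.829, fresh_price=1.0, cache_read_price=0.1)
per_artifact = cost_per_artifact(1_000_000, 50, 0.829, 1.0, 0.1)
```

With a cache-read price assumed at one tenth of the fresh price, the blended price comes to roughly a quarter of the fresh price. Dividing by completed artifacts rather than tokens then reports what the reused context actually bought.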
Sources (9 notes)
The Sparse Frontier benchmark shows that at equivalent compute cost, larger sparse-attention models outperform smaller dense models on long-context tasks. Sparsity lets you train bigger models within the same budget, making it Pareto-improving rather than a pure trade-off.
Fewer than 5% of attention heads, across all model families, function as retrieval heads. These heads are intrinsic to short-context models, activate dynamically depending on context, and are causally necessary for factuality. Pruning them causes hallucination despite the information being present in context.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.
Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.
A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.