
How much sparsity can different reasoning tasks actually tolerate?

Different NLP tasks show vastly different tolerance for sparse attention—from 95% on simple QA to 50-67% on multi-hop reasoning. What structural differences explain this variation, and how should it shape deployment decisions?

Note · 2026-05-18 · sourced from LLM Architecture

The Sparse Frontier benchmark separates tasks into groups and reports sparsity tolerance per group. The variation is dramatic. Single-QA tasks (QuALITY, SQuAD, TOEFL) tolerate sparsity 0.95, a 1/20 attention budget, with minimal degradation across all six methods evaluated. Multiple-QA tasks (Ruler NIAH, Story Retrieval) show substantial degradation at sparsity 0.8–0.9 (1/5 to 1/10 budget). Tasks with high scope or high information dispersion degrade even at a modest sparsity of 0.5–0.67 (1/2 to 1/3 budget) for some methods.
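The budget fractions quoted above follow directly from the sparsity level: the remaining attention budget is one minus the sparsity. A minimal sketch of that arithmetic, where the function names `attention_budget` and `compression_ratio` are illustrative and not taken from the benchmark:

```python
def attention_budget(sparsity: float) -> float:
    """Fraction of attention computation that remains at a given sparsity."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 - sparsity


def compression_ratio(sparsity: float) -> float:
    """How many times smaller the budget is than dense attention (the N in 1/N)."""
    return 1.0 / attention_budget(sparsity)


# Sparsity levels mentioned in the note.
for s in (0.5, 0.67, 0.8, 0.9, 0.95):
    print(f"sparsity {s:.2f} -> budget 1/{compression_ratio(s):.1f}")
```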

The pattern is structural. Single-QA tasks let a small subset of attention heads handle the entire reasoning load — find the relevant span, attend to it, generate the answer. The model can drop 95% of attention computation because only a few tokens were going to do the work anyway. Multi-hop tasks require attention to multiple regions and to the relationships between them. Each hop is a place where attention sparsification can lose the thread. Aggregation tasks require attention to many tokens whose individual contribution is small but whose collective signal is the answer. Dropping any of them is costly because no single retained token compensates.
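A toy illustration of why these cases differ, under loud assumptions: the two attention distributions are invented, and the sparsifier is an idealized top-k selector that always keeps the highest-weight tokens. The six methods the benchmark evaluates are not modeled here. The sketch only shows how much attention mass survives at sparsity 0.95 when the weight is concentrated versus spread evenly.

```python
def retained_mass(weights, sparsity):
    """Share of total attention mass kept when only the top
    (1 - sparsity) fraction of tokens is retained (idealized top-k)."""
    keep = max(1, round(len(weights) * (1 - sparsity)))
    top = sorted(weights, reverse=True)[:keep]
    return sum(top) / sum(weights)


n = 1000  # hypothetical context length in tokens

# Hypothetical single-QA pattern: one token carries 90% of the weight.
peaked = [0.9] + [0.1 / (n - 1)] * (n - 1)

# Hypothetical aggregation pattern: every token contributes equally.
uniform = [1.0 / n] * n

peaked_kept = retained_mass(peaked, 0.95)
uniform_kept = retained_mass(uniform, 0.95)

print(f"peaked:  {peaked_kept:.3f} of attention mass retained")
print(f"uniform: {uniform_kept:.3f} of attention mass retained")
```

In the peaked case the retained tokens hold more than 90% of the mass. In the uniform case they hold only about 5%, which mirrors the point that no single retained token compensates for the dropped ones.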

The deployment risk this surfaces is concrete: a sparse-attention method that benchmarks well on single-QA may fail dramatically on multi-hop reasoning. Reporting sparsity tolerance only on easy tasks overstates how much sparsity is safe in production, where the task mix is heterogeneous. Robust deployment requires testing across diverse task characteristics, particularly along two axes: scope (how many distinct facts the answer requires) and dispersion (how spread out those facts are in context).

For builders, this argues against headline sparsity claims. "Our method runs at 95% sparsity" is true on QuALITY and misleading on aggregation. The relevant question is "what is the safe sparsity for the task mix we're deploying against?" — and the answer varies by deployment, not just by method.
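One conservative way to answer that question is to take the lowest tolerance among the task groups a deployment actually serves. The sketch below assumes this min-over-tasks rule, which is a design choice and not something the benchmark prescribes. The tolerance values are illustrative lower ends of the ranges quoted in this note, and the names `ILLUSTRATIVE_TOLERANCE` and `safe_sparsity` are hypothetical.

```python
# Illustrative per-group tolerances (lower ends of the ranges in this note).
# Real values depend on the method and should come from measurement.
ILLUSTRATIVE_TOLERANCE = {
    "single_qa": 0.95,
    "multi_qa": 0.8,
    "aggregation": 0.5,
}


def safe_sparsity(task_mix, tolerance):
    """Return the highest sparsity that every task group present in
    task_mix tolerates. task_mix maps group name -> traffic share."""
    active = [group for group, share in task_mix.items() if share > 0]
    if not active:
        raise ValueError("task_mix has no group with positive share")
    unknown = [group for group in active if group not in tolerance]
    if unknown:
        raise KeyError(f"no measured tolerance for: {unknown}")
    return min(tolerance[group] for group in active)


# Two hypothetical deployments using the same method.
qa_only = {"single_qa": 1.0}
mixed = {"single_qa": 0.9, "aggregation": 0.1}

print(safe_sparsity(qa_only, ILLUSTRATIVE_TOLERANCE))
print(safe_sparsity(mixed, ILLUSTRATIVE_TOLERANCE))
```

Under this rule even a small share of aggregation traffic pulls the safe setting down to the aggregation tolerance. A deployment that can accept some degradation on rare tasks might weight by traffic share instead.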

The methodological consequence for benchmark designers: sparsity-tolerance benchmarks need to span scope and dispersion, not just topical diversity. A benchmark suite covering only single-QA tasks will reward methods that fail in production.
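A suite's coverage of the two axes can be checked mechanically. The sketch below assumes each task has been hand-labeled as low or high on scope and dispersion. The binary levels, the task names, and the labels are all hypothetical, since the note does not define a labeling scheme.

```python
from itertools import product

LEVELS = ("low", "high")  # assumed binary labels per axis


def uncovered_cells(suite):
    """Return the (scope, dispersion) combinations no task in the suite exercises."""
    covered = {(task["scope"], task["dispersion"]) for task in suite}
    return [cell for cell in product(LEVELS, LEVELS) if cell not in covered]


# Hypothetical suite containing only single-QA style tasks.
single_qa_only = [
    {"name": "task_a", "scope": "low", "dispersion": "low"},
    {"name": "task_b", "scope": "low", "dispersion": "low"},
]

gaps = uncovered_cells(single_qa_only)
print(gaps)
```

A non-empty result flags the regions where sparsity tolerance has not been measured at all.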

Original note title

sparsity tolerance is task-dependent — single QA tolerates 95 percent sparsity while multi-hop and aggregation tasks fail at 50-67 percent