How does modality-specific sparsity enable capacity flexibility that dense models cannot provide?

This explores how letting a model spend its parameters per-token — turning capacity on only where a given input (a vision token vs. a language token) needs it — solves problems that a fixed, fully-dense network runs into when different modalities have to share the same weights.

A dense model forces every input through the same fixed set of weights; sparsity that adapts to what each token is (image vs. text) buys a kind of flexibility that design can't. The sharpest evidence is on modality competition: when you train one network on both vision and language, the two fight over the same parameters, and the usual story is that they're just incompatible. The corpus pushes back: the fight turns out to be largely architectural, not inherent. Rigid dense capacity allocation, together with caption distributional shift, is what creates the bottleneck, and Mixture-of-Experts dissolves the architectural part by routing each token to its own experts, so vision and language stop competing for the same slots and can coexist (see "Can we solve modality competition through architectural design?"). That's the core mechanism: dense means "everyone shares," sparse means "each token gets what it needs."
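The routing idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the method of any cited work: the function name `moe_layer`, the linear "experts", and the hand-built router weights are all assumptions chosen so the behavior is easy to see.

```python
import numpy as np

def moe_layer(tokens, router_w, expert_ws, k=1):
    """Route each token to its top-k experts and mix their outputs.

    tokens:    (n_tokens, d_model) float array
    router_w:  (d_model, n_experts) hypothetical router weights
    expert_ws: (n_experts, d_model, d_model), one linear map per expert
    """
    logits = tokens @ router_w                         # (n_tokens, n_experts)
    topk = np.argsort(-logits, axis=1)[:, :k]          # chosen expert ids per token
    chosen = np.take_along_axis(logits, topk, axis=1)  # logits of the chosen experts
    # Softmax over the selected experts only (numerically stabilized).
    chosen = chosen - chosen.max(axis=1, keepdims=True)
    gates = np.exp(chosen)
    gates /= gates.sum(axis=1, keepdims=True)

    out = np.zeros_like(tokens)
    for slot in range(k):
        for e in range(expert_ws.shape[0]):
            mask = topk[:, slot] == e                  # tokens sent to expert e
            if mask.any():
                weight = gates[mask, slot][:, None]
                out[mask] += weight * (tokens[mask] @ expert_ws[e])
    return out, topk

# Toy setup (illustrative values): feature 0 marks "vision", feature 1 marks "text".
tokens = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
router_w = np.eye(2)                                   # sends each modality to its own expert
expert_ws = np.stack([2.0 * np.eye(2), -1.0 * np.eye(2)])
out, topk = moe_layer(tokens, router_w, expert_ws, k=1)
print(topk[:, 0])  # which expert each token used
```

In this toy case the two vision tokens use only expert 0 and the two text tokens use only expert 1, so neither modality's weights are touched by the other. A real router is learned rather than hand-set, and usually needs load-balancing terms that this sketch omits.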

Why this is closer to a Pareto improvement than a trade-off shows up in the sparse-attention work. The intuition is that sparsity saves compute by throwing away quality, but the Sparse Frontier benchmark finds the opposite. At an equal compute budget, a larger sparse-attention model beats a smaller dense one on long-context tasks, because sparsity lets you afford a bigger model for the same cost (see "Does sparse attention trade off quality for speed?"). So sparsity isn't just a way to fit competing modalities side by side; it's a way to grow total capacity without paying dense prices for it. The flexibility and the efficiency are two sides of the same coin.
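A back-of-the-envelope calculation shows how sparse attention can make a bigger model affordable at long context. The constants below are rough textbook-style estimates (about 12 · layers · width² non-embedding parameters, 2 FLOPs per parameter per token, and about 4 · context · width attention FLOPs per layer per token), and the model shapes and the 10% keep fraction are made-up values for illustration. The sketch does not reproduce the benchmark and says nothing about quality, only cost.

```python
def approx_params(n_layer, d_model):
    """Rough non-embedding parameter count of a standard transformer (assumed estimate)."""
    return 12 * n_layer * d_model ** 2

def forward_flops_per_token(n_layer, d_model, n_ctx, keep_fraction=1.0):
    """Approximate forward FLOPs per token.

    keep_fraction is the share of key/value positions each query attends to:
    1.0 for dense attention, below 1.0 for a hypothetical sparse pattern.
    """
    weight_flops = 2 * approx_params(n_layer, d_model)
    # Attention scores plus the weighted sum of values, each ~2 * n_ctx * d_model per layer.
    attn_flops = 4 * n_layer * n_ctx * d_model * keep_fraction
    return weight_flops + attn_flops

n_ctx = 131_072  # illustrative long-context length

dense_small = forward_flops_per_token(n_layer=12, d_model=768, n_ctx=n_ctx)
sparse_large = forward_flops_per_token(n_layer=24, d_model=1536, n_ctx=n_ctx,
                                       keep_fraction=0.1)
param_ratio = approx_params(24, 1536) / approx_params(12, 768)

print(f"dense small : {dense_small:.3e} FLOPs/token")
print(f"sparse large: {sparse_large:.3e} FLOPs/token")
print(f"parameter ratio (large / small): {param_ratio:.0f}x")
```

Under these assumptions the sparse model has 8x the parameters yet costs less per token (roughly 3.3e9 vs. 5.0e9 FLOPs), because at this context length attention dominates the dense model's cost. At short contexts the weight term dominates instead and the advantage disappears.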

What's quietly fascinating is that models seem to reach for sparsity on their own, even without anyone designing it in. Hidden states get sharply sparser when a task is unfamiliar or out-of-distribution, and this acts as a stabilizing filter, not a breakdown (see "Do language models sparsify their activations under difficult tasks?"). The companion finding is that density is learned: networks build dense activations for the data they've seen a lot of, and default to sparse ones for the unfamiliar (see "Is representational sparsity learned or intrinsic to neural networks?"). Read together, these say capacity allocation is something models naturally make conditional on the input; engineered modality-specific sparsity is just making deliberate what the network already gropes toward.
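To make "sparser hidden states" concrete, here are two common ways to score the sparsity of an activation vector. The source notes do not say which metric the underlying studies use, so both functions, their names, and the tolerance value are assumptions for illustration.

```python
import numpy as np

def hoyer_sparsity(h):
    """Hoyer sparsity: 0 for a perfectly uniform vector, 1 for a one-hot vector."""
    h = np.asarray(h, dtype=float).ravel()
    n = h.size
    l1 = np.abs(h).sum()
    l2 = np.sqrt((h ** 2).sum())
    if n < 2 or l2 == 0.0:
        raise ValueError("need a non-zero vector with at least 2 entries")
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)

def near_zero_fraction(h, rel_tol=1e-2):
    """Share of entries whose magnitude is below rel_tol times the largest magnitude."""
    h = np.abs(np.asarray(h, dtype=float).ravel())
    return float((h < rel_tol * h.max()).mean())

# Synthetic stand-ins for hidden states (not real model activations).
dense_state = np.ones(16)
sparse_state = np.zeros(16)
sparse_state[3] = 5.0

for name, state in [("dense", dense_state), ("sparse", sparse_state)]:
    print(name, hoyer_sparsity(state), near_zero_fraction(state))
```

Tracking a score like this per layer, on familiar versus out-of-distribution inputs, is one way the reported effect could be checked on a real model.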

There's a reason this matters specifically for modalities and not only for efficiency. Text-only models inherit the abstraction limits baked into language: text strips out physics, geometry, and causality, so symbol-manipulation alone produces predictable failures on physical reasoning (see "Are text-only language models fundamentally limited by abstraction?"). The way out is multimodal grounding, which means you have to host genuinely different kinds of representation in one model, exactly the situation where dense sharing breaks down and per-token capacity becomes the enabling trick rather than a nice-to-have. And the broader scaling literature hints capacity flexibility is multidimensional: for tiny models, depth beats width because layering composes abstractions better than spreading parameters (see "Does depth matter more than width for tiny language models?"). That is another sign that *how* you allocate capacity can matter more than how much you have.

The thing you might not have known you wanted to know: sparsity isn't primarily a compression story here. It's a *coexistence* story. Dense models impose a single shared budget on inputs that have fundamentally different needs, and the cost shows up as modalities cannibalizing each other. Routing capacity per token turns that zero-sum fight into something closer to additive — which is why the same mechanism that lets vision and language share a brain also lets a sparse model out-punch a dense one at equal cost.

Sources (6 notes)

Can we solve modality competition through architectural design?

Modality competition arises from caption distributional shift and rigid dense capacity allocation, not from vision and language being fundamentally incompatible. Mixture of Experts resolves the architectural bottleneck by allocating capacity per token, enabling modalities to coexist without competing.

Does sparse attention trade off quality for speed?

The Sparse Frontier benchmark shows that at equivalent compute cost, larger sparse-attention models outperform smaller dense models on long-context tasks. Sparsity lets you train bigger models within the same budget, making it Pareto-improving rather than a pure trade-off.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
