Reasoning and Learning Architectures

Can we solve modality competition through architectural design?

Does modality competition in multimodal models stem from fundamental training conflicts, or from specific architectural choices? Understanding the root cause could reveal whether the trade-off is solvable.

Note · 2026-05-18 · sourced from Multimodal

A common assumption in multimodal pretraining is that vision and language are inherently at odds in a single model: training on one degrades the other. The "modality tax" is treated as a feature of the joint training problem rather than a design artifact. The paper Beyond Language Modeling argues that this assumption is wrong and identifies the design choices that dissolve the competition.

The paper localizes two sources of modality friction, neither of which is the visual modality itself. The first is distributional: image-text captions are a peculiar subset of language, and friction with general language pretraining comes from the caption distribution shift rather than from vision. Pure video is in fact complementary to language; general multimodal pretraining yields positive transfer for visual question-answering and world modeling. The second source is architectural: dense models rigidly allocate fixed capacity across modalities, and modality-specific feedforward networks only partially address this rigidity.

Mixture of Experts (MoE) resolves the architectural source. By learning to allocate capacity per token, MoE removes the rigid trade-off that dense models impose. Vision tokens activate vision-specialized experts; language tokens activate language-specialized experts; some experts learn to handle both. The capacity that any one modality consumes does not subtract from the other's available capacity. The competition was an artifact of forcing all tokens through the same dense parameters.
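The per-token allocation described above can be sketched as a small top-k router. This is a minimal illustration and not the paper's implementation: the two-dimensional "tokens", the hand-set `router_weights`, and the toy scaling experts are all assumptions chosen so the routing is easy to follow. In a real model the router is learned and the experts are feedforward networks.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token, router_weights, k=1):
    """Return [(expert_index, gate), ...] for the k highest-scoring experts."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize gates over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(tokens, router_weights, experts, k=1):
    """Apply a sparse MoE layer; also return how many tokens each expert saw."""
    outputs, load = [], [0] * len(experts)
    for token in tokens:
        mixed = [0.0] * len(token)
        for idx, gate in route_top_k(token, router_weights, k):
            load[idx] += 1
            expert_out = experts[idx](token)
            mixed = [m + gate * y for m, y in zip(mixed, expert_out)]
        outputs.append(mixed)
    return outputs, load

# Hypothetical setup: dimension 0 is "vision-like", dimension 1 is "language-like".
router_weights = [
    [2.0, 0.0],  # expert 0 scores vision-like tokens highly
    [0.0, 2.0],  # expert 1 scores language-like tokens highly
    [1.5, 1.5],  # expert 2 scores tokens carrying both signals highly
]
experts = [
    lambda t: [2.0 * x for x in t],  # stand-in for a vision-specialized FFN
    lambda t: [3.0 * x for x in t],  # stand-in for a language-specialized FFN
    lambda t: [-1.0 * x for x in t], # stand-in for a shared FFN
]
vision, language, both = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
tokens = [vision, vision, language, both]

outputs, load = moe_layer(tokens, router_weights, experts, k=1)
print(load)  # tokens handled per expert
```

With top-1 routing, each token touches only one expert's parameters, so the number of vision tokens in a batch changes which experts are busy rather than shrinking a shared dense block that language tokens also depend on.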

The deeper finding is that the "modality tax" is design-induced, not modality-induced. Models that hit the tax are diagnosable: either their training data treats captions as representative of all language (the distribution problem), or their architecture forces dense allocation (the capacity problem). Fix either and the tax shrinks. Fix both and it largely disappears.

This argues for MoE as a structural rather than purely efficiency-driven choice in multimodal foundation models. The efficiency argument (sparse activation reduces inference cost) is well-known; the modality argument is newer. Multimodal MoE is the architecture that enables modalities with fundamentally different scaling behaviors to coexist without competing.
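The efficiency argument can be made concrete with a rough parameter count. The sketch below assumes a plain two-matrix feedforward block (up-projection and down-projection, biases ignored) and a linear router; the layer sizes are hypothetical and not taken from the paper.

```python
def ffn_params(d_model, d_hidden):
    # Two weight matrices: d_model x d_hidden and d_hidden x d_model.
    return 2 * d_model * d_hidden

def moe_ffn_params(d_model, d_hidden, num_experts, top_k):
    """Return (total, active_per_token) parameter counts for one MoE FFN layer."""
    per_expert = ffn_params(d_model, d_hidden)
    router = d_model * num_experts  # one linear scoring row per expert
    total = num_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active

# Hypothetical layer: 8 experts, 2 active per token.
total, active = moe_ffn_params(d_model=1024, d_hidden=4096, num_experts=8, top_k=2)
print(total, active, round(active / total, 3))
```

Under these assumptions each token uses about a quarter of the layer's parameters, while the full set of experts remains available for the router to divide among modalities.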

Original note title

modality competition between vision and language is solvable architecturally — MoE provides capacity flexibility that dense models lack