Can we solve modality competition through architectural design?
Does modality competition in multimodal models stem from fundamental training conflicts, or from specific architectural choices? Understanding the root cause could reveal whether the trade-off is solvable.
A common assumption in multimodal pretraining is that vision and language are inherently at odds in a single model: training on one degrades the other. This "modality tax" is treated as a feature of the joint training problem rather than a design artifact. The paper "Beyond Language Modeling" argues that this assumption is wrong and identifies the architectural moves that dissolve the competition.
The paper localizes two sources of modality friction, neither of which is the visual modality itself. The first is distributional: image-text captions are a peculiar subset of language, and friction with general language pretraining comes from the caption distribution shift rather than from vision. Pure video is in fact complementary to language; general multimodal pretraining yields positive transfer for visual question-answering and world modeling. The second source is architectural: dense models rigidly allocate fixed capacity across modalities, and modality-specific feedforward networks only partially address this rigidity.
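To make the architectural source concrete, here is a minimal NumPy sketch contrasting a dense feedforward layer with a modality-specific one. It is an illustration under stated assumptions, not the paper's implementation: the layer sizes, the ReLU MLP, and all function names (`dense_layer`, `modality_specific_layer`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_HIDDEN = 8, 16  # hypothetical sizes, chosen only for illustration


def make_ffn():
    """Return random weights for a two-layer feedforward block."""
    return (rng.normal(0, 0.1, (D_MODEL, D_HIDDEN)),
            rng.normal(0, 0.1, (D_HIDDEN, D_MODEL)))


def ffn(x, weights):
    """Apply a ReLU MLP to a batch of token vectors."""
    w_in, w_out = weights
    return np.maximum(x @ w_in, 0.0) @ w_out


def dense_layer(tokens, shared):
    # Every token, whatever its modality, passes through the same parameters.
    return ffn(tokens, shared)


def modality_specific_layer(tokens, is_vision, text_ffn, vision_ffn):
    # Hard routing by modality label: the capacity split is fixed by hand,
    # so neither side can borrow capacity from the other.
    out = np.empty_like(tokens)
    out[is_vision] = ffn(tokens[is_vision], vision_ffn)
    out[~is_vision] = ffn(tokens[~is_vision], text_ffn)
    return out


tokens = rng.normal(size=(6, D_MODEL))
is_vision = np.array([True, True, False, False, False, True])
shared, text_ffn, vision_ffn = make_ffn(), make_ffn(), make_ffn()

dense_out = dense_layer(tokens, shared)
split_out = modality_specific_layer(tokens, is_vision, text_ffn, vision_ffn)
```

In both variants the allocation is decided before training: either everything is shared, or the split is fixed by the modality label. That fixed allocation is the rigidity the note describes.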
Mixture of Experts (MoE) resolves the architectural source. By learning to allocate capacity per token, MoE removes the rigid trade-off that dense models impose. Vision tokens activate vision-specialized experts; language tokens activate language-specialized experts; some experts learn to handle both. The capacity that any one modality consumes does not subtract from the other's available capacity. The competition was an artifact of forcing all tokens through the same dense parameters.
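Per-token capacity allocation can be sketched as follows. The note does not specify the paper's routing scheme, so this assumes a common top-k softmax router. The sizes, the expert count, and the `moe_layer` name are all hypothetical, and the router is untrained, so the expert choices shown are arbitrary rather than modality-specialized.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def moe_layer(tokens, router_w, experts, top_k=2):
    """Route each token to its top_k experts and mix their outputs."""
    logits = tokens @ router_w                          # (n_tokens, n_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]   # chosen experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits, axis=-1)                # renormalize over chosen experts
    out = np.zeros_like(tokens)
    for e, (w_in, w_out) in enumerate(experts):
        token_ids, slot = np.nonzero(top_idx == e)      # tokens that picked expert e
        if token_ids.size == 0:
            continue                                    # unused expert costs no compute
        h = np.maximum(tokens[token_ids] @ w_in, 0.0) @ w_out
        out[token_ids] += gates[token_ids, slot][:, None] * h
    return out, top_idx


rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 16, 4  # hypothetical sizes
router_w = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(0, 0.1, (d_model, d_hidden)),
            rng.normal(0, 0.1, (d_hidden, d_model)))
           for _ in range(n_experts)]
tokens = rng.normal(size=(6, d_model))  # stand-in for mixed vision/language tokens

out, chosen = moe_layer(tokens, router_w, experts, top_k=2)
```

The router never sees a modality label. Any specialization into vision experts, language experts, or shared experts would have to emerge from training the router weights, which is what distinguishes this from a hand-fixed split.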
The deeper finding is that the "modality tax" is design-induced, not modality-induced. Models that hit the tax are diagnosable: their design either treats captions as representative of all language (the distribution problem) or forces dense allocation (the capacity problem). Fix either and the tax shrinks. Fix both and it largely disappears.
This argues for MoE as a structural choice in multimodal foundation models, not purely an efficiency-driven one. The efficiency argument (sparse activation reduces inference cost) is well known; the modality argument is newer. Multimodal MoE is the architecture that lets modalities with fundamentally different scaling behaviors coexist without competing.
Related concepts in this collection
- Are text-only language models fundamentally limited by abstraction?
  Explores whether text's compression of physics, geometry, and causality into symbols creates an irreducible ceiling for language-only AI, and whether multimodal approaches can overcome this structural constraint.
  Relation: same paper, the limit that makes multimodal pretraining necessary
- Why do vision and language scale so differently?
  IsoFLOP analysis reveals that vision and language follow distinct scaling curves: vision demands far more training data than language at equivalent compute budgets. Understanding this asymmetry matters for designing multimodal architectures that serve both modalities well.
  Relation: same paper, the scaling-law consequence of the architectural choice
- Can models dynamically activate expert skills at inference time?
  Can language models efficiently discover and compose task-specific capabilities on the fly without modifying base weights? This explores whether test-time adaptation through expert vector composition outperforms fixed fine-tuning approaches.
  Relation: adjacent, another expert-composition approach for capacity flexibility
Original note title
modality competition between vision and language is solvable architecturally — MoE provides capacity flexibility that dense models lack