How does mixture of experts enable flexible capacity sharing between modalities?
This explores how Mixture of Experts (MoE) lets vision and language share a model's capacity flexibly — giving each one what it needs token-by-token instead of forcing them to compete for the same fixed weights.
The cleanest answer in the corpus reframes the whole problem: when a multimodal model trains poorly, it's tempting to assume vision and language are fundamentally incompatible. They aren't. The competition is *architectural*: it comes from a rigid dense layer that hands every token the same slice of capacity, plus the distributional gap between captions and ordinary text. MoE dissolves the bottleneck by routing each token to specialized experts, so capacity is allocated per token rather than split evenly in advance. Modalities stop fighting over one shared budget and instead draw on the experts that suit them (see "Can we solve modality competition through architectural design?").
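A minimal sketch of the per-token routing described above, assuming a hand-set router matrix and toy experts. `ROUTER`, `EXPERTS`, `IMAGE_TOKEN`, and `TEXT_TOKEN` are hypothetical stand-ins; a real MoE layer learns its router and usually adds a load-balancing term, which is omitted here. The point is only that the share of expert capacity each modality uses follows the token mix of the input rather than a fixed split.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(token, router, k=1):
    """Pick the top-k experts for one token and renormalise their gates."""
    # One logit per expert: dot product of the token with that expert's router row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router, experts, k=1):
    """Gate-weighted sum of the selected experts' outputs for one token."""
    out = [0.0] * len(token)
    for idx, gate in route(token, router, k):
        expert_out = experts[idx](token)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out

def expert_load(tokens, router, k=1):
    """Count how many tokens each expert receives."""
    load = [0] * len(router)
    for token in tokens:
        for idx, _ in route(token, router, k):
            load[idx] += 1
    return load

# Hypothetical hand-set router: rows 0-1 respond to the first feature,
# rows 2-3 to the second. A trained router would learn these weights.
ROUTER = [[2.0, 0.0], [1.5, 0.0], [0.0, 2.0], [0.0, 1.5]]
# Toy experts: each one just scales its input by a different factor.
EXPERTS = [lambda t, s=s: [s * v for v in t] for s in (1.0, 2.0, 3.0, 4.0)]

IMAGE_TOKEN = [1.0, 0.1]  # stand-in for a vision token
TEXT_TOKEN = [0.1, 1.0]   # stand-in for a language token

image_heavy = [IMAGE_TOKEN] * 3 + [TEXT_TOKEN]
text_heavy = [IMAGE_TOKEN] + [TEXT_TOKEN] * 3

print(expert_load(image_heavy, ROUTER, k=1))  # [3, 0, 1, 0]
print(expert_load(text_heavy, ROUTER, k=1))   # [1, 0, 3, 0]
```

The same four experts serve both batches, but the load shifts with the input: the image-heavy batch sends three tokens to an image-leaning expert, the text-heavy batch sends three to a text-leaning one. A dense layer would apply identical weights to every token in either case.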
The deeper idea worth carrying away is that MoE isn't really about modalities at all. It's about *conditional computation*: deciding which parameters to wake up based on the input. Once you see it that way, the same mechanism shows up solving adjacent problems. One line of work pairs MoE routing with a cheap O(1) memory lookup and finds the two are complementary axes of sparsity: a balanced mix of "look it up" and "compute it" beats pouring all your parameters into either one, following a U-shaped scaling curve (see "Can lookup memory and computation work together better than either alone?"). That's the same flexibility-of-allocation principle, applied to memory versus computation rather than vision versus language.
The "experts" framing also turns out to be far more fluid than fixed routing tables suggest. Instead of training experts once and freezing the router, models can compose expert skills *at inference time*: one method tunes only the singular values inside weight matrices to produce expert vectors that mix dynamically without interfering with each other, beating heavier adapters with fewer parameters (see "Can models dynamically activate expert skills at inference time?"). Push further and you don't even need gradients: swarms of model "particles" can search weight space collaboratively and discover composed experts that answer questions every starting expert got wrong (see "Can language models discover new expertise through collaborative weight search?"). The thread connecting all of these is that capability can be *assembled on demand* rather than baked in.
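The singular-value idea can be illustrated with a toy sketch, under several assumptions. The 2×2 weight matrix is built from known factors `U`, `SIGMA`, and `VT` so that no SVD routine is needed (a real model would decompose its pretrained weights). The expert vectors `Z_MATH` and `Z_CODE`, the helper names, and the choice to blend experts by a weighted sum are all hypothetical, not taken from the method's own code.

```python
import math

def rotation(theta):
    """2x2 rotation matrix, used here as a ready-made orthogonal factor."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def adapt(u, sigma, vt, z):
    """Rebuild W' = U diag(sigma * z) V^T; only the singular values change."""
    scaled = [s * zi for s, zi in zip(sigma, z)]
    return matmul(matmul(u, diag(scaled)), vt)

def mix(expert_vectors, weights):
    """Blend expert vectors with per-request weights (assumed weighted sum)."""
    size = len(expert_vectors[0])
    return [sum(w * z[i] for w, z in zip(weights, expert_vectors))
            for i in range(size)]

def frobenius_sq(m):
    return sum(v * v for row in m for v in row)

# Assumed factors of a "pretrained" 2x2 weight matrix W = U diag(SIGMA) V^T.
U = rotation(math.pi / 6)
VT = transpose(rotation(math.pi / 3))
SIGMA = [3.0, 1.0]
W = adapt(U, SIGMA, VT, [1.0, 1.0])  # z = all ones leaves W unchanged

# Hypothetical expert vectors: one scale factor per singular value.
Z_MATH = [1.2, 0.5]
Z_CODE = [0.8, 1.5]

w_math = adapt(U, SIGMA, VT, Z_MATH)
z_blend = mix([Z_MATH, Z_CODE], [0.5, 0.5])
w_blend = adapt(U, SIGMA, VT, z_blend)
```

Each expert here is two numbers rather than a full copy of the four-entry matrix, and switching or blending experts at inference time is a cheap rescaling of `SIGMA` with `U` and `VT` left untouched. In this contrived example the two vectors average to all ones, so `w_blend` reproduces `W`.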
There's a useful tension lurking here, though. Specialization isn't free. Pushing a model toward narrow domain expertise tends to erode its general reasoning: supervised fine-tuning can buy domain accuracy while quietly degrading reasoning quality, and every technique seems to have a sweet spot past which it backfires (see "How do you add domain expertise without losing general reasoning?"). MoE's appeal is that it sidesteps this by keeping experts *alongside* each other instead of overwriting shared weights, so capacity is shared, not sacrificed. But the corpus also hints that more experts aren't automatically better: diversity only pays off when each contributor brings real competence; otherwise you get interference instead of insight (see "Does cognitive diversity alone improve multi-agent ideation quality?").
So the surprising takeaway: "flexible capacity sharing between modalities" is one instance of a much broader bet that runs through this collection, namely that the future of capable models is *routing and composition* over monolithic dense capacity. Whether the experts are modality-specific layers, inference-time skill vectors, or even whole models voting toward a consensus that outperforms any individual (see "Can models trained on many imperfect experts outperform each one?"), the recurring move is the same: decide what to activate based on the input, and let specialization coexist instead of compete.
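The claim that a consensus can outperform every individual has a simple worked form. The sketch below assumes an explicit majority vote over a binary decision, with independent errors and equal per-expert accuracy `p`. The corpus describes an implicit vote that emerges from training, not this explicit one, and the function name is hypothetical.

```python
from math import comb

def majority_accuracy(n, p):
    """Chance that a strict majority of n independent voters is right,
    when each voter is right with probability p (use odd n to avoid ties)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

for n in (1, 3, 5):
    print(n, round(majority_accuracy(n, 0.6), 5))  # 0.6, 0.648, 0.68256

# Below-chance voters make the consensus worse than any one of them.
print(round(majority_accuracy(3, 0.4), 3))  # 0.352
```

With individually modest but better-than-chance experts, the vote beats each member because uncorrelated errors cancel. The last line shows the flip side noted earlier: when contributors lack competence, combining them amplifies the error instead.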
Sources (7 notes)
Modality competition arises from caption distributional shift and rigid dense capacity allocation, not from vision and language being fundamentally incompatible. Mixture of Experts resolves the architectural bottleneck by allocating capacity per token, enabling modalities to coexist without competing.
Engram combines O(1) N-gram lookup with Mixture-of-Experts routing, revealing a U-shaped scaling law where balanced allocation to both mechanisms outperforms either alone. Gains appear largest in reasoning and code rather than pure retrieval.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
PSO-inspired swarms of LLM particles moving through weight space discover composed experts with new capabilities—including answering questions all initial experts failed on—using only 200 validation examples and no gradient-based training.
SFT raises domain accuracy but reduces reasoning quality (a 38% InfoGain loss). RL improves domain reasoning by pruning rather than adding capability. Every technique has a domain-specific sweet spot beyond which performance degrades.
Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.
Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.