Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say

Paper · arXiv 2509.21164 · Published September 25, 2025
Deep Research · Cognitive Models · Latent · Novel Architectures

Open-source Large Language Models (LLMs) increasingly specialize by domain (e.g., math, code, general reasoning), motivating systems that leverage complementary strengths across models. Prior multi-LLM approaches (i) route a query to one or a few experts that then generate independently, (ii) aggregate outputs from each model via costly multi-turn exchanges, or (iii) fuse weights into a single model, typically requiring architectural homogeneity. We introduce Mixture of Thoughts (MoT), a simple method for latent-level collaboration among heterogeneous experts under a global routing scheme. For each query, a lightweight router selects the top-K experts and designates a primary expert; uniformly placed interaction layers project hidden states into a shared latent space, where the primary expert performs cross-attention over its active (selected) peers. Pre-trained experts remain frozen; only the router and the lightweight interaction layers are trained, with a novel joint training objective that improves both expert selection and inter-expert collaboration.
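As a rough illustration of the routing step described in the abstract, the sketch below scores experts for a pooled query embedding, keeps the top-K, and treats the highest-scoring selected expert as the primary. This is a minimal sketch under my own assumptions; the module name `GlobalRouter`, the single linear scoring head, and the pooled-embedding input are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalRouter(nn.Module):
    """Hypothetical lightweight router: scores every expert for a query,
    keeps the top-K, and designates the highest-scoring one as primary."""
    def __init__(self, query_dim: int, num_experts: int):
        super().__init__()
        self.scorer = nn.Linear(query_dim, num_experts)

    def forward(self, query_embedding: torch.Tensor, k: int):
        # query_embedding: (batch, query_dim), e.g. a mean-pooled query representation
        scores = self.scorer(query_embedding)           # (batch, num_experts)
        probs = F.softmax(scores, dim=-1)               # routing distribution
        topk_probs, topk_idx = probs.topk(k, dim=-1)    # active (selected) experts
        primary_idx = topk_idx[:, 0]                    # highest-scoring expert = primary
        return topk_idx, topk_probs, primary_idx

# usage sketch
router = GlobalRouter(query_dim=768, num_experts=4)
q = torch.randn(2, 768)
active, weights, primary = router(q, k=2)
```

Per the paper, this router is not trained in isolation: it is optimized jointly with the interaction layers so that expert selection and inter-expert collaboration improve together.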

Prior approaches thus restrict collaboration to outputs, weight fusion, or naive ensembling. To address this gap, we propose Mixture of Thoughts (MoT), a method for latent-level collaboration among heterogeneous experts under a global router. For each query, a lightweight trainable router selects the top-K experts and designates a primary expert. Within each expert, we uniformly place interaction layers composed of lightweight adapters (projections) and a cross-attention module. At these layers, all active experts' hidden states are projected into a shared latent space and integrated into the primary expert via cross-attention, enabling fine-grained collaboration while keeping each expert's backbone frozen. A joint training objective optimizes both the router and the interaction layers. MoT performs a single forward pass through the selected experts, achieving routing-like efficiency with richer representational capacity than single- and multi-model routing, response-level aggregation, or parameter fusion, while supporting fully heterogeneous experts with dynamic, per-query adaptive collaboration (Fig. 2).
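The interaction-layer mechanism described above can be sketched roughly as follows, assuming one pair of linear adapters per expert into a shared latent width and a standard multi-head cross-attention in which the primary expert's projected states act as queries and its active peers' projected states act as keys and values. All names, dimensions, and the residual update are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class InteractionLayer(nn.Module):
    """Illustrative interaction layer: project each active expert's hidden
    states into a shared latent space, let the primary expert cross-attend
    over its selected peers, and write the result back in the primary's width."""
    def __init__(self, expert_dims, shared_dim=1024, num_heads=8):
        super().__init__()
        # lightweight adapters: one in/out projection per (possibly heterogeneous) expert
        self.to_shared = nn.ModuleList([nn.Linear(d, shared_dim) for d in expert_dims])
        self.from_shared = nn.ModuleList([nn.Linear(shared_dim, d) for d in expert_dims])
        self.cross_attn = nn.MultiheadAttention(shared_dim, num_heads, batch_first=True)

    def forward(self, hidden_states, primary, active):
        # hidden_states[i]: (batch, seq_len_i, expert_dims[i]) from frozen expert i
        shared = {i: self.to_shared[i](hidden_states[i]) for i in active}
        peers = [shared[i] for i in active if i != primary]
        if not peers:                                   # K = 1: nothing to attend over
            return hidden_states[primary]
        peer_ctx = torch.cat(peers, dim=1)              # peer tokens concatenated along sequence
        # primary's projected states query the peers' projected states
        fused, _ = self.cross_attn(shared[primary], peer_ctx, peer_ctx)
        # residual update; the frozen primary backbone's states are preserved
        return hidden_states[primary] + self.from_shared[primary](fused)
```

In this sketch only the adapters and the cross-attention module carry trainable parameters, consistent with the paper's constraint that the expert backbones remain frozen while the router and interaction layers are trained.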