Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models

Paper · arXiv 2305.14705 · Published May 24, 2023
Tags: Training · Fine-Tuning

“Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models.”

“While the benefits of Large Language Models (LLMs) are indisputable, their rapidly growing size and computational requirements pose significant challenges in terms of training efficiency, memory footprint, and deployment costs. Consequently, there is a pressing need for developing scalable techniques that can harness the power of these models without incurring prohibitive computational overheads.

On the other hand, models with sparsely activated Mixture-of-Experts (MoE) layers significantly reduce the computational cost of LLMs. MoE models build on the observation that language models can be decomposed into smaller, specialized sub-models, or "experts", that focus on distinct aspects of the input data, thereby enabling more efficient computation and resource allocation. However, we show that conventional task-specific finetuning of MoE models leads to suboptimal performance, often even worse than finetuning dense models with the same computational cost. One possible reason is the discrepancy between general pretraining and task-specific finetuning.”
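To make the "sparsely activated experts" idea concrete, below is a minimal sketch of a sparse MoE layer with top-k token routing. This is an illustrative NumPy toy, not the paper's implementation; the expert sizes, the value of k, and the softmax gating scheme are assumptions chosen for clarity. The key point it demonstrates is that only k of the experts' feed-forward blocks run for any given token, which is why MoE adds learnable parameters without a proportional increase in inference compute.

```python
# Minimal sketch of a sparse Mixture-of-Experts layer with top-k routing.
# Illustrative only; dimensions, k, and gating are assumptions, not the
# configuration used in the paper.
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, D_FF, NUM_EXPERTS, TOP_K = 16, 32, 4, 2

# Each "expert" is a small feed-forward sub-network (two weight matrices).
experts = [
    (rng.normal(scale=0.02, size=(D_MODEL, D_FF)),
     rng.normal(scale=0.02, size=(D_FF, D_MODEL)))
    for _ in range(NUM_EXPERTS)
]
# The router (gate) scores every expert for every token.
w_gate = rng.normal(scale=0.02, size=(D_MODEL, NUM_EXPERTS))


def moe_layer(tokens: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs.

    Only TOP_K of the NUM_EXPERTS feed-forward blocks run per token, so
    total parameters grow with NUM_EXPERTS while per-token compute does not.
    """
    logits = tokens @ w_gate                              # (n_tokens, NUM_EXPERTS)
    top_k_idx = np.argsort(logits, axis=-1)[:, -TOP_K:]   # chosen experts per token

    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        chosen = top_k_idx[t]
        # Softmax over the selected experts' scores only (renormalized gate).
        scores = np.exp(logits[t, chosen] - logits[t, chosen].max())
        weights = scores / scores.sum()
        for w, e in zip(weights, chosen):
            w_in, w_out = experts[e]
            hidden = np.maximum(tokens[t] @ w_in, 0.0)    # ReLU feed-forward expert
            out[t] += w * (hidden @ w_out)
    return out


tokens = rng.normal(size=(8, D_MODEL))   # 8 toy "tokens"
print(moe_layer(tokens).shape)           # (8, 16)
```

In this sketch, doubling NUM_EXPERTS doubles the layer's parameter count but leaves the per-token cost roughly unchanged, since each token still activates only TOP_K experts.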