Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
We propose MODEL SWARMS, a collaborative search algorithm to adapt LLMs via swarm intelligence, the collective behavior guiding individual systems. Specifically, MODEL SWARMS starts with a pool of LLM experts and a utility function. Guided by the best-found checkpoints across models, diverse LLM experts collaboratively move in the weight space and optimize a utility function representing model adaptation objectives. Compared to existing model composition approaches, MODEL SWARMS offers tuning-free model adaptation, works in low data regimes with as few as 200 examples, and does not require assumptions about specific experts in the swarm or how they should be composed. Extensive experiments demonstrate that MODEL SWARMS could flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests, improving over 12 model composition baselines by up to 21.0% across tasks and contexts.
tasks, recent work has increasingly recognized the importance of modularity through multi-LLM collaboration, where diverse models interact and complement each other in various ways (Shen et al., 2024c; Feng et al., 2024a; Chan et al., 2024; Du et al., 2024). For example, mixture-of-experts (MoE) relies on routing queries to various neural sub-components, leveraging the specialized expertise of each sub-component (Masoudnia & Ebrahimpour, 2014; Roller et al., 2021; Pfeiffer et al., 2022; Jiang et al., 2024). Routing to domain-specific experts demonstrates great potential, yet no new model or expert is produced in the MoE process. However, challenging real-world tasks often require flexible composition and adaptation to new domains and/or capabilities that go beyond the scope of any existing expert.
Two lines of work aim to extend multi-LLM collaboration beyond routing to compose and produce new adapted models. 1) Learn-to-fuse designs trainable components to “glue” experts together into a merged model, then fine-tunes the model with supervised objectives to produce compositional experts (Jiang et al., 2023b; Wang et al., 2024b; Bansal et al., 2024). These approaches often rely on large training sets to tune the learnable parts from scratch and hardly offer the modularity of seamlessly adding/removing experts. 2) Model arithmetic composes LLM experts by conducting arithmetic operations on model weights and/or token probabilities (Ilharco et al., 2023; Yu et al., 2024; Yadav et al., 2024; Mavromatis et al., 2024; Liu et al., 2024). These approaches often come with strong assumptions about the available experts and how the desired adaptation should be decomposed (e.g., lion indoors = lion outdoors + (dog indoors - dog outdoors) (Ilharco et al., 2023)). As such, a flexible approach that does not rely on excessive tuning data or strong assumptions about existing models is crucial for adapting diverse LLM experts for wide-ranging purposes.
To this end, we propose MODEL SWARMS, where multiple LLM experts collaboratively search for new adapted models in the weight space.
Specifically, to model the proactive search of LLMs instead of passive merging, each expert particle starts with a location (model weights) and a velocity (direction in the weight space). The velocity is iteratively updated based on inertia (the tendency to keep the current velocity), personal best (the best-found location of a given particle), and global best/worst (the best/worst-found locations among all particles), and each LLM particle then takes a step along its updated velocity direction. These velocity factors enable LLM particles to chart independent search paths while exploring the neighborhoods of personal and global bests. Thanks to this flexible search methodology, MODEL SWARMS does not need any supervised fine-tuning data or prior knowledge about the LLM experts or the utility function, adapting LLM experts solely through collaborative search and movement guided by any model-to-scalar utility function.
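For illustration, below is a minimal NumPy sketch of one search step for a single expert particle; the coefficient names (inertia, c1-c3, step) and the exact functional form are simplified assumptions based on the description above rather than the full update rule.

```python
import numpy as np

def swarm_step(x, v, p_best, g_best, g_worst,
               inertia=0.9, c1=1.0, c2=1.0, c3=1.0, step=0.1, rng=None):
    """One collaborative search step for a single LLM expert particle.
    x: flattened model weights; v: current velocity in the weight space;
    p_best / g_best / g_worst: personal-best, global-best, and global-worst
    locations found so far. Coefficient names and the exact form here are
    a simplified illustration, not the full update rule."""
    rng = rng or np.random.default_rng()
    r1, r2, r3 = rng.random(3)           # random per-step scaling factors
    v_new = (inertia * v
             + c1 * r1 * (p_best - x)    # drawn toward its personal best
             + c2 * r2 * (g_best - x)    # drawn toward the global best
             - c3 * r3 * (g_worst - x))  # pushed away from the global worst
    x_new = x + step * v_new             # take a step along the new velocity
    return x_new, v_new
```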
Human Interest We present the comparison between pre- and post-MODEL SWARMS experts in the 16 human-nominated interest domains in Table 4. Through adaptation with MODEL SWARMS, experts improve 17.6% in LLM-as-a-judge scores and 17.0% in factuality scores on average when discussing the 16 topics and domains. Most importantly, human evaluation reveals that MODEL SWARMS achieves a 70.8% win rate against initial experts on average, including an impressive 96% win rate in the two most successful domains, while still maintaining a 44%:28%:28% (win:tie:loss) split on the unfamiliar and most challenging topics. MODEL SWARMS outputs are thus consistently preferred by both automatic metrics and human users, demonstrating MODEL SWARMS' great potential to produce domain-specialized and community-specific LLM experts.
Correctness Emergence In the collaborative search process, are LLM experts simply transferring existing capabilities from one model to another, or are they discovering new skills and expertise for adaptation? Specifically, there are four correctness levels for a question given the pool of LLM experts: (1) all experts answer incorrectly; (2) less than half answer correctly; (3) more than half answer correctly; and (4) all answer correctly. The correctness level of a question could change between the pre- and post-MODEL SWARMS experts (e.g., (1 → 3) indicates that none of the experts answered correctly initially, but after MODEL SWARMS optimization more than half answered correctly).
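For concreteness, a small sketch of how a question's correctness level and its pre-/post-swarm transition could be computed; the handling of the exactly-half boundary is not specified above and is an assumption here.

```python
def correctness_level(expert_correct):
    """Map per-expert correctness on one question to four levels:
    1 = all wrong, 2 = less than half correct, 3 = more than half correct,
    4 = all correct. (Ties at exactly half are assigned to level 3 here;
    this boundary choice is an assumption.)"""
    n, k = len(expert_correct), sum(expert_correct)
    if k == 0:
        return 1
    if k == n:
        return 4
    return 2 if k < n / 2 else 3

# e.g., a (1 -> 3) transition: no expert correct before MODEL SWARMS,
# more than half correct afterwards.
pre = correctness_level([False, False, False, False])   # -> 1
post = correctness_level([True, True, True, False])     # -> 3
print(f"({pre} -> {post})")
```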
Human interest: in addition to preferences represented by reward models, it is crucial to adapt LLM experts directly to humans: their preferences, personalized needs, and interest domains. Specifically, 13 human annotators nominated 16 interest domains (e.g., electric vehicles and PhD applications), and we then employ GEMINI-PRO to synthesize 25/25 instructions in each domain as validation/test sets. f is defined as the LLM-as-a-judge (Zheng et al., 2023) 1-10 score from Gemini on the validation set, and we evaluate the adaptation to human interest topics on three fronts: improvement in f scores, improvement in factuality with Facts&Evidence (Boonsanong et al., 2024), and human evaluation win rate comparing pre-swarm and post-swarm responses.
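A minimal sketch of how this utility f could be computed for one candidate expert; `expert.generate` and `judge_score` are hypothetical interfaces, and simple averaging over the validation instructions is an assumed aggregation.

```python
def interest_utility(expert, validation_instructions, judge_score):
    """Utility f for human-interest adaptation: the average 1-10
    LLM-as-a-judge score of the expert's responses on the validation
    instructions. `expert.generate` and `judge_score` are hypothetical
    interfaces; averaging is an assumed aggregation."""
    scores = [judge_score(inst, expert.generate(inst))
              for inst in validation_instructions]
    return sum(scores) / len(scores)
```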
What if we need to compose experts fine-tuned from different base architectures? Instead of operating on model weights, the swarm intelligence arithmetic could be seamlessly carried out on token probability distributions, yielding token swarms.
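As a sketch, the same velocity update could be applied to each expert's next-token probability vector at decoding time; clipping and renormalizing the result into a valid distribution are assumptions made for illustration.

```python
import numpy as np

def token_swarm_step(p, v, p_best, g_best, g_worst,
                     inertia=0.9, c1=1.0, c2=1.0, c3=1.0, step=0.1, rng=None):
    """Apply the same swarm velocity update to a next-token probability
    vector instead of model weights, for experts that share a vocabulary
    but not a base architecture. Clipping and renormalizing into a valid
    distribution are assumptions for illustration."""
    rng = rng or np.random.default_rng()
    r1, r2, r3 = rng.random(3)
    v_new = (inertia * v
             + c1 * r1 * (p_best - p)
             + c2 * r2 * (g_best - p)
             - c3 * r3 * (g_worst - p))
    p_new = np.clip(p + step * v_new, 1e-12, None)
    return p_new / p_new.sum(), v_new    # renormalize into a distribution
```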
Evolutionary Algorithms and LLMs MODEL SWARMS is in part inspired by particle swarm optimization (PSO) (Kennedy & Eberhart, 1995), an evolutionary algorithm (EA) for solving optimization problems. This echoes a recent uptake of EAs, especially genetic algorithms (GAs), in ML/LLM research (Zhao et al., 2023; Lange et al., 2023; Wu et al., 2024; Chao et al., 2024; Lange et al., 2024). EvolMerge (Akiba et al., 2024) seeks to compose a math LLM and a Japanese LLM by discovering better weight/layer and data flows guided by genetic algorithms. PromptBreeder (Fernando et al., 2024) seeks to search for specialized LLM prompts by maintaining a prompt population and conducting LLM-based crossover and mutation to produce better prompts, resembling GA processes. EvoPrompt (Guo et al., 2024a) also follows similar concepts of applying GAs to prompt optimization. We see two key differences between MODEL SWARMS and this line of existing research: (1) most methods focus on improvements in prompt/data engineering (Fernando et al., 2024; Guo et al., 2024a), while MODEL SWARMS seeks to adapt LLMs by changing model weights and inducing new expert capabilities (Figure 2), which is more fundamental and offers greater headroom for improvement; (2) existing EA applications mostly employ genetic algorithms that necessitate many hand-crafted rules (Lambora et al., 2019) (e.g., how two prompts/models should crossover to produce new ones, how to mutate), while MODEL SWARMS is inspired by swarm intelligence, which comes with little to no manual engineering in the composition and collaboration of models.
MODEL SWARMS and Multi-Agent Systems The role of all "experts" in MODEL SWARMS is homogeneous, i.e., they pursue the same goal and adapt to the same objective as represented by the utility function f. In multi-agent systems (Rame et al., 2022; Zaman et al., 2023; Ainsworth et al., 2023; Chan et al., 2024; Talebirad & Nadiri, 2023; Chen et al., 2023a; Zhang et al., 2024a; Abdelnabi et al., 2024; Kannan et al., 2023; Zeng et al., 2024; Guo et al., 2024b; Sun et al., 2024; Han et al., 2024; Ishibashi & Nishimura, 2024; Wang et al., 2024d; Zhao et al., 2024; Chen et al., 2024c; Hong et al., 2024; Smit et al., 2024; Chen et al., 2024a;b), the agents often have different roles to jointly complete a task, albeit those roles are largely hand-crafted, especially through prompting. We envision future work adapting MODEL SWARMS to automatically discover heterogeneous and collaborative agents that jointly serve a purpose.
Three Key Strengths of MODEL SWARMS 1) training-free: by training-free we mean that the composition of models in MODEL SWARMS does not require specific training objectives, loss functions, gradient descent, or backpropagation. This alleviates data dependency: using as few as 200 examples, MODEL SWARMS could produce better-adapted experts; for training-based approaches with a typical effective batch size of 64, that amounts to only a bit over 3 batches. 2) automatic discovery or assumption-free: instead of dictating the composition of models in A = B + C - D formulas, MODEL SWARMS automatically discovers better-adapted experts through swarm intelligence without making assumptions about experts and how they should be composed. 3) any adaptation objective: the collaborative search is only guided by a particle-to-scalar utility function f, which could be anything: dataset performance, reward model scores, human interests, and more; see the sketch below.
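For illustration, the only interface MODEL SWARMS requires is a particle-to-scalar function; the concrete utilities sketched below (held-out accuracy, reward-model scores) are illustrative and assume a hypothetical `expert.generate` interface.

```python
from typing import Callable

# MODEL SWARMS only needs a particle-to-scalar utility f: expert -> float.
Utility = Callable[[object], float]

def dataset_accuracy(expert, examples) -> float:
    # f as accuracy on as few as ~200 held-out examples (no gradients needed);
    # expert.generate is a hypothetical interface.
    return sum(expert.generate(x) == y for x, y in examples) / len(examples)

def reward_utility(expert, prompts, reward_model) -> float:
    # f as the average reward-model score of the expert's responses.
    return sum(reward_model(p, expert.generate(p)) for p in prompts) / len(prompts)
```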
Search Dynamics What exactly is happening during a MODEL SWARMS search, and how does expert utility change in the process? We visualize the change of each particle as well as the global best in terms of the utility function f in Figure 15. As experts explore the weight space, their utility scores wax and wane, leading to consistent bumps in the global best score and consequently better-adapted language models.
Prompt Variation We hypothesize that by optimizing the weights, MODEL SWARMS might offer stronger robustness to minor prompt changes. We prompt GEMINI-PRO to "Please paraphrase the question into 10 versions with minor differences.", evaluate models on the 10 versions, and calculate the entropy of response distributions as an indicator of sensitivity. Figure 14 demonstrates that MODEL SWARMS drastically reduces the sensitivity to minor prompt changes, while still being a bit shy of GEMINI-FLASH/PRO levels.
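A sketch of this sensitivity metric: treating the ten responses as a categorical distribution (e.g., over normalized final answers) and computing Shannon entropy; the exact normalization of responses into categories is an assumption here.

```python
import math
from collections import Counter

def response_entropy(responses):
    """Shannon entropy of a model's answers across the 10 paraphrased
    versions of a question; lower entropy indicates less sensitivity to
    minor prompt changes. Treating responses as categorical outcomes
    (e.g., normalized final answers) is an assumption about the metric."""
    counts = Counter(responses)
    total = len(responses)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# e.g., a model that gives the same answer to 9 of 10 paraphrases
print(response_entropy(["A"] * 9 + ["B"]))  # ~0.47 bits
```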