[Paper translation] Model Swarms: Collaborative Search to Adapt LLM EXPERTS VIA SWARM INTELLIG

Original paper: arxiv.org/pdf/2410.11…

Model Swarms: Collaborative Search to Adapt LLM EXPERTS VIA SWARM INTELLIGENCE

Shangbin Feng ${}^{1 * }$ Zifeng Wang ${}^{2}\;$ Yike Wang ${}^{1}\;$ Sayna Ebrahimi ${}^{3}$

Hamid Palangi ${}^{2}\;$ Lesly Miculicich ${}^{2}\;$ Achin Kulshrestha ${}^{4}\;$ Nathalie Rauschmayr ${}^{4}$ Yejin Choi ${}^{1}\;$ Yulia Tsvetkov ${}^{1}\;$ Chen-Yu Lee ${}^{2}\;$ Tomas Pfister ${}^{2}$

${}^{1}$ University of Washington ${}^{2}$ Google Cloud AI Research ${}^{3}$ Google DeepMind ${}^{4}$ Google

ABSTRACT

We propose MODEL SWARMS, a collaborative search algorithm to adapt LLMs via swarm intelligence, the collective behavior guiding individual systems. Specifically, MODEL SWARMS starts with a pool of LLM experts and a utility function. Guided by the best-found checkpoints across models, diverse LLM experts collaboratively move in the weight space and optimize a utility function representing model adaptation objectives. Compared to existing model composition approaches, MODEL SWARMS offers tuning-free model adaptation, works in low-data regimes with as few as 200 examples, and does not require assumptions about specific experts in the swarm or how they should be composed. Extensive experiments demonstrate that MODEL SWARMS could flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests, improving over 12 model composition baselines by up to ${21.0}\%$ across tasks and contexts. Further analysis reveals that LLM experts discover previously unseen capabilities in initial checkpoints and that MODEL SWARMS enables the weak-to-strong transition of experts through the collaborative search process.

1 INTRODUCTION

1 引言

Advancing beyond efforts to train a single, universal large language model (LLM) (Brown et al., 2020; Gemini Team et al., 2023) that shares parameters across all languages and tasks, recent work has increasingly recognized the importance of modularity through multi-LLM collaboration, where diverse models interact and complement each other in various ways (Shen et al., 2024c; Feng et al., 2024a; Chan et al., 2024; Du et al., 2024). For example, mixture-of-experts (MoE) relies on the routing of queries to various neural sub-components, leveraging the specialized expertise of one model (Masoudnia & Ebrahimpour, 2014; Roller et al., 2021; Pfeiffer et al., 2022; Jiang et al., 2024). Routing to domain-specific experts demonstrates great potential, while no new model/expert is produced in the MoE process. However, challenging real-world tasks often require flexible composition and adaptation to new domains and/or capabilities that go beyond the scope of an existing expert.

Two lines of work aim to extend multi-LLM collaboration beyond routing to compose and produce new adapted models. 1) Learn-to-fuse designs trainable components to "glue" experts together into a merged model, then fine-tunes the model with supervised objectives to produce compositional experts (Jiang et al., 2023b; Wang et al., 2024b; Bansal et al., 2024). These approaches often rely on large training sets to tune the learnable parts from scratch and hardly offer the modularity of seamlessly adding/removing experts. 2) Model arithmetic composes LLM experts by conducting arithmetic operations on model weights and/or token probabilities (Ilharco et al., 2023; Yu et al., 2024; Yadav et al., 2024; Mavromatis et al., 2024; Liu et al., 2024). These approaches often come with strong assumptions about the available experts and how the desired adaptation should be decomposed (e.g., lion indoors = lion outdoors + (dog indoors - dog outdoors) (Ilharco et al., 2023)). As such, a flexible approach that does not rely on excessive tuning data or strong assumptions about existing models is crucial for adapting diverse LLM experts for wide-ranging purposes.

To this end, we propose MODEL SWARMS, where multiple LLM experts collaboratively search for new adapted models in the weight space. Inspired by Particle Swarm Optimization (PSO) (Kennedy & Eberhart, 1995), MODEL SWARMS views each LLM expert as a "particle" and defines LLM adaptation as the collaborative movement of particles governed by a utility function representing an adaptation objective. Specifically, to model the proactive search of LLMs instead of passive merging, each expert particle starts with a location (model weights) and a velocity (direction in the weight space). The velocity is iteratively impacted by inertia (the tendency to keep current velocity), personal best (the best-found location of a given particle), and global best/worst (the best/worst-found location among all particles), and each LLM particle then takes a step in the updated velocity direction. These velocity factors enable LLM particles to chart an independent search path and explore the personal/global best neighborhoods. Thanks to the flexible search methodology, MODEL SWARMS does not need any supervised fine-tuning data or pre-existing knowledge about the LLM experts or the utility function, adapting LLM experts solely through collaborative search and movement guided by any model-to-scalar utility function.

Figure 1: We propose MODEL SWARMS, a collaborative search algorithm to adapt LLM experts via swarm intelligence. Guided by personal best ${\mathbf{p}}_{i}$, global best $\mathbf{g}$, and global worst ${\mathbf{g}}_{w}$, each LLM expert updates its velocity $\mathbf{v}$ and location $\mathbf{x}$ to explore the weight space and optimize a utility function $f$. The best-found expert (global best $\mathbf{g}$) in the end is retained as the output.

*Work done as a student researcher at Google Cloud AI Research. Correspondence to: Shangbin Feng (shang-bin@cs.washington.edu), Zifeng Wang (zifengw@google.com), and Chen-Yu Lee (chenyulee@google.com).

MODEL SWARMS achieves superior performance across four distinct LLM adaptation objectives:

Single task: Optimizing over as few as 200 instances, MODEL SWARMS outperforms 12 model composition baselines by ${13.3}\%$ across 9 datasets spanning knowledge, reasoning, and safety.
Multi-task domain: Jointly optimizing multiple tasks in medical, legal, scientific, and cultural domains, MODEL SWARMS often produces experts that are Pareto-optimal compared with those obtained by optimizing a single task.
Reward model: Optimizing reward model scores of general and conflicting preferences, MODEL SWARMS offers steerable experts that outperform baselines by up to 14.6% in controllability.
Human interest: On 16 topics evaluated by humans (e.g., electric vehicles and PhD applications), MODEL SWARMS produces experts on par with or better than existing models in 85% of cases.

Empirical analyses reveal that the diversity of starting experts is crucial, models display emerging capabilities not seen in initial checkpoints, and surprisingly, the best ending particle often did not start as the best. MODEL SWARMS could be accelerated with dropout-like strategies and seamlessly extended to token probability arithmetic for experts with different model architectures. We envision MODEL SWARMS as a versatile framework to reimagine the potential of diverse open models.

2 METHODOLOGY

We propose MODEL SWARMS, a collaborative search algorithm to adapt LLM experts via swarm intelligence. We present an overview of MODEL SWARMS in Figure 1 and Algorithm 1.

MODEL SWARMS assumes access to various LLM experts ${\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n}$, which could be full models or LoRA adapters (Hu et al., 2022) fine-tuned on diverse tasks and domains and publicly available on model-sharing platforms (Wolf et al., 2019). It also requires a utility function $f : \mathbf{x} \rightarrow \mathcal{R}$, mapping each expert onto a scalar value that should be optimized for model adaptation. Utility functions could be dataset performance, reward model scores, or human preferences (Section 3).

Inspired by Particle Swarm Optimization (Kennedy & Eberhart, 1995) and evolutionary algorithms in general (Bäck & Schwefel, 1993), MODEL SWARMS employs several terminologies:

Each LLM expert, or "particle" in the model swarm, has a location represented by model weights;
Each particle has a velocity, a direction in the model weight space in which it should move next;
Personal best ${\mathbf{p}}_{i}$: the best-found location of ${\mathbf{x}}_{i}$ based on utility function $f$ in its search history;
Global best and worst $\mathbf{g}$ and ${\mathbf{g}}_{w}$: the best/worst location in all of ${\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n}$'s search history.

Algorithm 1: Model Swarms

Input: LLM experts ${\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n}$, utility function $f : \mathbf{x} \rightarrow \mathcal{R}$
Hyperparameters: swarm size $N$, step length $\lambda$, step length schedule ${\phi }_{\lambda }$, inertia ${\phi }_{v}$, cognitive coefficient ${\phi }_{p}$, social coefficient ${\phi }_{g}$, repel coefficient ${\phi }_{w}$, patience $c$, restart patience ${c}_{r}$, max iteration $\mathcal{K}$
// initialize search
pairwise interpolation to populate initial experts: ${\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{N} = \operatorname{populate}\left( {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n}\right)$, $N > n$
initialize global best checkpoint $\mathbf{g} \leftarrow \varnothing$, global worst checkpoint ${\mathbf{g}}_{w} \leftarrow \varnothing$
for $i = 1$ to $N$ do
  initialize personal best ${\mathbf{p}}_{i} \leftarrow {\mathbf{x}}_{i}$, velocity ${\mathbf{v}}_{i} \leftarrow \operatorname{random}\left( {\left\{ {\mathbf{x}}_{j}\right\} }_{j = 1}^{N}\right) - {\mathbf{x}}_{i}$
  if $f\left( {\mathbf{x}}_{i}\right) > f\left( \mathbf{g}\right)$, $\mathbf{g} \leftarrow {\mathbf{x}}_{i}$; if $f\left( {\mathbf{x}}_{i}\right) < f\left( {\mathbf{g}}_{w}\right)$, ${\mathbf{g}}_{w} \leftarrow {\mathbf{x}}_{i}$
end
// search!
for $k = 1$ to $\mathcal{K}$ do
  if $\mathbf{g}$ did not change in the last $c$ iterations then break
  for $i = 1$ to $N$ parallel${}^{ \dagger }$ do
    randomness factors ${r}_{v},{r}_{p},{r}_{g},{r}_{w} \sim \mathcal{U}\left( {0,1}\right)$
    update velocity ${\mathbf{v}}_{i} \leftarrow \frac{1}{\mathcal{C}}\left\lbrack {{r}_{v}{\phi }_{v}{\mathbf{v}}_{i} + {r}_{p}{\phi }_{p}\left( {{\mathbf{p}}_{i} - {\mathbf{x}}_{i}}\right) + {r}_{g}{\phi }_{g}\left( {\mathbf{g} - {\mathbf{x}}_{i}}\right) - {r}_{w}{\phi }_{w}\left( {{\mathbf{g}}_{w} - {\mathbf{x}}_{i}}\right) }\right\rbrack$, where normalization term $\mathcal{C} = {r}_{v}{\phi }_{v} + {r}_{p}{\phi }_{p} + {r}_{g}{\phi }_{g} + {r}_{w}{\phi }_{w}$
    update location ${\mathbf{x}}_{i} \leftarrow {\mathbf{x}}_{i} + \lambda {\mathbf{v}}_{i}$
    if $f\left( {\mathbf{x}}_{i}\right) > f\left( \mathbf{g}\right)$, $\mathbf{g} \leftarrow {\mathbf{x}}_{i}$; if $f\left( {\mathbf{x}}_{i}\right) < f\left( {\mathbf{g}}_{w}\right)$, ${\mathbf{g}}_{w} \leftarrow {\mathbf{x}}_{i}$; if $f\left( {\mathbf{x}}_{i}\right) > f\left( {\mathbf{p}}_{i}\right)$, ${\mathbf{p}}_{i} \leftarrow {\mathbf{x}}_{i}$
    if $f\left( {\mathbf{p}}_{i}\right)$ didn't change in ${c}_{r}$ iterations, ${\mathbf{x}}_{i} \leftarrow {\mathbf{p}}_{i}$ and ${\mathbf{v}}_{i} \leftarrow \mathbf{0}$
  end
  step length scheduling $\lambda \leftarrow \lambda \times {\phi }_{\lambda }$
end
return $\mathbf{g}$

The location and velocity of particles enable the proactive search of LLM experts instead of passive merging, while the personal/global best checkpoints help keep track of good locations and neighborhoods in the weight space to further explore.

Step 0. Initialize. To expand the pool of starting experts/particles ${\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n}$, MODEL SWARMS employs pairwise crossover with linear interpolation. Concretely, we randomly select two experts ${\mathbf{x}}_{a}$ and ${\mathbf{x}}_{b}$ from ${\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n}$ and sample $t \sim \mathcal{U}\left( {0,1}\right)$; a new starting particle is then obtained by ${\mathbf{x}}_{\text{new }} = t{\mathbf{x}}_{a} + \left( {1 - t}\right) {\mathbf{x}}_{b}$. This process is repeated $N - n$ times to expand ${\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n}$ into ${\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{N}$. Expanding the starting particles allows for more trial-and-error bandwidth in the search process.

For each particle ${\mathbf{x}}_{i}$, we initialize its velocity as pointing to a random particle: ${\mathbf{v}}_{i} = \operatorname{random}\left( {\left\{ {\mathbf{x}}_{j}\right\} }_{j = 1}^{N}\right) - {\mathbf{x}}_{i}$.${}^{ * }$ We initialize its personal best as its current location ${\mathbf{p}}_{i} = {\mathbf{x}}_{i}$ and determine the global best/worst as $\mathbf{g} = \arg \mathop{\max }\limits_{\mathbf{x}}f\left( \mathbf{x}\right)$ and ${\mathbf{g}}_{w} = \arg \mathop{\min }\limits_{\mathbf{x}}f\left( \mathbf{x}\right)$, $\mathbf{x} \in {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n}$.
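The initialization step can be illustrated with a minimal sketch. It assumes each expert's weights are flattened into a plain list of floats; the function names `populate` and `init_velocities`, the toy two-dimensional "experts", and the choice to sample two distinct experts per interpolation are illustrative assumptions, not details confirmed by the paper.

```python
import random

def populate(experts, swarm_size, rng):
    """Expand n experts into swarm_size particles by pairwise linear interpolation."""
    particles = [list(x) for x in experts]
    originals = list(particles)  # interpolate only between the original n experts
    while len(particles) < swarm_size:
        x_a, x_b = rng.sample(originals, 2)  # assumption: two distinct experts
        t = rng.uniform(0.0, 1.0)
        particles.append([t * a + (1.0 - t) * b for a, b in zip(x_a, x_b)])
    return particles

def init_velocities(particles, rng):
    """Point each particle at a randomly chosen particle: v_i = x_j - x_i."""
    velocities = []
    for x_i in particles:
        x_j = rng.choice(particles)
        velocities.append([b - a for a, b in zip(x_i, x_j)])
    return velocities

rng = random.Random(0)
experts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # n = 3 toy "experts"
swarm = populate(experts, swarm_size=6, rng=rng)  # N = 6
velocities = init_velocities(swarm, rng)
```

The first $n$ particles are the original experts, and every added particle lies on the line segment between two of them.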

Step 1. Velocity Update. The movement of LLM experts is mainly governed by the velocity $\mathbf{v}$, a direction in the weight space. We posit that the weight neighborhoods of good model checkpoints might be promising to explore (Eilertsen et al., 2020); thus the velocity ${\mathbf{v}}_{i}$ of each particle is iteratively drawn towards its personal best ${\mathbf{p}}_{i}$ and the global best $\mathbf{g}$, and repelled by the global worst ${\mathbf{g}}_{w}$. Concretely:

$${\mathbf{v}}_{i} \leftarrow \frac{1}{\mathcal{C}}\left\lbrack {{r}_{v}{\phi }_{v}{\mathbf{v}}_{i} + {r}_{p}{\phi }_{p}\left( {{\mathbf{p}}_{i} - {\mathbf{x}}_{i}}\right) + {r}_{g}{\phi }_{g}\left( {\mathbf{g} - {\mathbf{x}}_{i}}\right) - {r}_{w}{\phi }_{w}\left( {{\mathbf{g}}_{w} - {\mathbf{x}}_{i}}\right) }\right\rbrack$$

where $\mathcal{C} = {r}_{v}{\phi }_{v} + {r}_{p}{\phi }_{p} + {r}_{g}{\phi }_{g} + {r}_{w}{\phi }_{w}$ is a normalization term. To dissect this formula:

The new velocity is the weighted average of four factors: ${\mathbf{v}}_{i}$, the particle keeps some of its current velocity (i.e., inertia); $\left( {{\mathbf{p}}_{i} - {\mathbf{x}}_{i}}\right)$, it is drawn towards its personal best; $\left( {\mathbf{g} - {\mathbf{x}}_{i}}\right)$, drawn towards the global best; $- \left( {{\mathbf{g}}_{w} - {\mathbf{x}}_{i}}\right)$, repelled from the global worst. Inertia enables each expert to chart an independent search path, personal/global best terms encourage experts to explore good weight neighborhoods, while the global worst term repels experts to stay clear of bad model checkpoints.
Hyperparameters (inertia ${\phi }_{v}$, cognitive coefficient ${\phi }_{p}$, social coefficient ${\phi }_{g}$, repel coefficient ${\phi }_{w}$, all $\in \left\lbrack {0,1}\right\rbrack$) are configurable and govern how much the search process is impacted by ${\mathbf{p}}_{i}$, $\mathbf{g}$, and ${\mathbf{g}}_{w}$. In particular, inertia ${\phi }_{v}$ has a unique control over exploration, where lower ${\phi }_{v}$ means more exploration (less impacted by current velocity and more by other models) and vice versa.
Walk randomness factors ${r}_{v},{r}_{p},{r}_{g},{r}_{w} \sim \mathcal{U}\left( {0,1}\right)$ ensure that the search is not deterministic, boosting particle exploration; they are crucial in the collaborative search process (Table 5).

*This is to avoid all particles collapsing into the global best $\mathbf{g}$ like a "black hole", which would reduce exploration.

${}^{ \dagger }$ All particles perform velocity and location updates in parallel; we omit the time stamp $k$ for brevity.
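The velocity update can be sketched for a single particle as follows. This is a minimal illustration that assumes weights are flat lists of floats; the function name `update_velocity` and the fallback used when the normalization term is zero are assumptions made for the sketch, not part of the paper.

```python
import random

def update_velocity(x, v, p, g, g_w, phi_v, phi_p, phi_g, phi_w, rng):
    """Normalized mix of inertia, personal best, global best, and global worst terms."""
    # Fresh randomness factors r_v, r_p, r_g, r_w ~ U(0, 1) for every update.
    r_v, r_p, r_g, r_w = (rng.random() for _ in range(4))
    w_v, w_p, w_g, w_w = r_v * phi_v, r_p * phi_p, r_g * phi_g, r_w * phi_w
    c = w_v + w_p + w_g + w_w  # normalization term C
    if c == 0.0:
        return list(v)  # assumption: keep the old velocity if every weight is zero
    return [
        (w_v * v_k + w_p * (p_k - x_k) + w_g * (g_k - x_k) - w_w * (gw_k - x_k)) / c
        for x_k, v_k, p_k, g_k, gw_k in zip(x, v, p, g, g_w)
    ]

rng = random.Random(0)
# A resting particle that already sits on its personal and global best is only
# pushed away from the global worst at [1.0, 0.0].
new_v = update_velocity(x=[0.0, 0.0], v=[0.0, 0.0], p=[0.0, 0.0], g=[0.0, 0.0],
                        g_w=[1.0, 0.0], phi_v=0.5, phi_p=0.5, phi_g=0.5,
                        phi_w=0.5, rng=rng)
```

In this example only the repel term is non-zero, so the new velocity points in the negative direction of the first coordinate, away from the global worst.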

Step 2. Weight Update. Based on the velocity $\mathbf{v}$, the weights/locations of LLM experts are updated by taking a step in the direction of $\mathbf{v}$: ${\mathbf{x}}_{i} \leftarrow {\mathbf{x}}_{i} + \lambda {\mathbf{v}}_{i}$, where $\lambda$ is the step length hyperparameter. The updated particles are then evaluated on the utility function $f$ to update $\mathbf{g}$, ${\mathbf{g}}_{w}$, and ${\left\{ {\mathbf{p}}_{i}\right\} }_{i = 1}^{N}$, if necessary.

Since MODEL SWARMS explicitly encourages randomness and exploration, particles might sometimes fail to find desirable locations and stray away. We propose to restart undesirable particles and give them another chance: concretely, if for particle $i$ the personal best ${\mathbf{p}}_{i}$ didn't change in ${c}_{r}$ iterations, where ${c}_{r}$ is a hyperparameter, we put the particle back to its personal-best location with ${\mathbf{x}}_{i} \leftarrow {\mathbf{p}}_{i}$ and ${\mathbf{v}}_{i} \leftarrow \mathbf{0}$, essentially granting the particle another chance with a relatively good starting point. In this way, MODEL SWARMS strikes a balance between exploration and robustness.
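A minimal sketch of the location update and the restart rule for one particle is given below. The function name `step_particle`, the explicit `stall` counter, and the flat-list weight representation are assumptions made for illustration; the bookkeeping for $\mathbf{g}$ and ${\mathbf{g}}_{w}$ is left out to keep the sketch short.

```python
def step_particle(x, v, p, f_p, stall, f, step_length, restart_patience):
    """Move one particle, refresh its personal best, and restart it if it stalls.

    Returns the updated (x, v, p, f_p, stall).
    """
    # Location update: x_i <- x_i + lambda * v_i
    x = [x_k + step_length * v_k for x_k, v_k in zip(x, v)]
    f_x = f(x)
    if f_x > f_p:
        p, f_p, stall = list(x), f_x, 0  # new personal best
    else:
        stall += 1                       # personal best unchanged this iteration
    if stall >= restart_patience:
        # Restart: back to the personal best with zero velocity.
        x, v, stall = list(p), [0.0] * len(v), 0
    return x, v, p, f_p, stall

def toy_utility(weights):
    """Hypothetical utility with its maximum at the origin."""
    return -sum(w * w for w in weights)

# A particle drifting away from the optimum is restarted after two stalled steps.
state = ([1.0], [1.0], [1.0], toy_utility([1.0]), 0)
for _ in range(2):
    state = step_particle(*state, f=toy_utility, step_length=0.5, restart_patience=2)
```

After the two steps the particle is back at its personal best `[1.0]` with velocity `[0.0]`.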

Step 3. End of Iteration. If the global best $\mathbf{g}$ hasn't changed in $c$ iterations (patience hyperparameter) or the maximum number of iterations $\mathcal{K}$ is reached, the search process ends. Otherwise the step length $\lambda$ is reduced by a hyperparameter factor ${\phi }_{\lambda }$, $\lambda \leftarrow \lambda \times {\phi }_{\lambda }$, and the search goes back to Step 1. In the end, the global best expert $\mathbf{g}$ is returned as the product of MODEL SWARMS.
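Steps 0 to 3 can be put together in a compact, self-contained sketch on toy weight vectors. This is not the authors' implementation: weights are flat lists of floats, `toy_utility` is a hypothetical stand-in for a real utility function, and the initial step length and the $\phi$ coefficients are arbitrary illustrative values. The "parallel" update is approximated by letting every particle in an iteration see the same snapshot of $\mathbf{g}$ and ${\mathbf{g}}_{w}$.

```python
import random

def model_swarms(experts, f, swarm_size=20, step_length=0.5, phi_lambda=0.95,
                 phi_v=0.3, phi_p=0.3, phi_g=0.3, phi_w=0.1,
                 patience=10, restart_patience=5, max_iter=50, seed=0):
    """Toy Model Swarms search over flat weight lists; returns (g, f(g))."""
    rng = random.Random(seed)
    # Step 0: populate by pairwise interpolation, then set velocities and bests.
    originals = [list(x) for x in experts]
    xs = [list(x) for x in originals]
    while len(xs) < swarm_size:
        a, b = rng.sample(originals, 2)
        t = rng.random()
        xs.append([t * u + (1.0 - t) * w for u, w in zip(a, b)])
    vs = []
    for x in xs:
        target = rng.choice(xs)
        vs.append([t_k - x_k for x_k, t_k in zip(x, target)])
    ps = [list(x) for x in xs]
    f_ps = [f(x) for x in xs]
    stalls = [0] * len(xs)
    g, f_g = list(xs[0]), f_ps[0]
    g_w, f_gw = list(xs[0]), f_ps[0]
    for x, f_x in zip(xs, f_ps):
        if f_x > f_g:
            g, f_g = list(x), f_x
        if f_x < f_gw:
            g_w, f_gw = list(x), f_x
    since_improved = 0
    for _ in range(max_iter):
        if since_improved >= patience:  # g unchanged for `patience` iterations
            break
        g_snap, gw_snap, improved = g, g_w, False  # shared view of g and g_w
        for i in range(len(xs)):
            # Step 1: velocity update with fresh randomness factors.
            w_v, w_p, w_g, w_w = (rng.random() * phi
                                  for phi in (phi_v, phi_p, phi_g, phi_w))
            c = (w_v + w_p + w_g + w_w) or 1.0
            vs[i] = [(w_v * v + w_p * (p - x) + w_g * (gb - x) - w_w * (gw - x)) / c
                     for x, v, p, gb, gw in zip(xs[i], vs[i], ps[i], g_snap, gw_snap)]
            # Step 2: location update and bookkeeping.
            xs[i] = [x + step_length * v for x, v in zip(xs[i], vs[i])]
            f_x = f(xs[i])
            if f_x > f_g:
                g, f_g, improved = list(xs[i]), f_x, True
            if f_x < f_gw:
                g_w, f_gw = list(xs[i]), f_x
            if f_x > f_ps[i]:
                ps[i], f_ps[i], stalls[i] = list(xs[i]), f_x, 0
            else:
                stalls[i] += 1
            if stalls[i] >= restart_patience:  # restart from the personal best
                xs[i], vs[i], stalls[i] = list(ps[i]), [0.0] * len(vs[i]), 0
        # Step 3: patience counter and step length schedule.
        since_improved = 0 if improved else since_improved + 1
        step_length *= phi_lambda
    return g, f_g

def toy_utility(x):
    """Hypothetical utility with its maximum at (1, 2)."""
    return -((x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2)

experts = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
g, f_g = model_swarms(experts, toy_utility)
```

Because $\mathbf{g}$ is only ever replaced by a strictly better location, the returned utility is never lower than that of the best starting expert.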

3 EXPERIMENT SETTINGS

Models and Implementation. We implement a prototype of MODEL SWARMS with GEMMA-7B (google/gemma-7b-it) (Gemma Team et al., 2024) in the main paper, while we also employ other LLMs such as MISTRAL-7B (Jiang et al., 2023a) in Table 7. We create a pool of 10 initial experts/particles by fine-tuning GEMMA-7B separately on the 10 SFT data domains${}^{ \ddagger }$ in Tulu-v2 (Ivison et al., 2023) with LoRA (Hu et al., 2022). We fine-tune for 5 epochs with a starting learning rate of 2e-4 and an effective batch size of 32 by default. For MODEL SWARMS searches, we employ $N = {20}$, ${\phi }_{\lambda } = {0.95}$, patience $p = {10}$, restart patience ${p}_{r} = 5$ (denoted $c$ and ${c}_{r}$ in Algorithm 1), and $\mathcal{K} = {50}$, while running grid search over the other hyperparameters and reporting the best-found expert based on utility function $f$.

Baselines We compare with 12 model composition baselines in three categories.

Trivial composition: 1) Best Single expert, essentially $\arg \mathop{\max }\limits_{\mathbf{x}}f\left( \mathbf{x}\right)$ for $\mathbf{x} \in {\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n}$; 2) Data Merge, where the 10 SFT data domains in Tulu-v2 are merged to train one single expert; 3) Prediction Merge, where the predictions of ${\left\{ {\mathbf{x}}_{i}\right\} }_{i = 1}^{n}$ are ensembled via plurality vote (if applicable).
Static composition, where the composed expert is independent of the adaptation task/utility function $f$. We evaluate Uniform Soup (Wortsman et al., 2022a), Slerp, Dare-Ties (Yu et al., 2024; Yadav et al., 2024), and Model Stocks (Jang et al., 2024).
Dynamic composition, where the composed expert changes based on the utility function $f$. We evaluate Greedy Soup (Wortsman et al., 2022a), Pack of LLMs (Mavromatis et al., 2024), cBTM (Gururangan et al., 2023), EvolMerge (Akiba et al., 2024), and LoraHub (Huang et al., 2023). These approaches are also guided by the utility function $f$ like MODEL SWARMS.
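As an illustration of the Prediction Merge baseline, plurality voting over the experts' answers to one question could be sketched as below. The function name `prediction_merge` and the tie-breaking behavior (the answer seen first wins) are assumptions; the source does not specify how ties are handled.

```python
from collections import Counter

def prediction_merge(predictions):
    """Return the most common answer among the experts' predictions.

    Assumption: ties go to the answer that appears first in `predictions`.
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical answers from four experts to a multiple-choice question.
merged = prediction_merge(["B", "A", "B", "C"])
```

Here two of the four experts answer `"B"`, so `merged` is `"B"`.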

Data and Evaluation We investigate whether MODEL SWARMS could adapt LLM experts via collaborative search on four types of adaptation objectives and the corresponding utility functions.

Single task: we employ 9 datasets spanning knowledge (MMLU (Hendrycks et al., 2021), MMLU-pro (Wang et al., 2024e), Hellaswag (Zellers et al., 2019)), reasoning (GSM8k (Cobbe et al., 2021), Knowledge Crosswords (Ding et al., 2024), NLGraph (Wang et al., 2024a; Zhang et al., 2024b)), and safety (TruthfulQA (Lin et al., 2022), RealToxicityPrompts (Gehman et al., 2020), AbstainQA (Feng et al., 2024a)). By default, we randomly sample 200 and 1,000 examples as the validation and test sets, respectively; the utility function $f$ is defined as performance on the validation set.
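A single-task utility function of this kind might look like the sketch below. It assumes accuracy as the performance metric and disjoint validation and test samples; the helper names, the `predict(weights, question)` callable, and the toy parity task are all hypothetical and only serve to make the example runnable.

```python
import random

def split_validation_test(examples, n_val=200, n_test=1000, seed=0):
    """Randomly sample disjoint validation and test sets (disjointness is assumed)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_val], shuffled[n_val:n_val + n_test]

def make_accuracy_utility(predict, validation_set):
    """Build f: weights -> accuracy of predict(weights, question) on the validation set."""
    def utility(weights):
        correct = sum(predict(weights, q) == answer for q, answer in validation_set)
        return correct / len(validation_set)
    return utility

# Toy task: the "question" is an integer and the gold answer is its parity.
examples = [(i, i % 2) for i in range(1500)]
validation_set, test_set = split_validation_test(examples)

def toy_predict(weights, question):
    """Hypothetical model: correct when weights[0] is 0, always wrong when it is 1."""
    return (question + weights[0]) % 2

f = make_accuracy_utility(toy_predict, validation_set)
```

With this setup `f([0])` evaluates to `1.0` and `f([1])` to `0.0`, and the search only ever sees the 200 validation examples.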

${}^{ \ddagger }$ We replace the GPT-4 Alpaca subset with Gemini-distilled Alpaca and remove the hardcoded subset.
