HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning
Chunlin Tian
University of Macau
Zhan Shi
University of Texas at Austin
Zhijiang Guo
University of Cambridge
Li
University of Macau
Chengzhong Xu
University of Macau
Abstract
Adapting Large Language Models (LLMs) to new tasks through fine-tuning has been made more efficient by the introduction of Parameter-Efficient Fine-Tuning (PEFT) techniques, such as LoRA. However, these methods often underperform compared to full fine-tuning, particularly in scenarios involving complex datasets. This issue becomes even more pronounced in complex domains, highlighting the need for improved PEFT approaches that can achieve better performance. Through a series of experiments, we have uncovered two critical insights that shed light on the training and parameter inefficiency of LoRA. Building on these insights, we have developed HydraLoRA, a LoRA framework with an asymmetric structure that eliminates the need for domain expertise. Our experiments demonstrate that HydraLoRA outperforms other PEFT approaches, even those that rely on domain knowledge during the training and inference phases.
1 Introduction
Large Language Models (LLMs; [10, 3, 35, 45, 46, 31, 32]) are notably powerful, yet their training involves substantial expense. Adapting a single LLM for multiple downstream applications via fine-tuning has emerged as a prevalent method to cater to specific domain needs, balancing performance with practicality. This approach, however, faces a significant challenge due to the extensive memory and computational resources required for full fine-tuning (FFT), i.e., fine-tuning all billions of parameters. A solution to this has been the development of more selective adaptation techniques, which modify only a portion of the parameters or integrate external modules designed for new tasks. Key methodologies in this sphere include LoRA [17], Adapters [36, 16, 30], and many other variants [24, 23, 9, 13, 51], all part of what is generally termed Parameter-Efficient Fine-Tuning (PEFT). PEFT strategies freeze the backbone model parameters and introduce and fine-tune only a small number of task-specific parameters. This substantially boosts efficiency in both fine-tuning and subsequent deployment, marking a significant advancement in the practical use of LLMs.
While fine-tuning a small subset of parameters offers a streamlined approach for domain adaptation, it's well-recognized that model performance is closely tied to the number of parameters involved [21]. This intrinsic characteristic of methods like LoRA often results in them falling short of the FFT baseline, which updates all parameters, thereby creating a trade-off between efficiency and model quality. This issue of compromised quality in a low-parameter setting becomes even more pronounced in target domains characterized by complex sub-domains and diverse tasks. This situation presents a compelling research question:
Figure 1: Illustration of LoRA architecture changes in HydraLoRA. Only the tunable parameters are shown in this Figure. (a) LoRA architecture with matrix A to achieve low rank and matrix B to recover. (b) under the same parameter count, a monolithic LoRA is split into multiple smaller A and B matrices to avoid training interference. (c) based on (b), HydraLoRA has an asymmetric structure that has a shared A matrix and multiple B matrices.
What is the optimal architecture that can deliver superior model performance while still capitalizing on the efficiency benefits of a reduced parameter footprint?
In our research, we carry out a series of exploratory experiments, applying LoRA to the LLaMA2 [46] model to adapt it to a new domain encompassing multiple downstream tasks. As shown in Figure 1(a), LoRA adds trainable pairs of rank decomposition matrices A and B in addition to existing weight matrices. Our in-depth analysis of LoRA's mechanics yields several insightful observations and leads to the formulation of key hypotheses. First, rather than employing a single LoRA for the entire domain, it proves more effective to deploy multiple, smaller LoRA heads, each dedicated to a specific downstream task (see Figure 1(b)). This suggests that domain or task interference might harmfully impact the training process. We further hypothesize that this interference originates from "intrinsic components"—sub-domains or distinct tasks—potentially unknown even to domain experts. Additionally, upon visualizing the parameters of LoRA, we discern a pattern: some parameters predominantly learn the commonalities across all data, while others focus on the unique aspects of each intrinsic component. From these observations, we posit that an optimal LoRA architecture should embody an explicit, asymmetric structure.
Building upon these observations, we propose an improved end-to-end LoRA framework, which we refer to as HydraLoRA. From the architecture perspective, unlike LoRA's symmetric structure, HydraLoRA has an asymmetric structure with a shared A matrix and multiple B matrices (see Figure 1(c)). The shared A matrix is used by all samples for parameter efficiency. During the fine-tuning phase, HydraLoRA is designed to autonomously identify "intrinsic components" and segregate training samples into distinct B matrices. During the inference phase, HydraLoRA leverages the multiple B matrices in a Mixture-of-Experts (MoE; [19, 39]) manner. Unlike prior work, HydraLoRA completely eliminates the need for human expertise and assumptions, and it shows better performance than approaches that use domain knowledge to guide the fine-tuning process.
2 Background and Motivation
2.1 LoRA Basics
LoRA [17] achieves performance comparable to full fine-tuning on many benchmarks by freezing the pre-trained model weights $W_0$ and inserting trainable rank decomposition matrices into each layer of the pre-trained model. In particular, for each layer, LoRA uses two sequential low-rank matrices $A$ and $B$ to fit the residual weights for adaptation. The forward computation is written as follows:
$$y' = y + \Delta y = W_0 x + BAx \qquad (1)$$
where $y \in \mathbb{R}^{d}$ is the output and $x \in \mathbb{R}^{k}$ denotes the input, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$. Normally matrix $B$ is initialized with zeroes and matrix $A$ is initialized with Kaiming Uniform [14] to force $\Delta y = 0$ at the beginning.
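The forward computation and initialization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the class name `LoRALinear`, the helper `kaiming_uniform`, the layer sizes, and the choice of Kaiming gain are all assumptions made for the example.

```python
import numpy as np

def kaiming_uniform(shape, rng):
    """Kaiming-uniform init, U(-b, b) with b = sqrt(6 / fan_in).

    Assumption: gain sqrt(2) (the ReLU case); the source only names the scheme.
    """
    fan_in = shape[1]
    bound = np.sqrt(6.0 / fan_in)
    return rng.uniform(-bound, bound, size=shape)

class LoRALinear:
    """Hypothetical single linear layer with a LoRA update: y' = W0 x + B A x."""

    def __init__(self, W0, r, seed=0):
        rng = np.random.default_rng(seed)
        d, k = W0.shape
        self.W0 = W0                            # frozen pre-trained weight, (d, k)
        self.A = kaiming_uniform((r, k), rng)   # trainable, (r, k)
        self.B = np.zeros((d, r))               # trainable, zero init so that B A = 0

    def forward(self, x):
        y = self.W0 @ x                  # frozen path
        delta_y = self.B @ (self.A @ x)  # low-rank path, costs O(r (d + k))
        return y + delta_y

# Illustrative sizes only.
rng = np.random.default_rng(1)
d, k, r = 16, 32, 4
W0 = rng.standard_normal((d, k))
layer = LoRALinear(W0, r)
x = rng.standard_normal(k)
y_prime = layer.forward(x)
```

Because `B` starts at zero, `y_prime` equals `W0 @ x` before any training step, which is the "force $\Delta y = 0$ at the beginning" property stated in the text.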
2.2 LoRA's Practical Dilemma
Parameter count has a clear impact on the performance of neural models [21, 32]. Yet, Parameter-Efficient Fine-tuning (PEFT) methods, such as Adapter [16] and prefix-tuning [24], focus on fine-tuning a limited set of parameters. These approaches present a practical dilemma: while restricting the number of tuned parameters is essential for training efficiency, it hinders the model's ability to learn from diverse datasets. This trade-off becomes particularly evident when considering corpus heterogeneity [2]. Figure 2 reveals a notable performance disparity between PEFT techniques and full fine-tuning (FFT), with the gap widening in scenarios involving a more diverse or heterogeneous training corpus.
Figure 2: Performance impact of corpus heterogeneity on full fine-tuning vs. parameter-efficient fine-tuning. Heterogeneity signifies the diversity within the dataset, often leading to interference due to its varied content and style [2]. Parameter-efficient approaches are particularly sensitive, suffering greater performance losses in heterogeneous cases.
Table 1: Performance on instruction tuning with Dolly-15K [8], evaluated on MMLU [15] with different ranks. LoRA (Split) decomposes a high-rank LoRA module into smaller, equivalent low-rank components ($r \times n$), where $n$ is the number of LoRAs and $r$ denotes the rank of each LoRA.
2.3 Observations
In this work, we aim for a PEFT approach that strikes a better balance between maximizing the learning capability for heterogeneous data and minimizing the number of parameters involved. A key goal is to ensure that our enhanced technique exhibits robust generalization across unseen tasks, independent of any prior task-specific knowledge. To achieve our objectives, we focus on LoRA and conduct a series of experiments, summarized in Table 1, to gain a deeper understanding of its mechanisms. Our methodology involves leveraging data from diverse tasks within a domain and training distinct LoRA heads for each domain, leading to our first observation:
Observation I: With the same parameter count, rather than employing a single LoRA for the entire domain dataset, it proves more effective to deploy multiple, smaller LoRA heads, each dedicated to a specific downstream task.
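The "same parameter count" condition in Observation I can be checked with simple arithmetic: one LoRA of rank $r \cdot n$ has as many trainable parameters as $n$ LoRAs of rank $r$. The sketch below assumes a square layer of width 4096 and the ranks shown purely for illustration; the function names are hypothetical.

```python
def lora_params(d, k, r):
    """Trainable parameters of one LoRA module: A is (r, k), B is (d, r)."""
    return r * k + d * r

def split_lora_params(d, k, r, n):
    """Trainable parameters of n independent LoRA modules, each of rank r."""
    return n * lora_params(d, k, r)

# Illustrative layer size and ranks (assumptions, not values from Table 1).
d = k = 4096
single = lora_params(d, k, r=32)             # one monolithic rank-32 LoRA
split = split_lora_params(d, k, r=8, n=4)    # four rank-8 LoRA heads
```

Both settings give the same budget, so any accuracy difference between them comes from how the parameters are organized rather than from how many there are.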
This suggests that interference among tasks might harm the training process. Furthermore, we posit that this interference is not exclusive to explicit multi-task training. It could happen in any training setting, since all datasets inherently consist of multiple implicit intrinsic components, such as sub-domains or tasks within a domain, which may be unknown even to domain experts. To better understand how multiple LoRA heads mitigate the interference among intrinsic components, in Figure 3 we employ the t-SNE technique [47] to visualize the parameters of matrices A and B across all heads. This analysis yields another critical observation:
Observation II: When multiple LoRA heads are trained individually on different data, the parameters of matrix $A$ from different heads tend to converge, while those of matrix $B$ are distinguishable.
In detail, the parameters of matrix A across all heads exhibit a high degree of similarity, which causes them to overlap in the figure. Conversely, the parameters of matrix B from different heads are distinct and easily distinguishable. We posit that this divergence is an artifact of the initialization schemes, with matrix A inclined toward capturing commonalities across domains, while matrix B adapts to domain-specific diversities. The distinction between matrices A and B offers valuable insights for enhancing both parameter efficiency and effectiveness. From an efficiency standpoint, our hypothesis suggests that the parameters of matrix A could potentially be shared across multiple heads, thereby reducing redundancy. Regarding effectiveness, the parameters of matrix B are dispersed across heads, which suggests that using a single head to adapt to multiple domains might be less effective than using an individual head for each domain, which minimizes the interference between domains.
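The redundancy argument can be made concrete with a short count. Assuming $N$ heads of rank $r$ on a layer with input width $k$ and output width $d$ (symbols chosen here for illustration), sharing one $A$ across heads removes $N-1$ copies of it:

```latex
\begin{align*}
\text{$N$ independent heads:} \quad & P_{\text{split}}  = N\,(rk + dr) \\
\text{one shared $A$, $N$ matrices $B_i$:} \quad & P_{\text{shared}} = rk + N\,dr \\
\text{saving:} \quad & P_{\text{split}} - P_{\text{shared}} = (N-1)\,rk
\end{align*}
```

For a square layer ($d = k$) the shared variant therefore approaches half the parameters of the split variant as $N$ grows.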
Building upon our observations, we propose an optimized LoRA architecture designed to enhance cost-effectiveness. In this architecture, we share the parameters of the A matrix across various sub-domains or tasks to improve parameter efficiency, while deploying multiple B matrices, each tailored to handle a different intrinsic component. This design allows for a more effective adaptation to the specific characteristics of each component. While these intrinsic components can be manually identified using prior knowledge of the training data, we also introduce an end-to-end method using Mixture-of-Experts (MoE) [20], which is detailed in the methodology section. This automatic approach improves flexibility and applicability, particularly in scenarios where prior knowledge is limited or unavailable.
Figure 3: Breakdown analysis of LoRA modules. Using t-SNE, we compare the LoRA modules fine-tuned on Dolly-15K [8] with those fine-tuned on three subtasks of Dolly-15K: "summarization (Sum)", "closed QA (QA)", and "information extraction (IE)". The base model is LLaMA2-7B (random seed=42), which contains 32 decoder layers, corresponding to 32 adaptive modules. Each module consists of the submodules {0: q_proj of A, 1: q_proj of B, 2: v_proj of A, 3: v_proj of B}, for a total of $32 \times 4 = 128$ submodules. Left displays all submodules. Center shows all even submodules, i.e., the A matrices. Right shows all odd submodules, i.e., the B matrices. The differences between the fine-tuned LoRA modules for different tasks arise mainly from the B matrix.
3 HydraLoRA
In this section, we introduce the proposed HydraLoRA, an asymmetric LoRA architecture for efficient fine-tuning, as illustrated in Figure 1. We then present the workflow of HydraLoRA, shown in Figure 4.
3.1 Asymmetric LoRA architecture
The LoRA method updates two low-rank matrices $A$ and $B$, and uses $BA$ as the change to a pretrained and frozen weight $W_0$ of a linear layer, as shown in Eq. 1. In the original LoRA, the full set of parameters is fine-tuned on the whole corpus, which makes it difficult to learn the various knowledge aspects. Drawing from the detailed breakdown analysis of LoRA, a potential solution is to segment the entire LoRA into "Hydra"-structured LoRA variants, characterized by a central shared matrix $A$ and several distinct matrices $B$, fostering a blend of shared knowledge and specialized functionalities. As shown in Figure 1, HydraLoRA fine-tunes these LoRAs to achieve robust performance without redundancy, thereby benefiting the entire heterogeneous corpus. The asymmetric LoRA architecture can be formulated as:
$$W = W_0 + \Delta W = W_0 + \sum_{i=1}^{N} \omega_i \cdot B_i A \qquad (2)$$
where the matrices $B_i \in \mathbb{R}^{d \times r}$ and the shared $A \in \mathbb{R}^{r \times k}$. The hyper-parameter $N$ denotes the number of $B$ matrices. The term $\omega_i$ modulates the contribution weight of head $B_i$.
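The asymmetric formulation above (a frozen weight plus a weighted sum of head-specific matrices applied after one shared matrix) can be sketched in NumPy as follows. The function name `hydra_delta_w`, the sizes, and the fixed contribution weights `omega` are assumptions for illustration; in the paper the weights are produced by a router rather than set by hand.

```python
import numpy as np

def hydra_delta_w(A, Bs, omega):
    """Weight update sum_i omega_i * (B_i @ A), returned with shape (d, k).

    A:     shared matrix, shape (r, k)
    Bs:    stacked head matrices, shape (N, d, r)
    omega: contribution weights, shape (N,)
    """
    # Because A is shared, the heads can be mixed first: (sum_i omega_i B_i) @ A.
    mixed_B = np.tensordot(omega, Bs, axes=1)  # shape (d, r)
    return mixed_B @ A

# Illustrative sizes and random values only.
rng = np.random.default_rng(0)
d, k, r, N = 16, 32, 4, 3
W0 = rng.standard_normal((d, k))       # frozen pre-trained weight
A = rng.standard_normal((r, k))        # shared across all heads
Bs = rng.standard_normal((N, d, r))    # one B matrix per head
omega = np.array([0.2, 0.5, 0.3])      # hand-picked weights for the example

W = W0 + hydra_delta_w(A, Bs, omega)
x = rng.standard_normal(k)
y_merged = W @ x
# Same result computed head by head, reusing the shared projection A @ x.
y_heads = W0 @ x + sum(w * (B @ (A @ x)) for w, B in zip(omega, Bs))
```

The two outputs agree, which shows one practical consequence of sharing the A matrix: the projection `A @ x` is computed once per input, and only the small `B` matrices differ across heads.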
3.2 Workflow of HydraLoRA
Figure 4 illustrates the workflow of HydraLoRA. First, HydraLoRA adaptively identifies and initializes the LoRA modules for a heterogeneous corpus, aligning them with task relevance through the application of $k$-means or a developer-specified size. Subsequently, we propose a Mixture-of-Experts (MoE) framework that treats the $B$ matrices as expert adapters; computational efficiency is maintained throughout the fine-tuning (Section 3.2.1) and inference (Section 3.2.2) stages by freezing the rest of the LLM parameters. During inference, it flexibly and dynamically merges multiple $B$ matrices through the MoE router.
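To make the routing step concrete, the sketch below shows one way a router could produce per-input contribution weights and merge the expert matrices. The exact router is not specified in this section, so the dense softmax gate, the gate matrix `W_g`, and the function names are assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def route(x, W_g):
    """Hypothetical dense router: one logit per B head, normalized by softmax."""
    return softmax(W_g.T @ x)          # shape (N,)

def hydra_forward(x, W0, A, Bs, W_g):
    """Frozen path plus router-weighted mixture of B heads over a shared A."""
    omega = route(x, W_g)
    h = A @ x                          # shared low-rank projection, computed once
    delta = sum(w * (B @ h) for w, B in zip(omega, Bs))
    return W0 @ x + delta, omega

# Illustrative sizes and random values only.
rng = np.random.default_rng(0)
d, k, r, N = 16, 32, 4, 3
W0 = rng.standard_normal((d, k))       # frozen LLM weight
A = rng.standard_normal((r, k))        # shared A matrix
Bs = rng.standard_normal((N, d, r))    # expert B matrices
W_g = rng.standard_normal((k, N))      # assumed trainable gate weights
x = rng.standard_normal(k)

y, omega = hydra_forward(x, W0, A, Bs, W_g)
```

In this sketch only `A`, `Bs`, and `W_g` would be trained while `W0` stays frozen, which mirrors the description of freezing the rest of the LLM parameters.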