TOKENFORMER: RETHINKING TRANSFORMER SCALING WITH TOKENIZED MODEL PARAMETERS
Haiyang Wang ,Yue Fan ,Muhammad Ferjad Naeem ,Yongqin Xian ,
Jan Eric Lenssen ,Liwei Wang ,Federico Tombari ,Bernt Schiele
Max Planck Institute for Informatics Google Peking University
{haiwang, schiele}@mpi-inf.mpg.de
ABSTRACT
Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce Tokenformer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are publicly available.
1 INTRODUCTION
Designing a powerful neural network architecture is a long-standing goal in machine learning. Recent developments in foundation models (FMs) have shown the potential of Transformers (Vaswani et al., 2017) as a universal computational architecture. Thanks to their flexibility and scalability, Transformers have achieved state-of-the-art performance across various domains, including natural language processing (NLP) (Radford et al., 2018; 2019; Brown et al., 2020), visual modeling (Dosovitskiy et al., 2021; Liu et al., 2021), vision-language (Liu et al., 2023; Wang et al., 2024), graph representation (Ying et al., 2021), and 3D vision (Wang et al., 2023a;b).
Transformers typically divide the computation required to process a single token into two distinct parts: interactions with other input tokens (token-token interaction) and computations involving the model's parameters (token-parameter interaction). The attention mechanism (Vaswani et al. 2017) facilitates token-token interactions, allowing modern general-purpose foundation models to encode multi-modal data into a unified token sequence and effectively capture complex dependencies among them (Liu et al., 2023; Zhu et al., 2023; Wang et al., 2023d). Conversely, token-parameter computations rely heavily on linear projections (Dunford & Schwartz, 1988), where input tokens are multiplied by a fixed set of parameters. This prescribed design limits scalability because increasing the model size requires altering core architectural components, often necessitating retraining the entire model from scratch. As models grow larger, this results in excessive resource consumption, making it increasingly impractical. In this paper, we introduce a novel architecture that enhances the flexibility of token-parameter interactions, allowing for incremental scaling of model parameters and effectively reusing previously trained models, thus significantly reducing the training burden.
Figure 1: Traditionally, large transformer architectures are trained from scratch without reusing previous smaller-scale models (represented by blue dots on the left). In this paper, we propose a novel fully attention-based architecture that allows scaling the model incrementally, thus greatly reducing the overall cost of training large transformer architectures (depicted by red dots on the left). The right panel delineates a comparison between the conventional Transformer and our Tokenformer.
To achieve this objective, we introduce Tokenformer, a novel architecture that unifies the computations of token-token and token-parameter interactions by entirely employing the attention mechanism. The flexibility of our token-parameter attention layer, along with its ability to handle a variable number of parameters, inherently enhances the model's scalability, facilitating progressively efficient scaling.
As shown in Figure 1, we extend the Transformer architecture by preserving the computational patterns between input tokens while reformulating all the linear projections using a cross-attention mechanism. Specifically, to project features with input and output dimensions $d_1$ and $d_2$, we employ two sets of parameters, each comprising $n$ learnable tokens with channel dimensions of $d_1$ and $d_2$, respectively. In this formulation, input tokens serve as queries, and model parameters as keys and values. This flexibility renders our model's parameters inherently scalable with variable $n$, allowing for efficient expansion by continuously adding new key-value parameter pairs. Figure 1 shows that our model can be scaled incrementally from 124M to 1.4B parameters, achieving performance similar to training from scratch while saving more than half of the training cost.
The key contributions of this work are summarized as follows: 1) As shown in Figure 1, we propose Tokenformer, a fully attention-driven neural network that treats model parameters as tokens, maximizing the flexibility of token-parameter computations while achieving competitive performance on standard benchmarks across both language and vision domains. 2) Thanks to this design, our model can be naturally scaled by progressively adding new key-value parameter pairs. Compared with the train-from-scratch approach (Biderman et al., 2023; Kaplan et al., 2020), our method achieves nearly the same performance while greatly reducing training costs.
2 Related Work
Transformer (Vaswani et al., 2017) has emerged as a foundational architecture in deep learning due to its versatile attention mechanism, enabling it to process any tokenized data and adapt to numerous domains, including language modeling (Radford et al., 2018; Touvron et al., 2023), image processing (Dosovitskiy et al., 2021), multi-modal understanding (Liu et al., 2023; Wang et al., 2024; 2023b; 2022), decision making (Chen et al., 2021b), graph learning (Yun et al., 2019), among others. While the Transformer effectively handles interactions among input tokens with flexibility, this property does not extend to computations involving model parameters, which are conducted via prescribed linear projections. In this work, we seek to restructure token-parameter interactions by developing a fully attention-based network that unifies both token-token and token-parameter computations through attention mechanisms, thus further extending the network's flexibility.
Large Scale Training has proven to be an effective approach for developing powerful foundation models. As demonstrated by models like the GPT series (Radford et al., 2018; 2019; Brown et al., 2020), simple architectures, when supported by larger training datasets and increased model sizes (measured in parameters), often outperform more complex algorithms. Scaling up data is generally more cost-effective because it is independent of the model's architecture and allows for the continuous integration of new data through fine-tuning existing models (Kaplan et al., 2020). In contrast, increasing the model size often incurs extremely high costs, as it alters architectural details and usually requires retraining the entire model from scratch at each scaling step (Biderman et al., 2023). This significantly raises the expenses for building progressively larger models in the industry.
Model Reusing. Previous methods for reusing models have typically involved initializing larger models with pre-trained smaller models by duplicating (Chen et al. 2015, 2021a), stacking (Gong et al. 2019), or combining (Wang et al. 2023c) model weights. While these approaches can be effective, they often disturb the pre-established distribution of the smaller model, increasing the risk of losing pre-trained knowledge and slowing convergence. In contrast, our model allows for parameter scaling in a natural and seamless manner and preserves the integrity of the existing model.
3 Methodology
In this section, we first revisit the conventional attention mechanism in Section 3.1. Then, Section 3.2 introduces Tokenformer, a natively scalable architecture centered around a flexible token-parameter attention layer. Finally, incremental model scaling of Tokenformer is detailed in Section 3.3.
3.1 Preliminaries
Transformer models (Vaswani et al. 2017) have established themselves as fundamental architectures in deep learning, demonstrating outstanding performance across a wide range of tasks. The cornerstone of their success is the self-attention mechanism, which allows the model to dynamically assess the importance of each token, efficiently modeling complex dependencies among them.
Given a set of input tokens $X \in \mathbb{R}^{T \times d}$ with channel dimension $d$, the self-attention block first derives input-dependent query $Q$, key $K$, and value $V$, with three distinct linear projections as

$$Q = X \cdot W_Q, \quad K = X \cdot W_K, \quad V = X \cdot W_V,$$
where $W_Q$, $W_K$, and $W_V$ are learnable weight matrices. The attention scores are calculated by measuring the similarity between query and key vectors, followed by a softmax function to obtain normalized weights. These scores are subsequently used to compute the output of the scaled dot-product attention as

$$X_{\text{att}} = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) \cdot V,$$
where $\sqrt{d}$ is a scale factor for alleviating small gradients caused by softmax. Finally, the output is

$$O = X_{\text{att}} \cdot W_O,$$
with $X_{\text{att}}$ being the attention output and $W_O$ as the output projection matrix.
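As a concrete reference, the formulation above can be sketched in a few lines of NumPy (a single head with no masking; all weight matrices are random stand-ins, not trained parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, Wo):
    """Standard scaled dot-product self-attention (single head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                    # token-parameter: fixed linear projections
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))    # token-token interaction
    return (scores @ V) @ Wo                            # output projection

rng = np.random.default_rng(0)
T, d = 4, 8                                             # sequence length, channel dimension
X = rng.standard_normal((T, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(4))
out = self_attention(X, Wq, Wk, Wv, Wo)
print(out.shape)
```

Note that the four projection matrices have fixed shapes tied to $d$; this is precisely the rigidity that the next section relaxes.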
The above architectural design enables the model to flexibly manage interactions between tokens of varying lengths, thereby allowing modern general models to concurrently process any form and quantity of tokenized multi-modal data. This capability has markedly advanced the current AI domain and is fundamental to the success of transformer-based systems.
3.2 TOKENFORMER
Although transformers excel across various domains, their scalability is limited by high computational overheads resulting from prescribed token-parameter interactions (i.e., linear projections). As a result, scaling strategies that adjust architectural components (e.g., channel dimensions) typically require retraining the entire model from the beginning, leading to inefficient use of computational resources.
To overcome this challenge, we propose Tokenformer, an architecture entirely based on attention mechanisms. The central innovation of Tokenformer is the token-parameter attention (Pattention) layer, which incorporates a set of trainable tokens functioning as model parameters and then employs cross-attention to manage interactions between input tokens and these parameter tokens. In this way, the Pattention layer introduces an additional dimension, the number of parameter tokens, which operates independently of the input and output channel dimensions. This decoupling enables input data to dynamically interact with a variable number of parameters, providing the flexibility required for incremental model scaling by reusing pre-trained models. Consequently, training larger models is greatly accelerated while achieving performance on par with transformers trained from scratch.
Figure 2: Tokenformer is a fully attention-driven architecture featuring a new token-Parameter attention (Pattention) layer. The Pattention uses a set of learnable tokens to represent model parameters and lets the input tokens attend to them. As the model scales, Tokenformer adds new learnable tokens to expand the existing key-value parameter sets, while keeping the feature dimension constant and leaving the rest of the computation unaffected.
Pattention Layer. Let the input tokens and output tokens be represented as $X \in \mathbb{R}^{T \times d_1}$ and $O \in \mathbb{R}^{T \times d_2}$, where $T$ is the sequence length, and $d_1$ and $d_2$ are the input and output dimensions, respectively. To implement our Pattention mechanism, we introduce two sets of $n$ learnable parameter tokens: $K_P \in \mathbb{R}^{n \times d_1}$ representing the keys, and $V_P \in \mathbb{R}^{n \times d_2}$ representing the values. The output $O$ from the scaled dot-product Pattention layer is computed as:

$$\mathrm{Pattention}(X, K_P, V_P) = \Theta\!\left(X \cdot K_P^{\top}\right) \cdot V_P,$$
where $\Theta$ is a modified softmax operation for stable optimization of the Pattention layer. The output Pattention scores, $S \in \mathbb{R}^{T \times n}$, are formulated as

$$S_{ij} = f\!\left(\frac{A_{ij} \times \tau}{\sqrt{\sum_{k=1}^{n} |A_{ik}|^{2}}}\right),$$
where $A = X \cdot K_P^{\top}$ is the score matrix, $\tau$ is the scale factor, which is set to $\sqrt{n}$ by default, and $f$ is a non-linearity function, which in our formulation is set to the GeLU function (Hendrycks & Gimpel, 2016). This design improves gradient stability in our architecture and results in better performance compared to the standard softmax operation (see Appendix A and Table 4 for details).
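A minimal NumPy sketch of the Pattention layer as described above (the GeLU uses the common tanh approximation, and the small epsilon guarding the row norm is our addition for numerical safety):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (Hendrycks & Gimpel, 2016)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def pattention(X, K_P, V_P):
    """Token-parameter attention.

    X:   (T, d1) input tokens (queries)
    K_P: (n, d1) learnable key parameter tokens
    V_P: (n, d2) learnable value parameter tokens
    Scores are L2-normalized per row, scaled by tau = sqrt(n), and passed
    through GeLU instead of softmax, matching the modified operation above.
    """
    A = X @ K_P.T                                           # (T, n) raw scores
    tau = np.sqrt(A.shape[1])                               # scale factor, sqrt(n)
    row_norm = np.linalg.norm(A, axis=1, keepdims=True) + 1e-9
    S = gelu(A * tau / row_norm)                            # modified normalization
    return S @ V_P                                          # (T, d2) output tokens

rng = np.random.default_rng(1)
T, d1, d2, n = 4, 8, 16, 32
X = rng.standard_normal((T, d1))
K_P = rng.standard_normal((n, d1)) * d1**-0.5
V_P = rng.standard_normal((n, d2)) * n**-0.5
O = pattention(X, K_P, V_P)
print(O.shape)
```

The key structural point is that `n` appears only in the shapes of `K_P` and `V_P`, not in the input or output dimensions, so it can grow freely.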
Our Pattention layer employs a cross-attention mechanism to manage interactions between tokens and parameters, thereby fully preserving the adaptability characteristic of attention mechanisms. Similar to how self-attention in Transformer models handles sequences with variable lengths, our Pattention layer is designed to process a flexible number of parameters independently of the input and output channel dimensions used in feature projection. This allows network parameters to be expanded seamlessly along the parameter token axis, enabling the effective reuse of pre-trained weights and offering a naturally incremental manner for model scaling.
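The incremental expansion this enables can be illustrated directly: new parameter tokens are appended along the parameter-token axis while the channel dimensions, and hence the rest of the network, are untouched. Zero initialization of the new tokens is one choice (an assumption in this sketch) that lets them start as no-ops alongside the reused weights:

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2 = 8, 16
n_old, n_new = 32, 32                        # existing and newly added parameter tokens

# Pre-trained parameter tokens (random stand-ins for a trained checkpoint).
K_old = rng.standard_normal((n_old, d1))
V_old = rng.standard_normal((n_old, d2))

# Scale up by concatenating fresh tokens along the parameter-token axis.
K_new = np.zeros((n_new, d1))
V_new = np.zeros((n_new, d2))
K_scaled = np.concatenate([K_old, K_new], axis=0)   # (64, 8)
V_scaled = np.concatenate([V_old, V_new], axis=0)   # (64, 16)

# The pre-trained weights are reused bit-for-bit; nothing is re-derived.
assert np.array_equal(K_scaled[:n_old], K_old)
print(K_scaled.shape, V_scaled.shape)
```

Contrast this with a linear projection, where enlarging the model changes the weight-matrix shapes themselves and invalidates the pre-trained tensors.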
Overall Architecture. Figure 2 illustrates the architecture of Tokenformer. Given the input tokens $X_{\text{in}} \in \mathbb{R}^{T \times d}$, we follow the design of the pre-norm transformer; the computation for the output of a Tokenformer layer is represented as follows:

$$X_{\text{inter}} = X_{\text{in}} + \mathrm{MHA}(\mathrm{LN}(X_{\text{in}})), \qquad X_{\text{out}} = X_{\text{inter}} + \mathrm{FFN}(\mathrm{LN}(X_{\text{inter}})),$$

where LN denotes layer normalization, and MHA and FFN are the multi-head self-attention and feed-forward blocks, with their projections built on Pattention layers.
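A schematic of one pre-norm Tokenformer layer follows; the attention and feed-forward blocks are passed in as callables (placeholders here, where in the full model both would be built from Pattention layers):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization over the channel dimension (no learned affine).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def tokenformer_layer(X, attn, ffn):
    """Pre-norm residual layer: attention sub-layer, then feed-forward sub-layer."""
    X = X + attn(layer_norm(X))     # X_inter = X_in + MHA(LN(X_in))
    X = X + ffn(layer_norm(X))      # X_out  = X_inter + FFN(LN(X_inter))
    return X

rng = np.random.default_rng(3)
T, d = 4, 8
X = rng.standard_normal((T, d))
identity = lambda x: x              # placeholder sub-layers for illustration
out = tokenformer_layer(X, identity, identity)
print(out.shape)
```

Because the residual stream keeps the channel dimension $d$ fixed, swapping in wider Pattention sub-layers (larger $n$) leaves this composition unchanged.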