[Paper Translation] FAN: Fourier Analysis Networks


Original paper: arxiv.org/pdf/2410.02…

FAN: Fourier Analysis Networks


Yihong Dong¹\*, Ge Li¹\*, Yongding Tao¹, Xue Jiang¹, Kechi Zhang¹, Jia Li¹, Jing Su², Jun Zhang², Jingjing Xu²


¹School of Computer Science, Peking University  ²ByteDance


dongyh@stu.pku.edu.cn, lige@pku.edu.cn


Abstract


Despite the remarkable success achieved by neural networks, particularly those represented by MLP and Transformer, we reveal that they exhibit potential flaws in the modeling and reasoning of periodicity, i.e., they tend to memorize periodic data rather than genuinely understand the underlying principles of periodicity. However, periodicity is a crucial trait in various forms of reasoning and generalization, underpinning predictability across natural and engineered systems through recurring patterns in observations. In this paper, we propose FAN, a novel network architecture based on Fourier Analysis, which empowers the ability to efficiently model and reason about periodic phenomena. By introducing the Fourier Series, periodicity is naturally integrated into the structure and computational processes of the neural network, thus achieving more accurate expression and prediction of periodic patterns. As a promising substitute for the multi-layer perceptron (MLP), FAN can seamlessly replace MLP in various models with fewer parameters and FLOPs. Through extensive experiments, we demonstrate the effectiveness of FAN in modeling and reasoning about periodic functions, and the superiority and generalizability of FAN across a range of real-world tasks, including symbolic formula representation, time series forecasting, and language modeling.


1 Introduction


The flourishing of modern machine learning and artificial intelligence is inextricably linked to revolutionary advancements in the foundational architectures of neural networks. For instance, the multi-layer perceptron (MLP) (Rosenblatt, 1958; Haykin, 1998) plays a pivotal role in laying the groundwork for current deep learning models, with its expressive power guaranteed by the universal approximation theorem (Hornik et al., 1989). Recent claims about the impressive performance of large models on various tasks are typically supported by the Transformer architecture (Vaswani et al., 2017; Touvron et al., 2023; OpenAI, 2023). In this context, the community's enthusiasm for research on neural networks has never diminished. Several recently emerged neural networks demonstrate notable capabilities in specific fields (Gu & Dao, 2023; Liu et al., 2024), sparking widespread discussion within the community.


Beneath the surface of apparent prosperity, we uncover a critical issue that remains in existing neural networks: they struggle to model periodicity from data. We showcase this issue through an empirical study, as illustrated in Figure 1. The results indicate that existing neural networks, including MLP (Rosenblatt, 1958), KAN (Liu et al., 2024), and Transformer (Vaswani et al., 2017), face difficulties in fitting periodic functions, even a simple sine function. Although they demonstrate proficiency in interpolation within the domain of the training data, they tend to falter when faced with the extrapolation challenges of test data, especially in out-of-domain (OOD) scenarios. Therefore, their generalization capacity is primarily dictated by the scale and diversity of the training data, rather than by learned principles of periodicity that support reasoning. We argue that periodicity is an essential characteristic in various forms of reasoning and generalization, as it provides a basis for predictability in many natural and engineered systems by leveraging recurring patterns in observations.

Figure 1: The performance of different neural networks within and outside the domain of their training data for the sine function, where $x$ is a scalar variable.

\*Equal Contribution

† This work was supported by a cooperation project between Peking University and ByteDance Company. During this time, Yihong was also an intern at ByteDance.

‡ The code is available at github.com/YihongDong/…

In this paper, we investigate a key research problem: how can we enable neural networks to model periodicity? One core reason existing neural networks fail to model periodicity is that they rely heavily on data-driven optimization without explicit mechanisms for understanding the underlying principles in the data. To this end, we propose the Fourier Analysis Network (FAN), a novel neural network framework based on Fourier Analysis. By leveraging the power of the Fourier Series, we explicitly encode periodic patterns within the neural network, offering a way to model general principles from the data. FAN holds great potential as a substitute for the traditional MLP: it not only exhibits exceptional capabilities in periodicity modeling but also demonstrates competitive or superior performance on general tasks.


To verify the effectiveness of FAN, we conduct extensive experiments covering two main aspects: periodicity modeling and applications to real-world tasks. 1) For periodicity modeling, FAN achieves significant improvements in fitting both basic and complex periodic functions compared to existing neural networks (including MLP, KAN, and Transformer), particularly in OOD scenarios. 2) FAN demonstrates superior performance on real-world tasks, including symbolic formula representation, time series forecasting, and language modeling. The experimental results indicate that FAN outperforms the baselines (including MLP, KAN, and Transformer) on the symbolic formula representation task, and that Transformer with FAN surpasses competing models (including Transformer, LSTM (Hochreiter & Schmidhuber, 1997), and Mamba (Gu & Dao, 2023)) on the time series forecasting and language modeling tasks. As a promising substitute for MLP, FAN improves the model's generalization performance while reducing the number of parameters and floating-point operations (FLOPs) employed. We believe FAN is promising as an important component of fundamental model backbones.


2 Preliminary Knowledge


Fourier Analysis (Stein & Weiss, 1971; Duoandikoetxea, 2024) is a mathematical framework that decomposes functions into their constituent frequencies, revealing the underlying periodic structures within complex functions. At the heart of this analysis lies the Fourier Series (Tolstov, 2012), which expresses a periodic function as an infinite sum of sine and cosine terms. Mathematically, for a function $f(x)$, its Fourier Series expansion can be represented as:

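A standard form of this expansion (reconstructed here following the usual convention) is:

$$\mathcal{F}\{f(x)\} = \frac{a_0}{2} + \sum_{n=1}^{\infty}\left(a_n \cos\left(\frac{2\pi n}{T}x\right) + b_n \sin\left(\frac{2\pi n}{T}x\right)\right) \tag{1}$$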

where $T$ is the period of the function, and the coefficients $a_n$ and $b_n$ are determined by integrating the function over one period:

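In the standard convention, the coefficients are given by the definite integrals:

$$a_n = \frac{2}{T}\int_{0}^{T} f(x)\cos\left(\frac{2\pi n}{T}x\right)\mathrm{d}x, \qquad b_n = \frac{2}{T}\int_{0}^{T} f(x)\sin\left(\frac{2\pi n}{T}x\right)\mathrm{d}x \tag{2}$$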

The power of the Fourier Series lies in its ability to represent a wide variety of functions, including non-periodic functions through periodic extensions, enabling the extraction of frequency components. Building on this mathematical foundation, FAN aims to embed periodic characteristics directly into the network architecture, enhancing generalization capabilities and performance on various tasks, particularly in scenarios requiring the identification of patterns and regularities.

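As a small numerical illustration (a sketch, not from the paper), the coefficient integrals can be approximated by a Riemann sum to recover the frequency components of a known periodic signal; the test function and grid size below are arbitrary choices:

```python
import numpy as np

# A known periodic signal with period T = 2, constructed so that
# a_1 = 0.5 and b_2 = 0.25 while all other a_n, b_n (n >= 1) vanish.
T = 2.0
f = lambda x: 1.0 + 0.5 * np.cos(2 * np.pi * x / T) + 0.25 * np.sin(4 * np.pi * x / T)

x = np.linspace(0.0, T, 4096, endpoint=False)  # uniform grid over one period
dx = T / len(x)

def coeff(n):
    """Approximate the coefficient integrals over one period by a Riemann sum."""
    a_n = (2.0 / T) * np.sum(f(x) * np.cos(2 * np.pi * n * x / T)) * dx
    b_n = (2.0 / T) * np.sum(f(x) * np.sin(2 * np.pi * n * x / T)) * dx
    return a_n, b_n

a1, b1 = coeff(1)
a2, b2 = coeff(2)
print(round(a1, 3), round(b2, 3))  # recovers the constructed coefficients: 0.5 0.25
```

On a uniform grid the discrete sum reproduces the orthogonality of the sine/cosine basis almost exactly, so the constructed coefficients are recovered to floating-point precision.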

3 Fourier Analysis Network (FAN)


In this section, we first construct a simple neural network modeled on the formula of the Fourier Series; then, on this basis, we design FAN and present its details. Finally, we discuss the differences between the FAN layer and the MLP layer.


Consider a task involving input–output pairs $\{x_i, y_i\}$, with the objective of identifying a function $f(x): \mathbb{R}^{d_x} \rightarrow \mathbb{R}^{d_y}$ that approximates the relationship such that $y_i \approx f(x_i)$ for all $x_i$, where $d_x$ and $d_y$ denote the dimensions of $x$ and $y$, respectively. To build a simple neural network $f_{\mathrm{S}}(x)$ that represents the Fourier Series expansion of the function, specifically $\mathcal{F}\{f(x)\}$ as described in Eq. (1), we can express $f_{\mathrm{S}}(x)$ as follows:

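One compact matrix form consistent with the surrounding definitions (reconstructed here) is:

$$f_{\mathrm{S}}(x) = B + W_{\text{out}}\left[\cos(W_{\text{in}}\,x) \,\|\, \sin(W_{\text{in}}\,x)\right] \tag{3}$$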

where $B \in \mathbb{R}^{d_y}$, $W_{\text{in}} \in \mathbb{R}^{N \times d_x}$, and $W_{\text{out}} \in \mathbb{R}^{d_y \times 2N}$ are learnable parameters, (I) follows from the fact that $a_n$ and $b_n$, computed via Eq. (2), are definite integrals, (II) and (III) follow from the equivalence of the matrix operations, and $[\cdot \,\|\, \cdot]$ and $[\cdot, \cdot]$ denote concatenation along the first and second dimensions, respectively.


To fully leverage the advantages of deep learning, we can stack the aforementioned network $f_{\mathrm{S}}(x)$ to form a deep network $f_{\mathrm{D}}(x)$, where the $i$-th layer, denoted as $l_i(x)$, retains the same structural design as $f_{\mathrm{S}}(x)$. Therefore, $f_{\mathrm{D}}(x)$ can be formulated as:

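With $L$ layers, the stacked network takes the form:

$$f_{\mathrm{D}}(x) = l_L \circ l_{L-1} \circ \cdots \circ l_1 \circ x \tag{4}$$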

where $l_1 \circ x$ denotes the application of the function $l_1$ to the input $x$, i.e., $l_1(x)$. However, we discover that directly stacking $f_{\mathrm{S}}(x)$ causes the primary parameters of the model $f_{\mathrm{D}}(x)$ to focus on learning the angular frequencies $(\omega_n = \frac{2\pi n}{T})$, thereby neglecting the learning of the Fourier coefficients ($a_n$ and $b_n$), as follows:

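Concretely, applying the final layer to the output of the preceding layers gives (a form reconstructed from the description that follows):

$$f_{\mathrm{D}}(x) = B^{L} + W_{\text{out}}^{L}\left[\cos\left(W_{\text{in}}^{L}(l_{1:L-1} \circ x)\right) \,\|\, \sin\left(W_{\text{in}}^{L}(l_{1:L-1} \circ x)\right)\right] \tag{5}$$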

where $l_{1:L-1} \circ x$ is defined as $l_{L-1} \circ l_{L-2} \circ \cdots \circ l_1 \circ x$, $W_{\text{in}}^{L}(l_{1:L-1} \circ x)$ is used to approximate the angular frequencies, and $W_{\text{out}}^{L}$ is used to approximate the Fourier coefficients. Therefore, the capacity of $f_{\mathrm{D}}(x)$ to fit the Fourier coefficients is independent of its depth, which is an undesirable outcome.


Figure 2: Illustrations of the MLP layer $\Phi(x)$ vs. the FAN layer $\phi(x)$.


To this end, we design FAN based on the following principles: 1) the capacity of FAN to represent the Fourier coefficients should be positively related to its depth; 2) the output of any hidden layer can be employed to model periodicity using Fourier Series through the subsequent layers. The first one enhances the expressive power of FAN for periodicity modeling by leveraging its depth, while the second one ensures that the features of FAN's intermediate layers are available to perform periodicity modeling.


Suppose we decouple $f_{\mathrm{S}}(x)$ as follows:

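That is, $f_{\mathrm{S}}$ can be written as the composition of an input transform and an output transform:

$$f_{\mathrm{S}}(x) = f_{\text{out}} \circ f_{\text{in}}(x) \tag{6}$$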

where

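Matching the matrix form of $f_{\mathrm{S}}(x)$, the two components can be written as:

$$f_{\text{in}}(x) = \left[\cos(W_{\text{in}}\,x) \,\|\, \sin(W_{\text{in}}\,x)\right] \tag{7}$$

$$f_{\text{out}}(x) = B + W_{\text{out}}\,x \tag{8}$$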

To satisfy both principles, the intermediate layers of FAN must employ $f_{\text{in}}$ and $f_{\text{out}}$ simultaneously, rather than applying them sequentially.


Finally, FAN is designed on this basis, with the FAN layer $\phi(x)$ defined as below:

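Combining a periodic branch with a standard non-linear branch in parallel, the FAN layer takes the form:

$$\phi(x) = \left[\cos(W_p\,x) \,\|\, \sin(W_p\,x) \,\|\, \sigma(B_{\bar{p}} + W_{\bar{p}}\,x)\right] \tag{9}$$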

where $W_p \in \mathbb{R}^{d_x \times d_p}$, $W_{\bar{p}} \in \mathbb{R}^{d_x \times d_{\bar{p}}}$, and $B_{\bar{p}} \in \mathbb{R}^{d_{\bar{p}}}$ are learnable parameters (with the hyper-parameters $d_p$ and $d_{\bar{p}}$ indicating the second dimensions of $W_p$ and $W_{\bar{p}}$, respectively), the layer output $\phi(x) \in \mathbb{R}^{2d_p + d_{\bar{p}}}$, and $\sigma$ denotes the activation function, which can further enhance the expressive power for periodicity modeling.

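The forward pass of a single FAN layer can be sketched in a few lines of NumPy (an illustration under stated assumptions: inputs are treated as row vectors, so the linear maps are written as `x @ W`, and a tanh-approximate GELU stands in for the generic activation $\sigma$):

```python
import numpy as np

def fan_layer(x, W_p, W_pbar, B_pbar):
    """One FAN layer: periodic features cos/sin of a linear projection,
    concatenated with a standard non-periodic branch sigma(B + x @ W).
    With W_p of shape (d_x, d_p) and W_pbar of shape (d_x, d_pbar),
    the output dimension is 2*d_p + d_pbar."""
    p = x @ W_p                  # periodic pre-activation, shape (batch, d_p)
    q = x @ W_pbar + B_pbar      # non-periodic branch, shape (batch, d_pbar)
    # tanh-approximate GELU, used here as an illustrative choice of sigma
    g = 0.5 * q * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (q + 0.044715 * q**3)))
    return np.concatenate([np.cos(p), np.sin(p), g], axis=-1)

rng = np.random.default_rng(0)
d_x, d_p, d_pbar = 8, 3, 4
x = rng.normal(size=(5, d_x))
W_p = rng.normal(size=(d_x, d_p))
W_pbar = rng.normal(size=(d_x, d_pbar))
B_pbar = rng.normal(size=(d_pbar,))

out = fan_layer(x, W_p, W_pbar, B_pbar)
print(out.shape)  # (5, 10), i.e. (batch, 2*d_p + d_pbar)
```

Note how the first $2d_p$ output features are bounded in $[-1, 1]$ by construction, which is what bakes periodic structure into the layer.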

The entire FAN is defined as the stacking of the FAN layer $\phi(x)$:

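With $L$ FAN layers:

$$\operatorname{FAN}(x) = \phi_L \circ \phi_{L-1} \circ \cdots \circ \phi_1 \circ x \tag{10}$$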

where

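each intermediate layer follows the FAN layer definition, while the final layer is reconstructed here, under the natural assumption, as a plain linear output transformation:

$$\phi_l(x) = \left[\cos(W_p^{l}\,x) \,\|\, \sin(W_p^{l}\,x) \,\|\, \sigma(B_{\bar{p}}^{l} + W_{\bar{p}}^{l}\,x)\right],\ 1 \le l \le L-1, \qquad \phi_L(x) = B^{L} + W^{L}\,x \tag{11}$$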

Illustrations of the MLP layer $\Phi(x)$ and the FAN layer $\phi(x)$ are shown in Figure 2. Note that the FAN layer $\phi(x)$, computed via Eq. (9), can seamlessly replace the MLP layer $\Phi(x)$, computed via Eq. (12), in various models with fewer parameters and FLOPs. The numbers of parameters and FLOPs of the FAN layer compared to the MLP layer are presented in Table 1.

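For reference, the MLP layer being replaced is the standard affine-plus-activation form:

$$\Phi(x) = \sigma(B + W\,x) \tag{12}$$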
