RNNS ARE NOT TRANSFORMERS (YET): The Key Bottleneck on In-context Retrieval
Kaiyue Wen Xingyu Dang Kaifeng Lyu
Institute for Interdisciplinary Information Sciences, Tsinghua University
Department of Computer Science & Princeton Language and Intelligence, Princeton University
{wenky20,dangxy20}@mails.tsinghua.edu.cn
ABSTRACT
This paper investigates the gap in representation powers of Recurrent Neural Networks (RNNs) and Transformers in the context of solving algorithmic problems. We focus on understanding whether RNNs, known for their memory efficiency in handling long sequences, can match the performance of Transformers, particularly when enhanced with Chain-of-Thought (CoT) prompting. Our theoretical analysis reveals that CoT improves RNNs but is insufficient to close the gap with Transformers. A key bottleneck lies in the inability of RNNs to perfectly retrieve information from the context, even with CoT: for several tasks that explicitly or implicitly require this capability, such as associative recall and determining if a graph is a tree, we prove that RNNs are not expressive enough to solve the tasks while Transformers can solve them with ease. Conversely, we prove that adopting techniques to enhance the in-context retrieval capability of RNNs, including Retrieval-Augmented Generation (RAG) and adding a single Transformer layer, can elevate RNNs to be capable of solving all polynomial-time solvable problems with CoT, hence closing the representation gap with Transformers.
1 Introduction
Transformer models (Vaswani et al., 2017) have become the dominant choice of the backbone for large language models (LLMs). The core component of Transformers is its self-attention module, which allows the model to route information densely across the entire sequence. However, this design leads to high inference costs for modeling long sequences, including a memory cost that is at least linear in the sequence length due to the need for maintaining intermediate attention keys and values for each token, and a time cost quadratic in the sequence length for computing the attention score for each pair of tokens.
Recently, Recurrent Neural Networks (RNNs) have been an increasingly popular choice in sequence modeling tasks due to their ability to maintain a memory size constant in sequence length during inference, thus being more memory efficient than Transformers. Katharopoulos et al. (2020) showed that Transformers with a special type of kernelized linear attention can be expressed as RNNs. Gu et al. (2022) took a different path to design RNNs by structuring latent states as State Space Models (SSMs) from control theory. These ideas have led to a series of development of modern RNNs, including RWKV (Peng et al., 2023), RetNet (Sun et al., 2023), and Mamba (Gu & Dao, 2023). Most notably, Mamba can achieve competitive performance with Transformers on several sequence modeling tasks with linear time and constant memory complexity in sequence length.
Can RNNs replace Transformers yet? The rise of these modern RNNs has led to an interest in understanding their limitations. A recent work by Arora et al. (2023) showed that an important family of RNNs, input-independent gating SSMs, are empirically inferior to Transformers in a task that has a long history in artificial intelligence, associative recall (AR) (Willshaw et al., 1969; Hopfield, 1982; Hinton & Anderson, 2014): Given a series of key-value pairs as a string, the model is required to recall the value given a key. On the theory side, Sanford et al. (2023) and Jelassi et al.
*Equal contribution
Corresponding author
Code is available at github.com/dangxingyu/…
Figure 1: Hierarchy of Representation Power. While an RNN with chain-of-thought (CoT) and o(n)-bit memory provably has strictly stronger representation power than an RNN without CoT under mild complexity assumptions (Theorem 4.1), it is still exponentially weaker than a Transformer with CoT in representing solutions to algorithmic problems (Theorem 4.7). We proceed to show that the incapability of RNNs in in-context retrieval is the root cause of the gap and propose two forms of In-context Retrieval-Augmented Generation (In-context RAG) to close the gap by illustrating their power to simulate any polynomial-time Turing machine (Theorems 5.4 and 5.8).
(2024) demonstrated that constant-memory RNNs do not have sufficient representation power to solve the tasks of averaging a given subset of input vectors (q-sparse averaging) and repeating the input sequence (copying), respectively, while there exist shallow Transformers that can solve these tasks.
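For concreteness, the associative recall task can be sketched as a small synthetic benchmark; the vocabulary, prompt layout, and helper names below are our own illustrative choices, not the exact format used in the works cited above:

```python
import random

def make_ar_instance(num_pairs, vocab, rng):
    """Build an associative-recall prompt: key-value pairs followed by a query key."""
    keys = rng.sample(vocab, num_pairs)          # distinct keys
    values = [rng.choice(vocab) for _ in keys]   # arbitrary values
    query = rng.choice(keys)
    prompt = [tok for kv in zip(keys, values) for tok in kv] + [query]
    answer = values[keys.index(query)]
    return prompt, answer

def recall(prompt):
    """Reference solver: scan the key-value pairs and look up the queried key."""
    *pairs, query = prompt
    mapping = dict(zip(pairs[0::2], pairs[1::2]))
    return mapping[query]

rng = random.Random(0)
vocab = [f"t{i}" for i in range(100)]
prompt, answer = make_ar_instance(10, vocab, rng)
assert recall(prompt) == answer
```

The reference solver keeps the whole key-value table in memory; the point of the lower bounds discussed here is that a model whose state is much smaller than this table cannot answer the query reliably.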
However, the above results do not exclude the possibility that enhancing RNNs with additional prompting techniques or minor architectural changes may close the gap with Transformers. In fact, Transformers themselves are not perfect either and may need additional techniques at inference time to perform well on certain tasks. As a notable example, Chain-of-Thought (CoT) (Wei et al., 2023), a prompting technique that asks the model to generate a series of intermediate tokens before giving the final answer, has been known to be crucial for Transformers to perform well on tasks that require mathematical or algorithmic reasoning. Feng et al. (2023); Li et al. (2024) explained this from the perspective of representation power: Transformers alone do not have sufficient representation power to solve problems beyond a certain circuit complexity class, but with CoT, they can even simulate any polynomial-time Turing machine.
The effectiveness of CoT on Transformers naturally leads to the following question:
Can similar enhancements, such as adopting CoT, improve RNNs so that they can be on par with Transformers?
Our Contributions. This paper answers the above question from a theoretical perspective by examining various ways to close the gap in the representation powers of RNNs and Transformers on algorithmic problems. Through a series of lower and upper bound results, we show that CoT improves the representation power of RNNs, but CoT alone is not enough to overcome a key bottleneck of RNNs: their inability to retrieve information from the context, which we call in-context retrieval for short. We further show that addressing this in-context retrieval bottleneck is sufficient to close the gap: RNNs can solve all polynomial-time solvable problems if they adopt techniques to enhance their in-context retrieval capability, including Retrieval-Augmented Generation (RAG) and appending a single Transformer layer. Our main contributions are listed as follows:
1. CoT improves RNNs but cannot close the representation gap with Transformers. (Section 4)
- On the positive side, we prove that CoT makes RNNs strictly more expressive under mild assumptions from circuit complexity.
- On the negative side, we show that adopting CoT is not enough to close the representation gap between RNNs and Transformers: the memory efficiency of RNNs fundamentally limits their ability to perform in-context retrieval, even with CoT. We make this point concrete by proving that RNNs with CoT cannot solve a set of fundamental algorithmic problems that directly ask for in-context retrieval, including associative recall. We further show that in-context retrieval can be implicitly required in tasks that appear unrelated, by proving the inability of RNNs to solve the classic problem of determining whether a graph is a tree (IsTree).
- On the other hand, we prove that Transformers have the representation power to solve many of the above tasks with ease, including IsTree. Moreover, Transformers with CoT can even simulate RNNs with CoT efficiently, with only a small multiplicative factor in the number of parameters.
Technically, the key insight for the first separation is that an RNN without CoT is a shallow circuit, while an RNN with CoT can be an exponentially deeper circuit. On the negative side, RNNs are so memory efficient that they trigger streaming lower bounds (Sanford et al., 2023), especially for problems that require in-context retrieval.
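For reference, the IsTree problem mentioned in the contributions admits a simple offline check: a graph on n nodes is a tree iff it has exactly n - 1 edges and no edge closes a cycle. The union-find sketch below is one standard implementation, not the paper's construction:

```python
def is_tree(n, edges):
    """A graph on n nodes is a tree iff it has n - 1 edges and is acyclic."""
    if len(edges) != n - 1:
        return False
    parent = list(range(n))  # union-find forest over the nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:        # this edge would close a cycle
            return False
        parent[ru] = rv     # merge the two components
    return True

assert is_tree(4, [(0, 1), (1, 2), (1, 3)])
assert not is_tree(4, [(0, 1), (1, 2), (2, 0)])  # contains a cycle
```

Note that union-find keeps state for all n nodes; this linear-in-n bookkeeping is exactly what a fixed-memory RNN lacks, which is the intuition behind the streaming lower bound for IsTree.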
Figure 2: We train RNNs (Mamba) and Transformers (LLaMA 2, Touvron et al. (2023)) with a frozen word embedding and decoding head at three different model sizes (0.5M, 1M, 2M) on IsTree with three different graph sizes ( , 32, 64) under three different setups. Vanilla means the model directly predicts the label. CoT means the model generates a chain-of-thought process based on DFS (see Algorithm 1) before prediction. Retrieval means the model generates a chain of search queries and reasoning before prediction (see Algorithm 2). We observe that (1) neither Transformers nor RNNs can solve IsTree without a chain of thought; (2) the performance of RNNs with chain-of-thought decays quickly as the number of nodes increases, consistent with our theory; (3) all models reach almost perfect accuracy when enhanced with retrieval.
2. Enhancing the in-context retrieval capability of RNNs can close the representation gap. (Section 5)
- We prove that allowing RNNs to invoke function calls to perform a certain primitive of in-context retrieval is sufficient to boost their representation power to solve all polynomial-time solvable problems with CoT, hence closing the representation gap between RNNs and Transformers.
- Alternatively, as one Transformer layer is sufficient to perform many in-context retrieval operations, we prove that implicitly enhancing the in-context retrieval capability of RNNs by adding just one Transformer layer at the end of the architecture is also sufficient to close the representation gap.
Technically, the key insight for the above upper bounds is that RNN can focus on the local reasoning steps and use the in-context retrieval module to adaptively fetch the relevant information from the context.
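To build intuition for why a single attention layer can serve as the in-context retrieval primitive, the following sketch (our simplification, with hand-picked one-hot keys and an illustrative sharpness parameter `scale`) shows soft attention approximating an exact key lookup:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_retrieve(query, keys, values, scale=20.0):
    """One attention head as in-context retrieval: with a sharp enough score
    scale, the softmax weights concentrate on the key that matches the query,
    so the output is (approximately) the associated value."""
    scores = softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(scores, values)) for d in range(dim)]

# one-hot keys; querying key 1 should return (approximately) values[1]
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[5.0, 0.0], [7.0, 1.0], [9.0, 2.0]]
out = attention_retrieve([0.0, 1.0, 0.0], keys, values)
```

As the score scale grows, the attention weights approach a hard lookup over the matching key, which is the retrieval behavior the single appended Transformer layer provides in the second construction above.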
2 Related Works
State Space Machines and Linear Transformers. There has been a recent surge of interest in state space machines and (kernelized) linear transformers (Gu et al., 2022; Katharopoulos et al., 2020; Peng et al., 2023; Sun et al., 2023; Gu & Dao, 2023; Fu et al., 2023; Poli et al., 2023; Luo et al., 2021; Peng et al., 2021; Wang et al., 2020), a class of models that combine the parallelizability of the Transformer with the memory efficiency of the RNN. These models admit both a parallel form and a recurrent form, and can use the former for fast parallelizable training and the latter for memory-efficient inference. However, these models remain empirically inferior to the Transformer in terms of performance. Our work investigates the reasons behind this gap and proposes to close it by enhancing the in-context retrieval capability.
Chain of Thought (CoT). Chain of thought (Wei et al., 2023; Nye et al., 2021; Kojima et al., 2023; Wang & Zhou, 2024) is an augmentation to the Transformer that allows it to solve more complex reasoning tasks by generating a reasoning process before outputting the answer. It has been shown that Transformers with CoT provably have more expressive power than the original Transformer without CoT (Feng et al., 2023; Li et al., 2024). However, the expressive power of RNNs with CoT has not yet been systematically studied. Theorem F.1 in Feng et al. (2023) shows that RNNs cannot output a particular format of CoT for evaluating arithmetic expressions and solving linear equations, while Transformers with the same number of parameters can. Concurrent work (Yang et al., 2024) discovers that linear Transformers, a special class of RNNs, are not able to solve some dynamic programming problems with CoT unless the number of parameters grows with the length of the input. One high-level message of our work is similar to theirs: RNNs have limited representation power to perform reasoning with CoT. However, we show that such limitations are not specific to the output format or architecture, and we apply tools from streaming complexity to prove lower bounds on a broader range of tasks and memory-efficient architectures.
Streaming Algorithms. Our lower bounds leverage techniques from streaming algorithms. Streaming algorithms make a constant number of passes (typically just one) over the input and use sublinear space, hence including RNNs with a fixed state size as a special case. Work on streaming algorithms dates back to the 1980s (Munro & Paterson, 1980) and was formalized and popularized in the 1990s (Alon et al., 1996) due to the need to process large data streams. The lower bounds in our work are a direct application of streaming techniques to the study of RNNs, and we mainly consider the streaming complexity of (1) indexing the input (Munro & Paterson, 1980) and (2) determining whether the input is a tree (Henzinger et al., 1998).
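The connection to RNNs can be made explicit: a fixed-state RNN performs one left-to-right pass over the input and keeps only its hidden state between tokens, which is exactly the one-pass streaming model. A minimal sketch (our own illustration, not the paper's formalism):

```python
def stream(tokens, step, state):
    """Any fixed-size-state RNN is a one-pass streaming algorithm: it reads
    each token exactly once and carries only `state` between tokens."""
    for tok in tokens:
        state = step(state, tok)
    return state

# Parity fits in O(1) state, so a single streaming pass solves it:
bits = [1, 0, 1, 1, 0]
parity = stream(bits, lambda s, b: s ^ b, 0)
```

By contrast, the indexing task (output the i-th token, where i is revealed only after the stream ends) forces the state to encode the entire prefix, giving the linear-memory lower bound invoked above.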
Retrieval Augmented Generation. Our work proposes to use retrieval augmentation to close the representation gap between RNNs and Transformers. This is consistent with the recent trend of retrieval augmented generation (Guu et al., 2020; Borgeaud et al., 2022; Rubin & Berant, 2023). Empirically, retrieval augmented generation has been shown to improve the performance of recurrent models in various tasks (Kuratov et al., 2024; Akyürek et al., 2024) and our work provides a theoretical foundation for this phenomenon. Our work also shows that an attention layer can be used to simulate the retrieval process, which is consistent with the finding that attention can improve the performance of RNNs (Vaswani et al., 2017; Arora et al., 2023; Park et al., 2024; Peng et al., 2023; Hao et al., 2019). It has also been shown empirically that attention can be used to simulate complex retrieval process (Jiang et al., 2022).
Comparison Between Transformers and RNNs (Without CoT). A line of works focused on the comparison between RNNs and Transformers in terms of recognizing or generating formal languages (Bhattamishra et al., 2020; Hahn, 2020; Merrill et al., 2022). These works show that the lack of recurrent structure in Transformers makes them fail to recognize some formal languages that RNNs can recognize. However, Liu et al. (2023); Yao et al. (2023); Hao et al. (2022) show that such limitation can be mitigated when we consider bounded length of input or bounded grammar depth. Our work differs from these works in that we consider the expressive power of RNNs and Transformers with CoT and show that in this case, the gap between RNNs and Transformers is one-sided (Theorem 4.8).
Prior work (Arora et al., 2023) has shown that input-independent gating SSMs are inferior to Transformers in the task called associative recall (Willshaw et al., 1969; Hopfield, 1982; Hinton & Anderson, 2014). The task requires the model to recall a previously seen pattern given a partial input. They show that input-dependent gating SSMs have better performance in associative recall and also propose a hybrid architecture that combines input-independent state space machines with attention to achieve better performance. Our work differs from this work in the following ways: (1) Our work studies associative recall from a theoretical perspective and proves formal lower bounds on the memory size of RNNs necessary for solving associative recall and other retrieval tasks; (2) We also study hybrid architectures, but we provide a proof that appending a single Transformer layer to RNNs can make them expressive enough; (3) Our theory applies not only to input-independent gating SSMs but also to all RNNs with o(n)-bit memory.
Prior work (Jelassi et al., 2024) proves a representation gap between RNNs and Transformers in repeating a long sequence, which can be seen as a retrieval task. They show that RNNs have difficulty performing the task due to their limited memory. Our work further proves that RNNs are limited in solving many other retrieval tasks, even with CoT. Technically, a key ingredient in their proof is a counting argument on the output sequence to show a limited memory size is not enough to produce too many different output sequences, but our proof can handle retrieval tasks that only require outputting a single token.
Notably, Sanford et al. (2023) apply communication complexity to prove circuit-size or memory-size lower bounds for RNNs and Transformers on the task of sparse averaging. Sanford et al. (2024) extend this technique to another task, a generalization of the associative recall task. Our technique is similar to theirs since our proof is also based on communication complexity. But we consider a broader range of tasks, including seemingly irrelevant reasoning tasks such as IsTree, and further explore various ways to close the representation gap.
Representation Theory of RNNs. Another line of works (Li et al., 2021, 2022; Alberti et al., 2023) studies the universal approximation power of RNNs. They show that the approximation power of linear RNNs is upper-bounded by the dimension of the hidden states. Their results are consistent with our findings at a high level but are not directly comparable, because we consider finite-precision compute models with the assistance of CoT or In-context RAG.
3 Preliminaries
We introduce the definitions that are necessary for understanding our results and defer other definitions to Appendix A.