The attention mechanism, as its name suggests, works much like human attention. Humans quickly scan a whole passage to locate the regions that deserve focus, the so-called focus of attention, and then devote more attentional resources to those regions to extract finer detail about the target of interest while suppressing irrelevant information. This mechanism greatly improves our ability to pick out high-value information from large amounts of data and is a survival mechanism shaped by long-term evolution. The attention mechanism in deep learning is essentially the same kind of selective mechanism: its core goal is to select, from a large amount of information, the pieces that matter most for the current task. Attention is now widely used in natural language processing, image recognition, speech recognition, and many other deep learning tasks, and it is one of the core techniques in deep learning most worth understanding in depth.

Abstract

The dominant sequence transduction models are based on an encoder-decoder architecture, which in turn relies on complex recurrent neural networks (RNNs) or convolutional neural networks (CNNs). To obtain better performance, an attention mechanism can additionally be layered on top of the encoder-decoder architecture. In this paper, however, we break with this convention: instead of relying on recurrence or convolution, we propose a simpler network architecture for sequence transduction based solely on attention mechanisms, the Transformer. To validate the model, we report experiments on two machine translation tasks. The results show that the attention-based model produces better translations and, because it can be executed in parallel, requires substantially less training time. On the WMT 2014 English-to-German translation task, the model achieves 28.4 BLEU, more than 2 points above the existing best score. On the WMT 2014 English-to-French translation task, the model, trained for only 3.5 days on eight GPUs, sets a new single-model state-of-the-art BLEU score of 41.8, at a small fraction of the hardware and time cost of the previous best sequence transduction models. Beyond translation, the Transformer generalizes well: it successfully performs English constituency parsing with both large and limited training data.

1. Introduction

In sequence modeling and transduction tasks, recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and gated recurrent neural networks are without doubt the foundation of many state-of-the-art deep learning methods, including language modeling and machine translation. Since then, numerous efforts have continued to push the boundaries of recurrent language models and encoder-decoder architectures.
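For readers new to the operation the abstract keeps referring to, the Transformer's central building block is scaled dot-product attention, Attention(Q, K, V) = softmax(Q·Kᵀ/√d_k)·V, which is defined in the full paper rather than in this excerpt. Below is a minimal NumPy sketch of that computation; the function name and toy shapes are illustrative assumptions, not the authors' released TensorFlow code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch of scaled dot-product attention.

    Q: (seq_len_q, d_k) query matrix
    K: (seq_len_k, d_k) key matrix
    V: (seq_len_k, d_v) value matrix
    Returns: (seq_len_q, d_v) weighted sum of value vectors.
    """
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    # to keep the softmax away from its saturated regions.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a convex combination of the value vectors.
    return weights @ V

# Toy usage: 4 query positions attending over 6 key/value positions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```

Because every query attends to every key in one matrix product, the whole sequence can be processed in parallel, which is the source of the training-time savings the abstract highlights.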
To evaluate whether the Transformer can generalize to other natural language processing tasks, we also ran experiments on English constituency parsing. This task poses a specific challenge: the output is subject to strong structural constraints and is significantly longer than the input. Moreover, even RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes (see reference [37]).
The results are shown in Table 4. Despite the lack of task-specific tuning, our model performs remarkably well, outperforming all previously reported models with the exception of the Recurrent Neural Network Grammar.
In contrast to RNN sequence-to-sequence models, the Transformer outperforms the BerkeleyParser even when trained only on the WSJ training set of 40K sentences.
7. Conclusion
In this paper, we presented the Transformer, the first sequence transduction model to rely entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-head attention.
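To make the conclusion's key architectural claim concrete, the self-contained sketch below shows how multi-head self-attention can stand in for a recurrent layer: the input is projected into several heads, scaled dot-product attention runs in each head independently, and the heads are concatenated and projected back. The shapes and the toy initialization are illustrative assumptions, not the paper's exact implementation details.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Sketch of multi-head self-attention over one sequence.

    X:  (seq_len, d_model) input sequence
    Wq, Wk, Wv: (d_model, d_model) query/key/value projections
    Wo: (d_model, d_model) output projection
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    # Project the inputs and split into heads: (num_heads, seq_len, d_head).
    def split(W):
        return (X @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(Wq), split(Wk), split(Wv)
    # Scaled dot-product attention within each head, all heads in parallel.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ V  # (num_heads, seq_len, d_head)
    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Toy usage: a sequence of 5 tokens, model width 32, 4 heads.
rng = np.random.default_rng(0)
d_model = 32
X = rng.normal(size=(5, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
print(multi_head_self_attention(X, *Ws, num_heads=4).shape)  # (5, 32)
```

Unlike a recurrent layer, nothing here depends on processing the tokens one after another, which is why such layers parallelize so well during training.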
For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both the WMT 2014 English-to-German and the WMT 2014 English-to-French translation tasks, we achieved a new state of the art; on the former task, our best Transformer model even outperforms all previously reported ensembles.
Based on these results, we are excited about the future of attention-based models and plan to apply them to other natural language processing tasks. We plan to extend the Transformer to problems involving input and output modalities other than text. We also plan to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio, and video. Making generation less sequential is another of our research goals.
The code used to train and evaluate our models is available at github.com/tensorflow/….
Acknowledgements: We are grateful to Nal Kalchbrenner and Stephan Gouws; their fruitful suggestions, guidance, and encouragement during the experiments were invaluable to us.
References
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[3] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906, 2017.
[4] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016.
[5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
[6] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[7] Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
[8] Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Proc. of NAACL, 2016.
[9] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.
[10] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[12] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[14] Zhongqiang Huang and Mary Harper. Self-training PCFG grammars with latent annotations across languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 832–841. ACL, August 2009.
[15] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
[16] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural Information Processing Systems, (NIPS), 2016.
[17] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference on Learning Representations (ICLR), 2016.
[18] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099v2, 2017.
[19] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks. In International Conference on Learning Representations, 2017.
[20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[21] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. arXiv preprint arXiv:1703.10722, 2017.
[22] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
[23] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114, 2015.
[24] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[25] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[26] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159. ACL, June 2006.
[27] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016.
[28] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.
[29] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 433–440. ACL, July 2006.
[30] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.
[31] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
[32] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[33] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[34] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015.
[35] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
[37] Vinyals & Kaiser, Koo, Petrov, Sutskever, and Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, 2015.
[38] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[39] Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with fast-forward connections for neural machine translation. CoRR, abs/1606.04199, 2016.
[40] Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the ACL (Volume 1: Long Papers), pages 434–443. ACL, August 2013.