
A Brief History of Attention

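From 2015 to 2017, once attention had been proposed, it quickly became standard equipment for NLP models, and attention variants of every kind appeared everywhere. It is widely used not only in machine translation but also in text summarization, reading comprehension (Q&A), and text classification. The papers that laid the foundation are: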

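ICLR 2015, "Neural Machine Translation by Jointly Learning to Align and Translate": generally credited as the first paper to propose attention. It introduced the classic attention structure (additive attention, also called Bahdanau attention) for machine translation, and it visualized the source-to-target alignment that attention produces, which shows what the deep model has actually learned. People were convinced.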

EMNLP 2015, "Effective Approaches to Attention-based Neural Machine Translation": builds on basic attention and explores variations, trying different score functions and different alignment functions. The attention used in this paper (multiplicative attention, also called Luong attention) is also widely used.

ICML 2015, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention": applies attention to image captioning and introduces the notions of hard and soft attention. The story is complete and matches intuition; once again, people were convinced.

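Building on these foundational papers, attention branched out in 2016 and 2017 and succeeded nearly everywhere: Hierarchical Attention, Attention over Attention, multi-step Attention, and many more variants, some with well-known names and some without.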

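From 2017 to the present is the era of the transformer. Thanks to the transformer's strong representation learning ability, NLP saw a new burst of activity, with BERT and GPT leading the results on all kinds of NLP tasks. The foundational paper is without doubt: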

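NIPS 2017, "Attention Is All You Need": proposes the transformer architecture (involving self-attention and multi-head attention). Transformer-based networks can entirely replace sequence-aligned recurrent networks, enable the parallelization that RNNs cannot achieve, and model long-range semantic dependencies more accurately. (Reportedly, the 2019 paper "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context" captures even longer dependencies by combining segment-level recurrence with relative positional encoding.)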

How Attention Works

Problem definition (see the figure above):

input: $(y_1, y_2, \cdots, y_n)$ and a context vector $c$

output: $Z$

Every attention model consists of three parts:

score function: measures the similarity between the context vector and the current input vector, to find which inputs to focus on in the current context

$e_i = a(y_i, c)$

where $a$ is the score function

alignment function: computes the attention weights, usually normalizing with softmax

$\alpha_i = \mathrm{softmax}(e)_i = \frac{\exp(e_i)}{\sum_{j=1}^{n} \exp(e_j)}$

generate context vector function: produces the output vector from the attention weights

$Z = \sum_{i=1}^n \alpha_i y_i$
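The three parts can be put together in a few lines. The following is a minimal NumPy sketch, not a reference implementation: the names `attend` and `softmax`, the toy inputs, and the choice of a plain dot product as the score function $a$ are all assumptions made for illustration.

```python
import numpy as np

def softmax(e):
    """Numerically stable softmax over a 1-D array of scores."""
    e = e - np.max(e)
    w = np.exp(e)
    return w / w.sum()

def attend(ys, c, score):
    """Apply the three steps: score, align, aggregate.

    ys:    (n, d) array, one input vector y_i per row
    c:     (d,) context vector
    score: callable a(y_i, c) returning a scalar
    """
    e = np.array([score(y, c) for y in ys])  # score function: e_i = a(y_i, c)
    alpha = softmax(e)                       # alignment function: attention weights
    z = alpha @ ys                           # context vector: Z = sum_i alpha_i * y_i
    return z, alpha

# Toy inputs (assumed for illustration): three 2-D input vectors.
ys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
c = np.array([1.0, 0.0])
z, alpha = attend(ys, c, score=np.dot)
```

Here the first and third inputs get the same score against `c`, so they receive equal weight, and the second input gets less.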

Attention can also be viewed from a $(Q,K,V)$ perspective, which I believe (though I am not certain) started with the self-attention proposed in "Attention Is All You Need".

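The $(Q,K,V)$ view: suppose the input is a query q, and the memory stores the needed context as (k, v) pairs. For example, in a Q&A task, k is a question, v is its answer, and q is a new question. We check which k in the memory q most resembles, then follow that pattern: we compose the answer to the current question from the v values attached to the similar k.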
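The memory lookup described here can be sketched as a soft key-value retrieval. This is an illustrative NumPy sketch under assumed names (`kv_lookup`) and toy data; the dot product as the similarity measure is also an assumption.

```python
import numpy as np

def kv_lookup(q, K, V):
    """Soft lookup of query q in a memory of (key, value) rows."""
    scores = K @ q                           # similarity of q to every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over memory slots
    return weights @ V, weights              # blend the values by similarity

# Toy memory (assumed): two stored questions (keys) with scalar answers (values).
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0], [20.0]])
q = np.array([5.0, 0.0])                     # new question, close to the first key
answer, weights = kv_lookup(q, K, V)
```

Because `q` matches the first key much more strongly, the result is dominated by the first value, with a small contribution from the second.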

Types of Attention

Reference: a very good summary of attention

Reference: well written, though the layout is not great

The attention variants you commonly hear about differ mainly in the score function, and secondarily in the generate context vector function.

Change in the generate context vector function

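Hard and soft attention were introduced in "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention". The most intuitive reading is that hard attention is a random sample: the set sampled from is the set of input vectors, and the sampling distribution is the attention weights produced by the alignment function.
The output of hard attention is therefore one specific input vector. Soft attention is a weighted sum: the set summed over is the set of input vectors, and the weights are the attention weights produced by the alignment function.
Of the two, soft attention is the more common (every attention discussed below falls in this category), because it is differentiable and can be embedded directly in a model and trained with it.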
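The contrast between sampling and weighted summing can be shown directly. The sketch below assumes the attention weights `alpha` have already been computed; the function names and the toy values are hypothetical.

```python
import numpy as np

def soft_attention(ys, alpha):
    """Weighted sum of all input vectors; differentiable in alpha."""
    return alpha @ ys

def hard_attention(ys, alpha, rng):
    """Sample one input vector, using alpha as the sampling distribution."""
    i = rng.choice(len(ys), p=alpha)
    return ys[i], i

rng = np.random.default_rng(0)
ys = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
alpha = np.array([0.2, 0.5, 0.3])            # assumed attention weights

z_soft = soft_attention(ys, alpha)           # a blend of all three rows
z_hard, picked = hard_attention(ys, alpha, rng)  # exactly one of the rows
```

The sampling step in `hard_attention` is not differentiable, which is why soft attention is the one that can be trained end to end with ordinary backpropagation.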

Change in the alignment function

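Within soft attention there is a further split into global and local attention (from "Effective Approaches to Attention-based Neural Machine Translation").
Intuitively, the two differ in the set that the weighted sum runs over. Global attention uses all input vectors as the weighted set, with softmax as the alignment function.
Local attention admits only some of the input vectors into the pool. The rationale for local attention is to reduce noise and narrow the region of focus further.
Global attention is still the more common choice, because the extra complexity of local attention does not seem to bring much gain.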
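One way to picture the difference is as a mask over the scores before softmax. The sketch below is a deliberate simplification and not the exact formulation from the paper: it assumes the window center is given and uses a hard window, and the names `global_weights`, `local_weights`, `center`, and `half_width` are hypothetical.

```python
import numpy as np

def masked_softmax(e, mask):
    """Softmax over the positions where mask is True; others get weight 0."""
    e = np.where(mask, e, -np.inf)
    e = e - e.max()
    w = np.exp(e)
    return w / w.sum()

def global_weights(e):
    """Global attention: every input position enters the softmax."""
    return masked_softmax(e, np.ones(len(e), dtype=bool))

def local_weights(e, center, half_width):
    """Local attention (simplified): only a window around `center` is kept."""
    idx = np.arange(len(e))
    mask = np.abs(idx - center) <= half_width
    return masked_softmax(e, mask)

e = np.array([0.1, 2.0, 0.3, 1.5, 0.2])      # assumed scores for five inputs
glob = global_weights(e)
loc = local_weights(e, center=3, half_width=1)
```

With the window centered on position 3, the high score at position 1 is excluded entirely, so the local weights concentrate on positions 2 to 4.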

Change in the score function

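The score function is where the most variation occurs. In essence it measures the similarity of two vectors.
If the two vectors live in the same space, the dot product works (or the scaled dot product; the scaling keeps the scores small so that softmax does not saturate, its gradients stay larger, and learning is faster). It is simple and effective.
If they do not live in the same space, some transformation is needed (a transformation can also be used when they do share a space). Additive attention applies a linear transformation to each input and then adds the results; multiplicative attention transforms directly through a matrix multiplication.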
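The four score functions mentioned here can be written side by side. This is a sketch under assumptions: the parameter names (`W_y`, `W_c`, `v`, `W`), the dimensions, and the random weights are made up for illustration, and the `tanh` plus projection vector `v` in the additive form follow the usual Bahdanau-style formulation rather than anything stated above.

```python
import numpy as np

def dot_score(y, c):
    """Dot product; y and c must share a dimension."""
    return y @ c

def scaled_dot_score(y, c):
    """Dot product divided by sqrt(d) to keep the scores small."""
    return (y @ c) / np.sqrt(len(c))

def additive_score(y, c, W_y, W_c, v):
    """Linearly transform each input, add, squash with tanh, project to a scalar."""
    return v @ np.tanh(W_y @ y + W_c @ c)

def multiplicative_score(y, c, W):
    """Bilinear form y^T W c; W maps c into the space of y."""
    return y @ W @ c

rng = np.random.default_rng(0)
d_y, d_c, d_h = 4, 3, 5                      # hypothetical sizes
y = rng.normal(size=d_y)
c = rng.normal(size=d_c)                     # lives in a different space than y
c_same = rng.normal(size=d_y)                # lives in the same space as y

s_dot = dot_score(y, c_same)
s_scaled = scaled_dot_score(y, c_same)
s_add = additive_score(y, c,
                       W_y=rng.normal(size=(d_h, d_y)),
                       W_c=rng.normal(size=(d_h, d_c)),
                       v=rng.normal(size=d_h))
s_mul = multiplicative_score(y, c, W=rng.normal(size=(d_y, d_c)))
```

Note that the dot and scaled dot forms need no parameters but require matching dimensions, while the additive and multiplicative forms carry learned weights that let `y` and `c` have different sizes.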

Bahdanau Attention & Luong Attention

These two are the foundational works for attention as a whole. TensorFlow implements APIs for both kinds of attention. See the figure.

Self Attention & Multi-head Attention

Reference: a detailed introduction to self attention

There are many introductions to this, so I will skip it.

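Attention based on the QKV model was introduced above. A few points:

Q, K, and V are all linear maps of the input x.
The score function is the scaled dot product.
Multi-head attention concatenates the outputs z of the individual heads and applies a linear transformation to get the final output z.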
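These three points map onto code fairly directly. The following NumPy sketch is illustrative only: it omits biases, masking, and dropout, uses random matrices in place of learned weights, and the function name and sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    w = np.exp(x)
    return w / w.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """x: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # Split the feature axis into heads: (n, d_model) -> (n_heads, n, d_head).
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    # Q, K, V are all linear maps of the same input x.
    q, k, v = split(x @ W_q), split(x @ W_k), split(x @ W_v)
    # Scaled dot-product scores, one (n, n) matrix per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                  # (n_heads, n, d_head)
    # Concatenate the head outputs, then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 8, 2                            # hypothetical sizes
x = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_self_attention(x, W_q, W_k, W_v, W_o, n_heads)
```

The output has the same shape as the input, and each head keeps its own (n, n) weight matrix whose rows sum to 1.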

Within the transformer framework, self-attention is itself a major innovation.

Code in Practice

Reference: this one is detailed

Needs another read

A comprehensive introduction to attention