PaperNotes: attention series (2) - ANMT

1. paper

Effective Approaches to Attention-based Neural Machine Translation 2015

2. keypoint

The paper proposes global attention and local attention for NMT. Global attention is similar to soft attention, while local attention is a hybrid that combines ideas from soft attention and hard attention.

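3. introduction

At the time, NMT was already in use, but attention architectures suited to it had been little explored, so this paper proposes the global and local attention architectures.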

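4. model

4.1 overview

The RNN unit is the LSTM, used in a stacked (multi-layer) configuration.

At prediction time, the output $h_t$ of the top LSTM layer and the context vector $c_t$ computed by the attention mechanism are combined into an attentional hidden state: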

\mathbf{\widetilde{h}_t}=tanh(\mathbf{W_c}[\mathbf{c_t;h_t}])

${\widetilde{h}_t}$ is then passed through a softmax layer to obtain the probability of each candidate target word at step $t$.

p(y_t|y_{<t}) = softmax(\mathbf{W_s\widetilde{h}_t})
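The two equations above can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation: the function names `softmax` and `predict_step`, the sizes `d` and `V`, and the random weights are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def predict_step(h_t, c_t, W_c, W_s):
    """One prediction step (hypothetical helper).

    h_t: (d,) top-layer LSTM state, c_t: (d,) context vector,
    W_c: (d, 2d), W_s: (V, d).
    """
    # Attentional hidden state: tanh(W_c [c_t; h_t])
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))
    # Distribution over the target vocabulary: softmax(W_s h_tilde)
    p = softmax(W_s @ h_tilde)
    return h_tilde, p

rng = np.random.default_rng(0)
d, V = 4, 10  # assumed hidden size and vocabulary size
h_t, c_t = rng.normal(size=d), rng.normal(size=d)
W_c = rng.normal(size=(d, 2 * d))
W_s = rng.normal(size=(V, d))
h_tilde, p = predict_step(h_t, c_t, W_c, W_s)
```

Here `p` has one entry per vocabulary word and sums to 1, and `h_tilde` has the same size as the decoder state.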

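The paper proposes two models, global and local, which differ mainly in how $c_t$ is computed. At prediction time, the global model computes alignment weights between the target hidden state $h_t$ and all source hidden states $\overline{h}_s$, and takes the weighted average of all source hidden states under those weights as $c_t$.

The local model instead predicts, from the target state, an aligned source position $p_t$; only the source hidden states inside a window around $p_t$ take part in the attention and weighted-average computation.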

4.2 global attention

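As shown in Figure 2 of the paper, computing $c_t$ takes all encoder hidden states into account. The alignment vector $\alpha_t$ is variable-length here, because its size equals the number of source time steps, which varies from sentence to sentence.

\mathbf{\alpha_t}(s) = align(\mathbf{h_t, \overline{h}_s})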
=\frac{exp(score(\mathbf{h_t, \overline{h}_s}))}{\sum_{s'}{exp(score(\mathbf{h_t, \overline{h}_{s'}}))}}
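As a rough illustration of the alignment equation above, the sketch below computes global attention weights and the context vector with NumPy. It assumes a plain dot product as the score (one of the options the paper lists); the name `global_attention` and the toy vectors are made up for the example.

```python
import numpy as np

def global_attention(h_t, H_src):
    """Global attention with a dot-product score (illustrative sketch).

    h_t: (d,) target hidden state, H_src: (S, d) all source hidden states.
    Returns the alignment weights alpha (S,) and the context vector c_t (d,).
    """
    scores = H_src @ h_t              # score(h_t, h_s) for every source position
    scores = scores - scores.max()    # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over source positions
    c_t = alpha @ H_src               # weighted average of source states
    return alpha, c_t

# Toy input: three source states, two dimensions.
H_src = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h_t = np.array([2.0, 0.0])
alpha, c_t = global_attention(h_t, H_src)
```

In this toy input the first and third source states have the same dot product with `h_t`, so they receive equal weight, and the second state receives less.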

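The score can be computed with content-based functions. The three variants below all measure how well $h_t$ matches $\overline{h}_s$ and differ only in how that comparison is parameterized.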

score(\mathbf{h_t, \overline{h}_s})=
\begin{cases}
\mathbf{h_t^\top\overline{h}_s}&dot \\
\mathbf{h_t^\top W_a\overline{h}_s} & general \\
\mathbf{v_a}^\top tanh(\mathbf{W_a[h_t;\overline{h}_s]})& concat
\end{cases}

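There is also a location-based function, in which the alignment weights depend only on the target hidden state: $\alpha_t = softmax(\mathbf{W_a h_t})$.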
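The three content-based score functions can be written out directly. The sketch below is a minimal NumPy version; the function names, the toy vectors, and the shapes chosen for `W_a` and `v_a` are assumptions for illustration. Note that `W_a` has a different shape in the general form than in the concat form.

```python
import numpy as np

def score_dot(h_t, h_s):
    """dot: h_t^T h_s"""
    return h_t @ h_s

def score_general(h_t, h_s, W_a):
    """general: h_t^T W_a h_s, with W_a of shape (d, d)"""
    return h_t @ (W_a @ h_s)

def score_concat(h_t, h_s, W_a, v_a):
    """concat: v_a^T tanh(W_a [h_t; h_s]), with W_a (k, 2d) and v_a (k,)"""
    return v_a @ np.tanh(W_a @ np.concatenate([h_t, h_s]))

h_t = np.array([1.0, 2.0])
h_s = np.array([3.0, 4.0])

s_dot = score_dot(h_t, h_s)                     # 1*3 + 2*4 = 11
s_general = score_general(h_t, h_s, np.eye(2))  # identity W_a reduces to dot

rng = np.random.default_rng(0)
W_a = rng.normal(size=(3, 4))  # assumed inner size k = 3
v_a = rng.normal(size=3)
s_concat = score_concat(h_t, h_s, W_a, v_a)
```

With an identity `W_a`, the general score reduces to the dot score, which shows that dot is the parameter-free special case of general.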

4.3 local attention

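Global attention has to attend to all source hidden states for every target word, which is expensive and can become impractical for long sequences (e.g. paragraphs or documents). Local attention is proposed to address this. For each target word, the model first produces an aligned position $p_t$; given a window parameter $D$, the context vector $c_t$ is the weighted average of the source hidden states in the window $[p_t-D, p_t+D]$. Note that here the alignment vector $\alpha_t$ has a fixed length, because $D$ is fixed.

Two ways of computing $p_t$ are proposed.

Monotonic alignment (local-m). $p_t$ plays the same role as word alignment in SMT models. One can simply set $p_t = t$, assuming that the source and target sequences are roughly monotonically aligned.
Predictive alignment (local-p).
$p_t=S \cdot sigmoid(\mathbf{v}_p^\top tanh(\mathbf{W_ph_t}))$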

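$\mathbf{W_p}$ and $\mathbf{v_p}$ are model parameters, and $S$ is the source sentence length. To favor alignment points close to $p_t$, the alignment weights are multiplied by a Gaussian centered at $p_t$:

\alpha_t(s)=align(\mathbf{h_t, \overline{h}_s})exp(-\frac{(s-p_t)^2}{2\sigma^2})

$\sigma=D/2$ (set empirically); $p_t$ is a real number, and $s$ is an integer within the window centered at $p_t$.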
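The predictive variant of local attention can be sketched as below. This is a hedged NumPy illustration rather than the paper's code. It assumes 0-indexed source positions, a dot-product score inside the window, an integer $D \ge 1$, and a window clipped at the sentence boundaries (so it holds at most $2D+1$ positions). The name `local_attention` and all sizes are hypothetical.

```python
import numpy as np

def local_attention(h_t, H_src, W_p, v_p, D):
    """Local attention with predictive alignment (illustrative sketch).

    h_t: (d,), H_src: (S, d), W_p: (k, d), v_p: (k,), D: integer >= 1.
    """
    S = H_src.shape[0]
    # p_t = S * sigmoid(v_p^T tanh(W_p h_t)), a real number between 0 and S
    p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))
    # Integer positions s with p_t - D <= s <= p_t + D, clipped to [0, S - 1]
    lo = max(0, int(np.ceil(p_t - D)))
    hi = min(S - 1, int(np.floor(p_t + D)))
    s = np.arange(lo, hi + 1)
    window = H_src[lo:hi + 1]
    # align(h_t, h_s): softmax of dot-product scores inside the window
    scores = window @ h_t
    scores = scores - scores.max()
    align = np.exp(scores) / np.exp(scores).sum()
    # Gaussian centered at p_t with sigma = D / 2
    sigma = D / 2.0
    alpha = align * np.exp(-((s - p_t) ** 2) / (2.0 * sigma ** 2))
    c_t = alpha @ window
    return p_t, s, alpha, c_t

rng = np.random.default_rng(0)
S, d, D = 10, 4, 2  # assumed source length, hidden size, window half-width
H_src = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
W_p = rng.normal(size=(d, d))
v_p = rng.normal(size=d)
p_t, s, alpha, c_t = local_attention(h_t, H_src, W_p, v_p, D)
```

Because the Gaussian factor is at most 1, the weights `alpha` in this sketch sum to at most 1 and are not renormalized, matching the formula as written above.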

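4.4 input-feeding approach

Even with attention, the attentional decision at each step is made independently, without using earlier alignment information. (Alignment here is roughly a record of which source words have already been translated and which source word to focus on when translating the current word.) The paper proposes the input-feeding approach: the attentional hidden state $\mathbf{\widetilde{h}_t}$ is concatenated with the input at the next time step, and the result is fed to the LSTM. The benefits are:

The model has access to the alignment information from the previous step.
The network has both horizontal and vertical connections, which makes it deep in both directions.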
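A minimal sketch of the input-feeding loop is given below. To keep it short, a plain tanh RNN cell stands in for the stacked LSTM and a dot-product global attention is used. Both are simplifications, as are the zero initial states, the function names, and the sizes. The point is only the concatenation of the previous $\widetilde{h}_t$ with the current input.

```python
import numpy as np

def rnn_cell(x, h_prev, W_x, W_h):
    """Stand-in for the stacked LSTM: a plain tanh RNN cell (assumption)."""
    return np.tanh(W_x @ x + W_h @ h_prev)

def decode_with_input_feeding(embeddings, H_src, W_x, W_h, W_c):
    """embeddings: (T, e), H_src: (S, d), W_x: (d, e + d), W_h: (d, d), W_c: (d, 2d)."""
    d = W_h.shape[0]
    h = np.zeros(d)        # assumed zero initial decoder state
    h_tilde = np.zeros(d)  # no previous attentional state at the first step
    outputs = []
    for e_t in embeddings:
        # Input feeding: concatenate the previous attentional state with the input
        x_t = np.concatenate([e_t, h_tilde])
        h = rnn_cell(x_t, h, W_x, W_h)
        # Global attention with a dot-product score
        scores = H_src @ h
        alpha = np.exp(scores - scores.max())
        alpha = alpha / alpha.sum()
        c_t = alpha @ H_src
        # Attentional hidden state, fed into the next step
        h_tilde = np.tanh(W_c @ np.concatenate([c_t, h]))
        outputs.append(h_tilde)
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, e, d, S = 3, 5, 4, 6  # assumed target length, embedding size, hidden size, source length
embeddings = rng.normal(size=(T, e))
H_src = rng.normal(size=(S, d))
W_x = rng.normal(size=(d, e + d))
W_h = rng.normal(size=(d, d))
W_c = rng.normal(size=(d, 2 * d))
outputs = decode_with_input_feeding(embeddings, H_src, W_x, W_h, W_c)
```

Each row of `outputs` is the attentional hidden state for one target step. From the second step on, the state depends on the previous alignment through `h_tilde`.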

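5. experiments

Translation quality: the ensemble gives the best results (an ensemble is always good).

Effect of sentence length: the attention models also handle long sentences well, with no noticeable drop in quality for lengths between 40 and 70.

Comparison of attention variants: local + general performs best.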