Attention mechanism. Paper: Attention Is All You Need

Attention

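There are many detailed walkthroughs by other authors online. The article below interprets the paper by comparing the sequence models RNN and CNN with Attention from several angles, and analyzes in detail the Attention structure the paper adopts and where the position embedding comes from. It is well worth reading: 一文读懂「Attention is All You Need」| 附代码实现

Some discussion of details in the paper: Attention Is All You Need 每周论文一起读

This article explains the Encoder-Decoder framework and the Attention mechanism in detail: 深度学习中的注意力机制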

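As for how Attention is actually computed, abstracting over most current methods gives two steps: the first computes weight coefficients from the Query and the Keys, and the second uses those coefficients to take a weighted sum of the Values. The first step splits further into two stages: stage one computes the similarity or relevance between the Query and each Key, and stage two normalizes the raw scores from stage one. The Attention computation can therefore be abstracted into the three stages shown in Figure 10.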
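The three stages can be sketched in plain Python for a single query. This is a minimal illustration, not code from any of the referenced articles: the function name `attention_three_stages`, the dot product as the similarity measure, and the toy vectors are all assumptions made for the example.

```python
import math

def attention_three_stages(query, keys, values):
    """Return (weights, output) for one query vector."""
    # Stage 1: similarity between the query and each key (dot product assumed).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Stage 2: normalize the raw scores with a numerically stable softmax.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Stage 3: weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Hypothetical toy data: the first and third keys match the query equally well.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attention_three_stages(query, keys, values)
```

The weights sum to 1, and the second key, which is orthogonal to the query, receives the smallest weight.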

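Attention in top-conference papers: 从2017年顶会论文看Attention Model - PaperWeekly 第50期

A detailed video walkthrough with many small, vivid examples: The Illustrated Transformer

A survey-style article that covers many applications of Attention: 自然语言处理中的自注意力机制（Self-Attention Mechanism）

A blog post that explains Attention through examples: 一步步解析Attention is All You Need！

A breakdown of the Transformer structure: Attention Is All You Need 解读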

Attention Is All You Need

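Background

RNN architectures such as LSTM and GRU, as well as CNN architectures, have been applied successfully to sequence modeling, but they have drawbacks. One is that they struggle to capture relationships across long sequences. The other is computation time: the step-by-step dependency in an RNN makes parallel computation hard. Self-Attention, which relates different positions of a sequence to each other, has been applied successfully in many areas.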

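Model Architecture

The model has an encoder-decoder structure: the input sequence is $(\mathbf x_1,\mathbf x_2,...,\mathbf x_n)$, the encoded sequence is $(\mathbf z_1,\mathbf z_2,...,\mathbf z_n)$, and the decoded output sequence is $(\mathbf y_1,\mathbf y_2,...,\mathbf y_m)$.

Transformer Architecture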


Encoder

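There are 6 layers, each with 2 sub-layers. The first sub-layer is multi-head self-attention and the second is a position-wise fully connected feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization, so the output of each sub-layer is $LayerNorm(x + Sublayer(x))$. All sub-layers and the embedding layers produce outputs of dimension 512.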

Decoder

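There are also 6 layers, each with 3 sub-layers. Besides the two sub-layers found in the encoder, there is an Encoder-Decoder Attention layer that attends over the output of the encoder stack. As in the encoder, every sub-layer is followed by LayerNorm. The self-attention sub-layer in the Decoder is masked, mainly so that the prediction at position i uses only information from positions before i.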
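The masking idea can be sketched as a row-wise softmax over a lower-triangular score matrix. This is an illustrative sketch under assumptions: the helper name `causal_softmax` is hypothetical, and the diagonal is kept visible (row i sees columns j <= i), which still exposes only earlier outputs when the decoder input is shifted right by one position.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where row i may only attend to columns j <= i."""
    weight_rows = []
    for i, row in enumerate(scores):
        # Keep the lower triangle; future positions are set to -inf.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weight_rows.append([e / total for e in exps])
    return weight_rows

# Hypothetical scores: all equal, so each row spreads weight evenly over what it may see.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_softmax(scores)
```

Every entry above the diagonal ends up with weight exactly 0, so no position receives information from later positions.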


Attention

An Attention function maps a query and a set of key-value pairs to an output, where the query, keys, and values are all vectors. The output is a weighted sum of the values.

Scaled Dot-Product Attention


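The inputs are queries and keys of dimension $d_k$ and values of dimension $d_v$. Compute the dot products of the query with all keys, divide each by $\sqrt {d_k}$, and apply softmax to obtain the weights on the values: $Attention(Q,K,V) = softmax(\frac {QK^T} {\sqrt {d_k} })V$. Dividing by $\sqrt {d_k}$ mainly keeps the dot products from growing too large. Otherwise the softmax outputs are pushed close to 0 or 1, where the back-propagated gradients are very small.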
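A small numeric sketch of why the scaling matters. The query and keys below are hypothetical values chosen so that the raw dot products are 64, 48, and 32 at $d_k = 64$; the point is only to compare the softmax output with and without the division by $\sqrt{d_k}$.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
# Hypothetical data: each key agrees with the query on 64, 48 or 32 coordinates.
query = [1.0] * d_k
keys = [[1.0] * 64, [1.0] * 48 + [0.0] * 16, [1.0] * 32 + [0.0] * 32]

raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
scaled = [s / math.sqrt(d_k) for s in raw]

unscaled_weights = softmax(raw)    # saturated: almost all weight on the first key
scaled_weights = softmax(scaled)   # softer: the other keys keep noticeable weight
```

Without scaling, the softmax is almost one-hot, which is the saturated regime where gradients vanish. With scaling, the second key still receives more than a tenth of the weight.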

Multi-Head Attention

$MultiHead(Q,K,V) = Concat(head_1,head_2,...,head_h)W^O$, where $head_i=Attention(QW_i^Q,KW_i^K,VW_i^V)$ with $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$. In effect, Attention is run $h$ times in parallel and the outputs are concatenated. Each Attention can be understood as a different view of the vector space. The paper uses $h=8$ and $d_k=d_v=d_{model}/h=64$.
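The formula can be traced end to end with a toy example in plain Python. This is a sketch under stated assumptions, not the paper's implementation: matrices are lists of rows, the helper names (`matmul`, `attention`, `multi_head`) are hypothetical, and the projection matrices are hand-picked (each head selects half of the features, $W^O$ is the identity) rather than learned.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for q (T_q x d_k), k (T_k x d_k), v (T_k x d_v)."""
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    return matmul([softmax(row) for row in scores], v)

def multi_head(q, k, v, w_q, w_k, w_v, w_o):
    """w_q, w_k, w_v hold one projection matrix per head; w_o mixes the concatenation."""
    heads = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    # Concatenate the per-head outputs along the feature axis, token by token.
    concat = [[x for head in heads for x in head[t]] for t in range(len(q))]
    return matmul(concat, w_o)

# Toy setup (assumed): d_model = 4, h = 2, d_k = d_v = d_model / h = 2.
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
projections = [first_half, second_half]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

tokens = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
out = multi_head(tokens, tokens, tokens, projections, projections, projections, identity)
single = multi_head(tokens[:1], tokens[:1], tokens[:1],
                    projections, projections, projections, identity)
```

The output keeps the shape of the input (3 tokens by 4 features). With a single token there is nothing else to attend to, so under these hand-picked projections the layer returns the token unchanged.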

Multi-head attention code walkthrough

import tensorflow as tf  # TensorFlow 1.x API (tf.layers, tf.variable_scope)

# The signature is reconstructed from the names used in the body;
# the default values are assumptions.
def multihead_attention(queries, keys, num_units=None, num_heads=8,
                        dropout_rate=0.0, is_training=True, causality=False,
                        scope="multihead_attention", reuse=None):
    with tf.variable_scope(scope, reuse=reuse):
        # Fall back to the last dimension of queries when num_units is not given
        if num_units is None:
            num_units = queries.get_shape().as_list()[-1]
        
        # Linear projections via dense layers (the values share the keys input here)
        Q = tf.layers.dense(queries, num_units, activation=None) # (N, T_q, C)
        K = tf.layers.dense(keys, num_units, activation=None) # (N, T_k, C)
        V = tf.layers.dense(keys, num_units, activation=None) # (N, T_k, C)
        # Split and concat
        # With 2 heads as an example: Q is split into two parts, q_i1 and q_i2,
        # each part is multiplied with its matching part of K, and all outputs b are concatenated at the end.
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0) # (h*N, T_q, C/h) 
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0) # (h*N, T_k, C/h) 
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0) # (h*N, T_k, C/h)
        # Multiplication  QK^T
        outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1])) # (h*N, T_q, T_k)
        # Scale
        # d is the dimension of q and k. The value of q*k grows with the dimension,
        # so the result is divided by sqrt(d) to rescale it.
        outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)
        # Key Masking
        # Short sentences are padded to a fixed length, but the padding tokens only equalize
        # lengths and carry no information, so they should be hidden from later computation with a mask.
        # Masked scores must be (close to) negative infinity so that their softmax weight is 0.
        # This relies on padded positions having all-zero key vectors.
        key_masks = tf.sign(tf.reduce_sum(tf.abs(keys), axis=-1)) # (N, T_k)
        key_masks = tf.tile(key_masks, [num_heads, 1]) # (h*N, T_k)
        key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1]) # (h*N, T_q, T_k)
        
        paddings = tf.ones_like(outputs)*(-2**32+1)
        outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs) # (h*N, T_q, T_k)
        # In Q_i * K_j with i < j the query would "see the future": q is the query vector and must not
        # interact with items that come later in the sequence, so all such interactions are forbidden.
        # This is usually done with a lower-triangular matrix; the upper triangle is the part to mask.
        # For example, after self-attention the output for input B depends only on A and B, not on later items.
        # Causality = Future blinding
        if causality:
            diag_vals = tf.ones_like(outputs[0, :, :]) # (T_q, T_k)
            tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense() # (T_q, T_k)
            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1]) # (h*N, T_q, T_k)
   
            paddings = tf.ones_like(masks)*(-2**32+1)
            outputs = tf.where(tf.equal(masks, 0), paddings, outputs) # (h*N, T_q, T_k)
  
        # Activation
        outputs = tf.nn.softmax(outputs) # (h*N, T_q, T_k)
         
        # Query Masking: the queries are masked as well, zeroing the rows of padded query positions
        query_masks = tf.sign(tf.reduce_sum(tf.abs(queries), axis=-1)) # (N, T_q)
        query_masks = tf.tile(query_masks, [num_heads, 1]) # (h*N, T_q)
        query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]]) # (h*N, T_q, T_k)
        outputs *= query_masks # element-wise. (h*N, T_q, T_k)
          
        # Dropouts
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))
               
        # Weighted sum
        outputs = tf.matmul(outputs, V_) # ( h*N, T_q, C/h)
        
        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2 ) # (N, T_q, C)
              
        # Residual connection: guards against vanishing gradients and lets attention layers be stacked deep
        outputs += queries
              
        # Normalize
        #outputs = normalize(outputs) # (N, T_q, C)

    return outputs

Layer Normalization

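LN has two sets of parameters, gamma and beta, whose shape matches the feature (last) dimension of the input x. Layer normalization and batch normalization differ in the axis they normalize over: batch normalization normalizes each feature dimension separately across the samples in a batch, while layer normalization normalizes within each sample across its features. Why is layer normalization needed here? It comes down to the application. The Transformer works on sentence sequences of varying length, but they must be processed at a common length, so shorter ones are filled with padding. Picture the batch as a matrix with one sample per row and one feature per column: layer normalization normalizes along rows and batch normalization along columns. Because sequence lengths vary, normalizing over feature columns easily runs into columns that are mostly padding, which makes the statistics hard to estimate accurately. Normalizing over rows is more stable.

LN code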

def normalize(inputs, 
              epsilon = 1e-8,
              scope="ln",
              reuse=None):
    '''Applies layer normalization.
    
    Args:
      inputs: A tensor with 2 or more dimensions, where the first dimension has
        `batch_size`.
      epsilon: A floating number. A very small number for preventing ZeroDivision Error.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
      
    Returns:
      A tensor with the same shape and data dtype as `inputs`.
    '''
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]
    
        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        beta= tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) / ( (variance + epsilon) ** (.5) )
        outputs = gamma * normalized + beta
        
    return outputs
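To make the difference in normalization axis concrete, here is a small plain-Python sketch on a batch of two samples with three features each. It is illustrative only: the helper names are hypothetical, the learned gamma and beta are left out, and the batch-norm variant uses only the statistics of the current batch.

```python
import math

def normalize_vector(xs, epsilon=1e-8):
    """Shift to zero mean and scale to unit variance."""
    mean = sum(xs) / len(xs)
    variance = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(variance + epsilon) for x in xs]

def layer_norm(batch):
    """Normalize each row (one sample) across its features."""
    return [normalize_vector(row) for row in batch]

def batch_norm(batch):
    """Normalize each column (one feature) across the samples in the batch."""
    columns = [normalize_vector(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

# Hypothetical batch: the second sample is the first one scaled by 10.
batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
ln = layer_norm(batch)
bn = batch_norm(batch)
```

Layer normalization maps both rows to (almost) the same vector because each row is rescaled on its own. Batch normalization compares the two samples within each column, so the smaller sample becomes roughly -1 everywhere and the larger one roughly +1.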

Position-wise Feed-Forward Networks

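The Feed-Forward layer in the encoder and decoder consists of two linear transformations with a ReLU in between: $FFN(x) = max(0,xW_1+b_1)W_2 + b_2$. Viewed another way, it is two convolutions with kernel size 1. The input and output dimension is $d_{model} = 512$, and the inner dimension is $d_{ff}=2048$.

FFN is a two-layer network: the hidden layer first expands the dimension by a factor of N (usually 4), and the output layer then restores it.

FFN code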

def feedforward(inputs, 
                num_units=[2048, 512],
                scope="multihead_attention", 
                dropout_rate=0.2,
                is_training=True,
                reuse=None):
    '''Point-wise feed forward net.
    
    Args:
      inputs: A 3d tensor with shape of [N, T, C].
      num_units: A list of two integers.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
        
    Returns:
      A 3d tensor with the same shape and dtype as inputs
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Inner layer
        params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1,
                  "activation": tf.nn.relu, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))
        # Readout layer
        params = {"inputs": outputs, "filters": num_units[1], "kernel_size": 1,
                  "activation": None, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))
        
        # Residual connection
        outputs += inputs
        
        # Normalize
        #outputs = normalize(outputs)
    
    return outputs
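Stripped of the TensorFlow machinery, residual connection, and dropout, the formula $FFN(x) = max(0,xW_1+b_1)W_2 + b_2$ applied to one position can be sketched in plain Python. The weights below are hypothetical and hand-picked (with $d_{model}=2$ and $d_{ff}=4$) so that the result is easy to check by hand. They are not learned values.

```python
def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    # Inner layer: expand to d_ff and apply ReLU.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*w1), b1)]
    # Readout layer: project back to d_model, no activation.
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*w2), b2)]

# Hand-picked weights: the hidden units hold relu(x0), relu(-x0), relu(x1), relu(-x1),
# and the readout recombines them, so this particular FFN reproduces its input.
w1 = [[1.0, -1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, -1.0]]           # shape (d_model, d_ff)
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [-1.0, 0.0],
      [0.0, 1.0], [0.0, -1.0]]         # shape (d_ff, d_model)
b2 = [0.0, 0.0]

sequence = [[3.0, -2.0], [0.5, 4.0]]
# "Position-wise": the same weights are applied to every position independently.
outputs = [ffn(x, w1, b1, w2, b2) for x in sequence]
```

Because the same function is mapped over each position with shared weights, this is equivalent to a convolution with kernel size 1 along the sequence.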

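Parameter Count Analysis

Let the number of Transformer layers be $l$, the hidden dimension $d$, the number of attention heads $h$, and the vocabulary size $V$.

Self-attention has the weight matrices $W_Q$, $W_K$, $W_V$ for Q, K, V with their biases, plus the output weight matrix $W_O$ and its bias. Each matrix has shape $[d, d]$ and each bias has shape $[d]$, giving $4d^2 + 4d$ parameters.

The FFN is a two-layer MLP. The first layer expands the dimension from $d$ by a factor of $N$ to $Nd$, and the second layer restores it. The first matrix has shape $[d, Nd]$ with bias $[Nd]$, and the second matrix has shape $[Nd,d]$ with bias $[d]$, giving $2Nd^2 + Nd + d$ parameters.

The input vocabulary Embedding matrix has shape $[V, d]$, with $Vd$ parameters.

The positional vectors have few parameters and are ignored.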

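Summing the self-attention and FFN terms gives $(4+2N)d^2 + (5+N)d$ parameters per Transformer layer. The embedding matrix is shared rather than repeated in every layer, so for $l$ layers the total is $l[(4+2N)d^2 + (5+N)d] + Vd$.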

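In practice the hidden dimension $d$ is usually large, so only the quadratic term matters. Many large models use an expansion factor of $N=4$, which gives approximately $12ld^2$ parameters.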
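The count can be checked with a few lines of Python. This is a sketch of the bookkeeping above, with the assumption that the embedding matrix is counted once rather than per layer. The function names and the example sizes (12 layers, $d=768$, a vocabulary of 30000) are hypothetical choices for illustration.

```python
def layer_params(d, expansion=4):
    """Parameters of one layer: self-attention plus FFN, weights and biases."""
    attention = 4 * d * d + 4 * d                          # W_Q, W_K, W_V, W_O and biases
    ffn = 2 * expansion * d * d + expansion * d + d        # two matrices and two biases
    return attention + ffn

def total_params(num_layers, d, vocab_size, expansion=4):
    """All layers plus the embedding matrix, which is counted once."""
    return num_layers * layer_params(d, expansion) + vocab_size * d

def approx_params(num_layers, d):
    """Quadratic-term approximation for expansion factor N = 4."""
    return 12 * num_layers * d * d

per_layer = layer_params(768)
exact_layers = 12 * per_layer
approx = approx_params(12, 768)
total = total_params(12, 768, 30000)
```

For these sizes the linear terms contribute well under one percent of the per-layer total, which is why the $12ld^2$ approximation is usually good enough for the layers. The embedding term $Vd$ is not part of that approximation and can still be sizeable when the vocabulary is large.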

Positional Encoding

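The attention mechanism above captures the semantics of the sequence but does not handle the order of the sequence, that is, the position of each element, which matters a great deal for Seq2Seq models. There are many ways to encode position; the paper uses trigonometric functions: $PE_{(pos,2i)} = sin(pos / 10000^{2i / d_{model}})$ and $PE_{(pos,2i+1)} = cos(pos / 10000^{2i / d_{model}})$. This function is chosen because it carries relative position information: for any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, which is just the expansion of the angle-sum and angle-difference identities.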
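Both the encoding and the linearity claim can be checked numerically. The sketch below assumes an even $d_{model}$; the function names are hypothetical. `shift_encoding` applies the angle-sum identities $sin(a+b) = sin\,a\,cos\,b + cos\,a\,sin\,b$ and $cos(a+b) = cos\,a\,cos\,b - sin\,a\,sin\,b$ to each (sin, cos) pair, which is a linear map whose coefficients depend only on the offset $k$.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position; d_model is assumed to be even."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])  # dimensions 2i and 2i+1
    return pe

def shift_encoding(pe, k, d_model):
    """Compute PE(pos + k) from PE(pos) alone via a per-pair rotation."""
    shifted = []
    for i in range(d_model // 2):
        freq = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        shifted.append(s * math.cos(k * freq) + c * math.sin(k * freq))
        shifted.append(c * math.cos(k * freq) - s * math.sin(k * freq))
    return shifted

d_model = 8
origin = positional_encoding(0, d_model)
direct = positional_encoding(8, d_model)
via_shift = shift_encoding(positional_encoding(5, d_model), 3, d_model)
```

The encoding of position 8 obtained directly matches the one obtained by shifting position 5 by an offset of 3, up to floating-point error.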


Why Self-Attention

Low computational complexity per layer
Attention can be computed in parallel
Long-range dependencies are easy to learn