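1. Attention mechanism

The attention mechanism, as the name suggests, is a technique that lets a model focus on the important parts of its input and learn from them more thoroughly. It can be applied to any sequence model.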

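From the point of view of where attention is applied, it falls into two kinds: spatial attention and temporal attention. This split is mostly about the application. From the point of view of how attention acts, it can instead be divided into soft attention and hard attention, that is, whether the weight distribution that attention outputs is a one-hot distribution or a soft distribution. This directly affects how context information is selected.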
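
To make the soft/hard distinction concrete, here is a minimal sketch in plain Python. The function names and the scores are made up for illustration, and picking the hard position with argmax is an assumption (hard attention is often described with sampling instead).

```python
import math


def soft_attention_weights(scores):
    """Softmax over the scores: every position keeps a non-zero weight."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hard_attention_weights(scores):
    """One-hot weights: all of the mass goes to a single position.

    Assumption: the position is chosen deterministically with argmax.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


# Hypothetical relevance scores for three input positions.
scores = [2.0, 0.5, 1.0]
soft = soft_attention_weights(scores)  # a soft distribution that sums to 1
hard = hard_attention_weights(scores)  # a one-hot distribution
```

With soft weights every position still contributes a little to the context; with hard weights only the selected position does.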

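Why add attention:

When the input sequence is very long, the model struggles to learn a good vector representation.

As the input sequence grows, the original step-by-step (time-step) model performs worse and worse. This comes from a flaw in the structure of that model: all of the input context is squeezed into a fixed-length vector, which caps the capacity of the whole model. For now we call this original model a simple encoder-decoder model.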

The encoder-decoder structure is hard to interpret, which in turn makes it hard to design in a principled way.

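The basic idea of attention is to remove the restriction that the traditional encoder-decoder structure relies on one fixed-length internal vector for both encoding and decoding. It is implemented by keeping the intermediate outputs that the LSTM encoder produces for the input sequence, training a model to attend selectively to those outputs, and relating the output sequence to them as the model generates its output.

Put another way, the probability of generating each item of the output sequence depends on which items of the input sequence were selected.

An attention-based model is in essence a similarity measure: the more similar the current input is to the target state, the larger the weight given to that input. It is the original model with the idea of attention added.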
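
The idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: similarity is measured with a dot product (one of several possible choices), and `attend`, the encoder outputs and the decoder states are all hypothetical.

```python
import math


def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(decoder_state, encoder_outputs):
    """Return (weights, context) for one decoder step.

    Assumption: similarity between the decoder state and each encoder
    output is a plain dot product.
    """
    scores = [dot(decoder_state, h) for h in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    # Context vector: weighted sum of all retained encoder outputs.
    context = [sum(w * h[d] for w, h in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context


# Hypothetical encoder outputs, one vector per input time step.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Two hypothetical decoder states: each one gets its own context vector
# instead of sharing a single fixed-length summary of the input.
weights_a, context_a = attend([2.0, 0.0], encoder_outputs)
weights_b, context_b = attend([0.0, 2.0], encoder_outputs)
```

The first decoder state is most similar to the first encoder output and the second one to the second, so the two steps end up weighting the input differently.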

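Concretely, LSTM/RNN models with the traditional encoder-decoder structure have one problem: whatever the length of the input, it is encoded into a fixed-length vector, so the model learns (and decodes) long input sequences poorly. Attention overcomes this by selectively focusing on the relevant parts of the input when producing each output. Attention-based methods are widely used in sequence prediction tasks, including text translation and speech recognition.

2. Code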

import os
import re
import csv
import codecs
import numpy as np
import pandas as pd
from keras import backend as K
# Layer is also exported from keras.layers; keras.engine.topology is a legacy path.
from keras.layers import Layer
from keras import initializers, regularizers, constraints

np.random.seed(2018)  # fix the NumPy seed for reproducibility

class Attention(Layer):
    """Attention pooling over time: (batch, timesteps, features) -> (batch, features)."""

    def __init__(self,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        # Call the base constructor first so it cannot overwrite the attributes set below.
        super(Attention, self).__init__(**kwargs)
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias

        self.features_dim = 0  # set in build() once the input shape is known

    def build(self, input_shape):
        # Expected input shape: (batch, timesteps, num_features)
        assert len(input_shape) == 3
        self.step_dim = input_shape[1]
        self.features_dim = input_shape[-1]

        # One weight per feature, used to score every time step.
        self.W = self.add_weight(shape=(self.features_dim,),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)

        if self.bias:
            # One bias per time step, so the sequence length must be fixed.
            self.b = self.add_weight(shape=(self.step_dim,),
                                     initializer='zeros',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, inputs, mask=None):
        # The time axis is summed out in call(), so no mask is passed to the next layers.
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        # Score every time step: (batch * steps, d) . (d, 1), reshaped to (batch, steps)
        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                              K.reshape(self.W, (features_dim, 1))),
                        (-1, step_dim))
        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # Normalize so that the weights of each sample sum to 1.
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        # (batch, steps) -> (batch, steps, 1) so the weights broadcast over features
        a = K.expand_dims(a)
        weighted_input = x * a
        # Weighted sum over the time axis: (batch, features)
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        # The time axis is summed out: (batch, timesteps, features) -> (batch, features)
        return input_shape[0], self.features_dim
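
To see what the layer computes without running Keras, the forward pass can be replayed for a single sample in plain Python. This is a rough sketch, not the layer itself: `attention_pool` and the sample values are hypothetical, and `eps=1e-7` is assumed to stand in for `K.epsilon()`.

```python
import math


def attention_pool(x, w, b=None, mask=None, eps=1e-7):
    """Forward pass of the layer above for one sample.

    x:    list of T time steps, each a list of D features
    w:    list of D weights (plays the role of self.W)
    b:    optional list of T biases, one per time step (self.b)
    mask: optional list of T values, 1 for real steps and 0 for padding
    eps:  assumption, stands in for K.epsilon()
    """
    steps = len(x)
    # e_t = tanh(x_t . w + b_t)
    e = []
    for t in range(steps):
        score = sum(x_td * w_d for x_td, w_d in zip(x[t], w))
        if b is not None:
            score += b[t]
        e.append(math.tanh(score))
    # a_t = exp(e_t), masked, then normalized to sum to (almost exactly) 1
    a = [math.exp(e_t) for e_t in e]
    if mask is not None:
        a = [a_t * m_t for a_t, m_t in zip(a, mask)]
    total = sum(a) + eps
    a = [a_t / total for a_t in a]
    # output = sum over t of a_t * x_t, one value per feature
    dim = len(x[0])
    pooled = [sum(a[t] * x[t][d] for t in range(steps)) for d in range(dim)]
    return a, pooled


# Hypothetical sample: 3 time steps, 2 features, last step is padding.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, -0.5]
weights, pooled = attention_pool(x, w, b=[0.0, 0.0, 0.0], mask=[1, 1, 0])
```

The masked step gets weight 0, and the remaining weights are renormalized before the weighted sum collapses the time axis.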

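3. Summary

In short, attention is a weighted-sum mechanism. However the weights are computed and however the sum is taken, as long as the result is a weighted sum of hidden states with weights computed from information the model already has, it is attention. Self-attention is simply the weighted sum taken within a single sentence, as opposed to the seq2seq case, where the decoder takes a weighted sum over the encoder's hidden states.

In my own view self-attention applies more broadly, and key-value attention is just a more general definition of attention. All of the attention variants above fit the key-value form: in many cases k and v are treated as the same thing, and for self-attention it may even be query = key = value.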
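
As a rough illustration of the key-value view, the sketch below writes attention as a function of queries, keys and values, and gets self-attention by passing the same sequence for all three. The names `kv_attention` and `self_attention`, the unscaled dot-product score and the toy embeddings are assumptions made for this example.

```python
import math


def kv_attention(queries, keys, values):
    """Key-value attention over plain Python lists.

    Assumption: scores are unscaled dot products between a query and each key.
    Returns one output vector per query.
    """
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the values, weighted by query-key similarity.
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs


def self_attention(sequence):
    """Self-attention as the special case query = key = value."""
    return kv_attention(sequence, sequence, sequence)


# Hypothetical word embeddings for a three-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(sentence)
```

The seq2seq case fits the same function: pass the decoder state as the only query and the encoder hidden states as both keys and values.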