AI-Challenger Baseline: Fine-Grained Sentiment Analysis of User Reviews (0.70201), Part 1

Competition site: https://challenger.ai/competition/fsauor2018

A while back I spent some spare time on this competition, though I haven't touched it for quite a while now (too busy...). Here I share a simple baseline that scores about 0.70201 on the leaderboard. There is still room for improvement, so feel free to try ensembling on top of it.

Since the write-up is fairly long, I will split the baseline into two parts: Part 1 covers the models, and Part 2 covers the training process. (So nobody falls asleep...)

The GitHub address is given at the end of the post; stars and forks are welcome.

Competition overview:

  • Fine-grained sentiment analysis of online reviews is extremely valuable for understanding merchants and users in depth and for mining user sentiment, and it has very broad applications in the internet industry, mainly in personalized recommendation, intelligent search, product feedback, and business security. For this competition a large, high-quality dataset is provided, covering sentiment polarities for 20 fine-grained aspects across 6 major categories. Participants must build algorithms that mine sentiment from user reviews based on the annotated fine-grained aspect polarities; the organizers evaluate submissions by measuring the error between the predicted values and the ground truth.

Environment:

  • OS: Ubuntu 16.04
  • Models: Attention-RCNN, Attention-RNN, and CapsuleNet
  • Framework: Keras (my favorite deep learning framework, along with PyTorch)

Data preprocessing:

Corresponding file: Preprocess_char.ipynb

I work directly at the character level, so no word segmentation is needed and only a small stop-word list is used. It is fairly crude, but in practice it performs noticeably better than word-level models.

import random
random.seed(16)
import pandas as pd
from gensim.models.word2vec import Word2Vec

# stop-word set; the original notebook loads this elsewhere, so substitute your own list
stopwords = set()

data = pd.read_csv("ai_challenger_sentiment_analysis_trainingset_20180816/sentiment_analysis_trainingset.csv")

# remove stop words and generate a char-level sentence
def filter_char_map(arr):
    res = []
    for c in arr:
        if c not in stopwords and c != ' ' and c != '\xa0' and c != '\n' and c != '\ufeff' and c != '\r':
            res.append(c)
    return " ".join(res)
# get the chars of a sentence
def get_char(arr):
    res = []
    for c in arr:
        res.append(c)
    return list(res)

data.content = data.content.map(lambda x: filter_char_map(x))
data.content = data.content.map(lambda x: get_char(x))
data.to_csv("preprocess/train_char.csv", index=None)

line_sent = []
for s in data["content"]:
    line_sent.append(s)
word2vec_model = Word2Vec(line_sent, size=100, window=10, min_count=1, workers=4, iter=15)
word2vec_model.wv.save_word2vec_format("word2vec/chars.vector", binary=True)

validation = pd.read_csv("ai_challenger_sentiment_analysis_validationset_20180816/sentiment_analysis_validationset.csv")
validation.content = validation.content.map(lambda x: filter_char_map(x))
validation.content = validation.content.map(lambda x: get_char(x))
validation.to_csv("preprocess/validation_char.csv", index=None)

test = pd.read_csv("ai_challenger_sentiment_analysis_testa_20180816/sentiment_analysis_testa.csv")
test.content = test.content.map(lambda x: filter_char_map(x))
test.content = test.content.map(lambda x: get_char(x))
test.to_csv("preprocess/test_char.csv", index=None)

After preprocessing, three files are generated under the preprocess folder: train_char.csv, validation_char.csv, and test_char.csv.
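
The models below take an embeddings_matrix and a word_index as arguments, and building them is not shown in this post. As a rough sketch (my own, not code from the original repo), both can be derived directly from the char vectors saved above; since Word2Vec was trained with min_count=1, its vocabulary covers every training char.

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical sketch (all names here are my assumptions): derive word_index and
# embeddings_matrix from the char vectors trained above.
w2v = KeyedVectors.load_word2vec_format("word2vec/chars.vector", binary=True)
word_index = {char: i + 1 for i, char in enumerate(w2v.index2word)}  # index 0 is reserved for padding
embeddings_matrix = np.zeros((len(word_index) + 1, w2v.vector_size))
for char, i in word_index.items():
    embeddings_matrix[i] = w2v[char]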

Multi-class models

Attention Model

This has been the year of attention ("Attention is all you need"): virtually every classification model uses it, and it works very well. The attention model I use in this competition is adapted from a Kaggle kernel:

keras lstm attention glove840b,lb 0.043

# coding=utf8
from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints
from keras.layers.merge import _Merge


class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        """
        Keras Layer that implements an Attention mechanism for temporal data.
        Supports Masking.
        Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
        # Input shape
            3D tensor with shape: `(samples, steps, features)`.
        # Output shape
            2D tensor with shape: `(samples, features)`.
        :param kwargs:
        Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
        The dimensions are inferred based on the output shape of the RNN.
        Example:
            model.add(LSTM(64, return_sequences=True))
            model.add(Attention())
        """
        self.supports_masking = True
        # self.init = initializations.get('glorot_uniform')
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight((input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        input_shape = K.int_shape(x)

        features_dim = self.features_dim
        # step_dim = self.step_dim
        step_dim = input_shape[1]

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b[:input_shape[1]]

        eij = K.tanh(eij)

        a = K.exp(eij)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        # print(weighted_input.shape)
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        # return input_shape[0], input_shape[-1]
        return input_shape[0], self.features_dim
# end Attention


class JoinAttention(_Merge):
    def __init__(self, step_dim, hid_size,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        """
        Keras Layer that implements an Attention mechanism according to other vector.
        Supports Masking.
        # Input shape, list of
            2D tensor with shape: `(samples, features_1)`.
            3D tensor with shape: `(samples, steps, features_2)`.
        # Output shape
            2D tensor with shape: `(samples, features)`.
        :param kwargs:
        Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
        The dimensions are inferred based on the output shape of the RNN.
        Example:
            en = LSTM(64, return_sequences=False)(input)
            de = LSTM(64, return_sequences=True)(input2)
            output = JoinAttention(64, 20)([en, de])
        """
        self.supports_masking = True
        # self.init = initializations.get('glorot_uniform')
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.hid_size = hid_size
        super(JoinAttention, self).__init__(**kwargs)

    def build(self, input_shape):
        if not isinstance(input_shape, list):
            raise ValueError('A merge layer [JoinAttention] should be called '
                             'on a list of inputs.')
        if len(input_shape) != 2:
            raise ValueError('A merge layer [JoinAttention] should be called '
                             'on a list of 2 inputs. '
                             'Got ' + str(len(input_shape)) + ' inputs.')
        if len(input_shape[0]) != 2 or len(input_shape[1]) != 3:
            raise ValueError('A merge layer [JoinAttention] should be called '
                             'on a list of 2 inputs with first ndim 2 and second one ndim 3. '
                             'Got ' + str(len(input_shape)) + ' inputs.')

        self.W_en1 = self.add_weight((input_shape[0][-1], self.hid_size),
                                 initializer=self.init,
                                 name='{}_W0'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.W_en2 = self.add_weight((input_shape[1][-1], self.hid_size),
                                 initializer=self.init,
                                 name='{}_W1'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.W_de = self.add_weight((self.hid_size,),
                                 initializer=self.init,
                                 name='{}_W2'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)

        if self.bias:
            self.b_en1 = self.add_weight((self.hid_size,),
                                     initializer='zero',
                                     name='{}_b0'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
            self.b_en2 = self.add_weight((self.hid_size,),
                                     initializer='zero',
                                     name='{}_b1'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
            self.b_de = self.add_weight((input_shape[1][1],),
                                     initializer='zero',
                                     name='{}_b2'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b_en1 = None
            self.b_en2 = None
            self.b_de = None

        self._reshape_required = False
        self.built = True

    def compute_output_shape(self, input_shape):
        return input_shape[1][0], input_shape[1][-1]

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, inputs, mask=None):
        en = inputs[0]
        de = inputs[1]
        de_shape = K.int_shape(de)
        step_dim = de_shape[1]

        hid_en = K.dot(en, self.W_en1)
        hid_de = K.dot(de, self.W_en2)
        if self.bias:
            hid_en += self.b_en1
            hid_de += self.b_en2
        hid = K.tanh(K.expand_dims(hid_en, axis=1) + hid_de)
        eij = K.reshape(K.dot(hid, K.reshape(self.W_de, (self.hid_size, 1))), (-1, step_dim))
        if self.bias:
            eij += self.b_de[:step_dim]

        a = K.exp(eij - K.max(eij, axis=-1, keepdims=True))

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask[1], K.floatx())

        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = de * a
        return K.sum(weighted_input, axis=1)
# end JoinAttention

Now on to the models. Three models are covered.

Note that all the RNN layers used below are GRUs (specifically CuDNNGRU, so a GPU with cuDNN is required).
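
If you want to experiment on CPU, a plain GRU configured as below should be a functionally equivalent (though much slower) drop-in in recent Keras 2.x versions; this substitution is mine, not part of the original code.

from keras.layers import GRU

# My own CPU fallback for CuDNNGRU(128, return_sequences=True):
# reset_after=True and a sigmoid recurrent activation match the cuDNN GRU formulation.
gru = GRU(128, return_sequences=True,
          reset_after=True, recurrent_activation='sigmoid')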

1. Attention RNN Model

This one is straightforward: two GRU layers followed by an Attention layer, which acts as a weighted average over the time steps; its output is then concatenated with an average pooling and a max pooling of the sequence. A very intuitive idea, and a common Kaggler baseline.

import keras
from keras import Model
from keras.layers import *
from JoinAttLayer import Attention

class TextClassifier():

    def model(self, embeddings_matrix, maxlen, word_index, num_class):
        inp = Input(shape=(maxlen,))
        encode = Bidirectional(CuDNNGRU(128, return_sequences=True))
        encode2 = Bidirectional(CuDNNGRU(128, return_sequences=True))
        attention = Attention(maxlen)
        x_4 = Embedding(len(word_index) + 1,
                        embeddings_matrix.shape[1],
                        weights=[embeddings_matrix],
                        input_length=maxlen,
                        trainable=True)(inp)
        x_3 = SpatialDropout1D(0.2)(x_4)
        x_3 = encode(x_3)
        x_3 = Dropout(0.2)(x_3)
        x_3 = encode2(x_3)
        x_3 = Dropout(0.2)(x_3)
        avg_pool_3 = GlobalAveragePooling1D()(x_3)
        max_pool_3 = GlobalMaxPooling1D()(x_3)
        attention_3 = attention(x_3)
        x = keras.layers.concatenate([avg_pool_3, max_pool_3, attention_3], name="fc")
        x = Dense(num_class, activation="sigmoid")(x)

        adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, amsgrad=True)
        model = Model(inputs=inp, outputs=x)
        model.compile(
            loss='categorical_crossentropy',
            optimizer=adam)
        return model
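
For orientation, the class is meant to be used roughly as follows; maxlen=1000 and num_class=4 are my assumptions (I believe each aspect in this competition takes one of four sentiment labels), not constants from the original code.

# Hypothetical usage sketch; argument values are assumptions.
classifier = TextClassifier()
model = classifier.model(embeddings_matrix, maxlen=1000, word_index=word_index, num_class=4)
model.summary()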

2. Attention RCNN Model

The difference from the first model is that a CNN layer is added after the RNN to capture n-gram information. Why put the RNN before the CNN rather than the other way around? Interested readers may want to think that one over.

import keras
from keras import Model
from keras.layers import *
from JoinAttLayer import Attention

class TextClassifier():

    def model(self, embeddings_matrix, maxlen, word_index, num_class):
        inp = Input(shape=(maxlen,))
        encode = Bidirectional(CuDNNGRU(128, return_sequences=True))
        encode2 = Bidirectional(CuDNNGRU(128, return_sequences=True))
        attention = Attention(maxlen)
        x_4 = Embedding(len(word_index) + 1,
                        embeddings_matrix.shape[1],
                        weights=[embeddings_matrix],
                        input_length=maxlen,
                        trainable=True)(inp)
        x_3 = SpatialDropout1D(0.2)(x_4)
        x_3 = encode(x_3)
        x_3 = Dropout(0.2)(x_3)
        x_3 = encode2(x_3)
        x_3 = Dropout(0.2)(x_3)
        x_3 = Conv1D(64, kernel_size=3, padding="valid", kernel_initializer="glorot_uniform")(x_3)
        x_3 = Dropout(0.2)(x_3)
        avg_pool_3 = GlobalAveragePooling1D()(x_3)
        max_pool_3 = GlobalMaxPooling1D()(x_3)
        attention_3 = attention(x_3)
        x = keras.layers.concatenate([avg_pool_3, max_pool_3, attention_3])
        x = Dense(num_class, activation="sigmoid")(x)

        adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
        model = Model(inputs=inp, outputs=x)
        model.compile(
            loss='categorical_crossentropy',
            optimizer=adam
            )
        return model

3. Capsule Model

Reference: 先读懂CapsNet架构然后用TensorFlow实现:全面解析Hinton提出的Capsule (a Chinese walkthrough of Hinton's CapsNet architecture with a TensorFlow implementation)

Capsule net with GRU

import keras
from keras import Model
from keras.layers import *
from JoinAttLayer import Attention

def squash(x, axis=-1):
    # The standard squash nonlinearity is commented out below because
    # s_squared_norm can become really small; this version simply scales
    # each capsule vector to (approximately) unit length instead.
    # s_squared_norm = K.sum(K.square(x), axis, keepdims=True) + K.epsilon()
    # scale = K.sqrt(s_squared_norm) / (0.5 + s_squared_norm)
    # return scale * x
    s_squared_norm = K.sum(K.square(x), axis, keepdims=True)
    scale = K.sqrt(s_squared_norm + K.epsilon())
    return x / scale


# A Capsule implementation in pure Keras
class Capsule(Layer):
    def __init__(self, num_capsule, dim_capsule, routings=3, kernel_size=(9, 1), share_weights=True,
                 activation='default', **kwargs):
        super(Capsule, self).__init__(**kwargs)
        self.num_capsule = num_capsule
        self.dim_capsule = dim_capsule
        self.routings = routings
        self.kernel_size = kernel_size
        self.share_weights = share_weights
        if activation == 'default':
            self.activation = squash
        else:
            self.activation = Activation(activation)

    def build(self, input_shape):
        super(Capsule, self).build(input_shape)
        input_dim_capsule = input_shape[-1]
        if self.share_weights:
            self.W = self.add_weight(name='capsule_kernel',
                                     shape=(1, input_dim_capsule,
                                            self.num_capsule * self.dim_capsule),
                                     # shape=self.kernel_size,
                                     initializer='glorot_uniform',
                                     trainable=True)
        else:
            input_num_capsule = input_shape[-2]
            self.W = self.add_weight(name='capsule_kernel',
                                     shape=(input_num_capsule,
                                            input_dim_capsule,
                                            self.num_capsule * self.dim_capsule),
                                     initializer='glorot_uniform',
                                     trainable=True)

    def call(self, u_vecs):
        if self.share_weights:
            u_hat_vecs = K.conv1d(u_vecs, self.W)
        else:
            u_hat_vecs = K.local_conv1d(u_vecs, self.W, [1], [1])

        batch_size = K.shape(u_vecs)[0]
        input_num_capsule = K.shape(u_vecs)[1]
        u_hat_vecs = K.reshape(u_hat_vecs, (batch_size, input_num_capsule,
                                            self.num_capsule, self.dim_capsule))
        u_hat_vecs = K.permute_dimensions(u_hat_vecs, (0, 2, 1, 3))
        # final u_hat_vecs.shape = [None, num_capsule, input_num_capsule, dim_capsule]

        b = K.zeros_like(u_hat_vecs[:, :, :, 0])  # shape = [None, num_capsule, input_num_capsule]
        for i in range(self.routings):
            b = K.permute_dimensions(b, (0, 2, 1))  # shape = [None, input_num_capsule, num_capsule]
            c = K.softmax(b)
            c = K.permute_dimensions(c, (0, 2, 1))
            b = K.permute_dimensions(b, (0, 2, 1))
            outputs = self.activation(K.batch_dot(c, u_hat_vecs, [2, 2]))
            if i < self.routings - 1:
                b = K.batch_dot(outputs, u_hat_vecs, [2, 3])

        return outputs

    def compute_output_shape(self, input_shape):
        return (None, self.num_capsule, self.dim_capsule)


class TextClassifier():

    def model(self, embeddings_matrix, maxlen, word_index, num_class):
        input1 = Input(shape=(maxlen,))
        embed_layer = Embedding(len(word_index) + 1,
                                embeddings_matrix.shape[1],
                                input_length=maxlen,
                                weights=[embeddings_matrix],
                                trainable=True)(input1)
        embed_layer = SpatialDropout1D(0.28)(embed_layer)

        x = Bidirectional(
            CuDNNGRU(128, return_sequences=True))(
            embed_layer)
        x = Activation('relu')(x)
        x = Dropout(0.25)(x)
        x = Bidirectional(
            CuDNNGRU(128,  return_sequences=True))(
            x)
        x = Activation('relu')(x)
        x = Dropout(0.25)(x)
        capsule = Capsule(num_capsule=10, dim_capsule=16, routings=5,
                          share_weights=True)(x)
        # output_capsule = Lambda(lambda x: K.sqrt(K.sum(K.square(x), 2)))(capsule)
        capsule = Flatten()(capsule)
        capsule = Dropout(0.25)(capsule)
        output = Dense(num_class, activation='sigmoid')(capsule)
        model = Model(inputs=input1, outputs=output)
        model.compile(
            loss='binary_crossentropy',
            optimizer='adam',
            metrics=["categorical_accuracy"])
        return model

Code is the best documentation; once the code is posted, there isn't much left to add. The next post will walk through the training process in more detail. Compared with the models themselves, I personally think the training process matters more: it is what lets a model reach its full potential.

Comments and criticism are welcome.

GitHub address: pengshuang/AI-Comp