AI-Challenger Baseline: Fine-Grained User Review Sentiment Analysis (0.70201), Part 2

Original article: zhuanlan.zhihu.com

Continuing from the previous post, AI-Challenger Baseline: Fine-Grained User Review Sentiment Analysis (0.70201), Part 1.

Thanks for all the interest. In this second post I will walk through the model training process in detail and mix in some of my own training experience along the way, for everyone to discuss.

Once the model architecture is settled, the next question is how to train it.

Below are the points I consider most important when training the models.

Loss Function

For this competition, if we do not take a seq2seq approach and instead simply train 20 separate models, the loss function can be either categorical crossentropy or binary crossentropy.

Briefly, the difference between the two: categorical crossentropy is for multi-class problems, binary crossentropy is for multi-label problems. In a multi-class problem each sample has exactly one target class, while in a multi-label problem a sample can have several target classes at once.

For this task I tried both, and since the final prediction simply takes the class with the highest probability, the difference between the two losses is tiny. If you want to tackle the task with a seq2seq approach, however, I would recommend binary crossentropy.

Those are the two conventional loss functions; MSE is another option worth trying. Here is a question for readers who are good at math: can we optimize the f1-score directly? The f-score itself is not differentiable, but can we design an approximate f1-loss? It is worth thinking about; a sketch of one such approximation follows.
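One common approximation (just a sketch for discussion, not what was used for the 0.70201 score) replaces the hard counts in the F1 formula with the predicted probabilities, which keeps the expression differentiable:

import keras.backend as K

def soft_f1_loss(y_true, y_pred):
    # "Soft" true positives / false positives / false negatives:
    # use predicted probabilities instead of hard 0/1 decisions.
    tp = K.sum(y_true * y_pred, axis=0)
    fp = K.sum((1 - y_true) * y_pred, axis=0)
    fn = K.sum(y_true * (1 - y_pred), axis=0)
    soft_f1 = 2 * tp / (2 * tp + fp + fn + K.epsilon())
    # Average over the 4 classes and minimize 1 - F1.
    return 1 - K.mean(soft_f1)

# model.compile(loss=soft_f1_loss, optimizer="adam")

Note this approximates batch-level F1 rather than the corpus-level macro F1 used for scoring, so treat it as an experiment rather than a drop-in replacement.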

Early Stopping

For those using Keras, pay attention to how the metric is set up. If the f-score is computed as a training metric, what you actually get is the f-score of each batch, which is very unreliable. Instead, compute the model's f-score over the full validation set at the end of every epoch, so you can properly track how training is going.

Something like this:

from keras.callbacks import Callback
from sklearn.metrics import f1_score, recall_score, precision_score


def getClassification(arr):
    # Map a 4-dim probability vector back to the original label
    # {-2, -1, 0, 1} by taking the arg-max class.
    arr = list(arr)
    if arr.index(max(arr)) == 0:
        return -2
    elif arr.index(max(arr)) == 1:
        return -1
    elif arr.index(max(arr)) == 2:
        return 0
    else:
        return 1


class Metrics(Callback):
    def on_train_begin(self, logs={}):
        self.val_f1s = []
        self.val_recalls = []
        self.val_precisions = []

    def on_epoch_end(self, epoch, logs={}):
        # Evaluate macro precision/recall/F1 on the full validation set
        # once per epoch instead of relying on Keras's per-batch metrics.
        val_predict = list(map(getClassification, self.model.predict(self.validation_data[0])))
        val_targ = list(map(getClassification, self.validation_data[1]))
        _val_f1 = f1_score(val_targ, val_predict, average="macro")
        _val_recall = recall_score(val_targ, val_predict, average="macro")
        _val_precision = precision_score(val_targ, val_predict, average="macro")
        self.val_f1s.append(_val_f1)
        self.val_recalls.append(_val_recall)
        self.val_precisions.append(_val_precision)
        print(_val_f1, _val_precision, _val_recall)
        print("max f1")
        print(max(self.val_f1s))
        return

early_stop, as the name suggests, means stopping training as soon as the validation performance stops improving, which saves time. It is controlled through the patience parameter.
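A minimal sketch with Keras's built-in EarlyStopping callback, reusing the variable names from the training script below (it monitors val_loss here, since the macro F1 above is computed in a separate callback and is not exposed as a Keras metric; the patience value is just an illustrative choice):

from keras.callbacks import EarlyStopping

# Stop training once val_loss has not improved for 3 consecutive epochs.
early_stop = EarlyStopping(monitor="val_loss", patience=3, verbose=1)
model1.fit(input_train, Y_train_ltc, batch_size=batch_size, epochs=epochs,
           validation_data=(input_validation, Y_validation_ltc),
           callbacks=[early_stop, metrics], verbose=2)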

Class Weight

Class weight is the part I still find somewhat mysterious. In general, when the dataset is imbalanced, setting per-class sample weights can buy some improvement, but in this task, after I set class_weight for the 4 classes, the score actually got worse. If anyone knows why, please leave a comment so we can discuss.
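For reference, one way to set the weights (a sketch using scikit-learn's compute_class_weight, not necessarily how the weights were derived in my experiments; as noted above, this hurt my score, so A/B test it rather than assume it helps):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

labels = data["location_traffic_convenience"].values    # values in {-2, -1, 0, 1}
classes = np.array([-2, -1, 0, 1])                       # same order as the one-hot columns
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weight = dict(enumerate(weights))                  # Keras expects {class_index: weight}

model1.fit(input_train, Y_train_ltc, batch_size=batch_size, epochs=epochs,
           validation_data=(input_validation, Y_validation_ltc),
           class_weight=class_weight, verbose=2)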

EMA (Exponential Moving Average)

Reference: http://zangbo.me/2017/07/01/TensorFlow_6/ and the CSDN post "Exponential Moving Average (EMA)" by 年轻即出发.

Many readers have probably heard of EMA; it is widely used in deep learning, for example in BatchNorm layers and in gradient-descent methods such as RMSprop, Adadelta, and Adam.

It keeps a shadow copy of the trainable parameters and maintains a moving average of those parameters in that shadow copy. The update op is called after each training step to refresh the moving average.

My impression is that adding EMA effectively keeps the parameters from being updated too fast and acts a bit like bagging.
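The idea can be sketched as a Keras callback like the one below (this illustrates the mechanism rather than my exact EMA code; the decay value is just a typical choice):

import numpy as np
from keras.callbacks import Callback

class EMA(Callback):
    # Keep a shadow copy of the weights and update it after every batch:
    # shadow = decay * shadow + (1 - decay) * weights.
    def __init__(self, decay=0.999):
        super(EMA, self).__init__()
        self.decay = decay

    def on_train_begin(self, logs=None):
        self.shadow = [np.copy(w) for w in self.model.get_weights()]

    def on_batch_end(self, batch, logs=None):
        for s, w in zip(self.shadow, self.model.get_weights()):
            s *= self.decay
            s += (1.0 - self.decay) * w

    def apply_shadow(self):
        # Swap in the averaged weights before evaluating or predicting.
        self.backup = self.model.get_weights()
        self.model.set_weights(self.shadow)

    def restore(self):
        # Switch back to the raw weights to continue training.
        self.model.set_weights(self.backup)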

Learning Rate

When training the model, we can use a dynamically decayed learning rate to keep it from getting stuck in a local optimum.

My own recipe is as follows (see the sketch after this list):

  1. Train the model for enough epochs at the default learning rate (0.001) and keep the model with the highest validation score;
  2. Load the best model from the previous step, lower the learning rate to 0.0001, continue training, and again keep the best model;
  3. Load the best model from the previous step, remove the regularization (dropout, etc.), set the learning rate to 0.00001, and train until it peaks.
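In code, steps 2 and 3 amount to reloading the best checkpoint and lowering the optimizer's learning rate before calling fit again. A sketch, reusing names from the training script below ("model_ltc_best.hdf5" is a hypothetical file name; substitute whichever checkpoint scored best in the previous stage):

from keras import backend as K

model1.load_weights(model_dir + "model_ltc_best.hdf5")   # best model from stage 1
K.set_value(model1.optimizer.lr, 0.0001)                  # stage 2: 10x smaller learning rate
model1.fit(input_train, Y_train_ltc, batch_size=batch_size, epochs=epochs,
           validation_data=(input_validation, Y_validation_ltc),
           callbacks=callbacks_list, verbose=2)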

Max Length (maximum padded sequence length)

This looks unimportant, but it actually matters a lot. At first I assumed that padding to roughly twice the average review length would be enough (around max_length = 400 at the char level), but the score would not improve. After raising max_length to 1000, the macro f-score went up noticeably. My explanation is that in this multi-class setting, some of the very long reviews belong to the classes with few samples, and padding too short truncates them so they can never be classified correctly.
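A quick way to sanity-check the choice is to look at the length distribution before fixing maxlen (a sketch, assuming data["content"] already holds the char-level token lists as in the training script below):

import numpy as np

lengths = data["content"].apply(len).values
print(np.mean(lengths), np.percentile(lengths, 90), np.percentile(lengths, 99))
# If the 99th percentile is far beyond 2x the mean, a short maxlen will
# truncate exactly the long reviews that tend to carry the rare labels.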

Training Code Example

If you want to train all 20 models in a single run, remember to use Python's gc and Keras's clear_session.

from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))
import random
random.seed(42)
import numpy as np
import pandas as pd
from tensorflow import set_random_seed
set_random_seed(42)
from keras import backend as K
from keras.preprocessing import text, sequence
from keras.callbacks import ModelCheckpoint, Callback
from sklearn.metrics import f1_score, recall_score, precision_score
from keras.layers import *
from classifier_bigru import TextClassifier
from gensim.models.keyedvectors import KeyedVectors
import pickle
import gc


def getClassification(arr):
    arr = list(arr)
    if arr.index(max(arr)) == 0:
        return -2
    elif arr.index(max(arr)) == 1:
        return -1
    elif arr.index(max(arr)) == 2:
        return 0
    else:
        return 1


class Metrics(Callback):
    def on_train_begin(self, logs={}):
        self.val_f1s = []
        self.val_recalls = []
        self.val_precisions = []

    def on_epoch_end(self, epoch, logs={}):
        val_predict = list(map(getClassification, self.model.predict(self.validation_data[0])))
        val_targ = list(map(getClassification, self.validation_data[1]))
        _val_f1 = f1_score(val_targ, val_predict, average="macro")
        _val_recall = recall_score(val_targ, val_predict, average="macro")
        _val_precision = precision_score(val_targ, val_predict, average="macro")
        self.val_f1s.append(_val_f1)
        self.val_recalls.append(_val_recall)
        self.val_precisions.append(_val_precision)
        print(_val_f1, _val_precision, _val_recall)
        print("max f1")
        print(max(self.val_f1s))
        return


data = pd.read_csv("preprocess/train_char.csv")
data["content"] = data.apply(lambda x: eval(x[1]), axis=1)

validation = pd.read_csv("preprocess/validation_char.csv")
validation["content"] = validation.apply(lambda x: eval(x[1]), axis=1)

model_dir = "model_bigru_char/"
maxlen = 1200
max_features = 20000
batch_size = 128
epochs = 15
tokenizer = text.Tokenizer(num_words=None)
tokenizer.fit_on_texts(data["content"].values)
with open('tokenizer_char.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

word_index = tokenizer.word_index
w2_model = KeyedVectors.load_word2vec_format("word2vec/chars.vector", binary=True, encoding='utf8',
                                             unicode_errors='ignore')
embeddings_index = {}
embeddings_matrix = np.zeros((len(word_index) + 1, w2_model.vector_size))
word2idx = {"_PAD": 0}
vocab_list = [(k, w2_model.wv[k]) for k, v in w2_model.wv.vocab.items()]

for word, i in word_index.items():
    if word in w2_model:
        embedding_vector = w2_model[word]
    else:
        embedding_vector = None
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector

X_train = data["content"].values
Y_train_ltc = pd.get_dummies(data["location_traffic_convenience"])[[-2, -1, 0, 1]].values
Y_train_ldfbd = pd.get_dummies(data["location_distance_from_business_district"])[[-2, -1, 0, 1]].values
Y_train_letf = pd.get_dummies(data["location_easy_to_find"])[[-2, -1, 0, 1]].values
Y_train_swt = pd.get_dummies(data["service_wait_time"])[[-2, -1, 0, 1]].values
Y_train_swa = pd.get_dummies(data["service_waiters_attitude"])[[-2, -1, 0, 1]].values
Y_train_spc = pd.get_dummies(data["service_parking_convenience"])[[-2, -1, 0, 1]].values
Y_train_ssp = pd.get_dummies(data["service_serving_speed"])[[-2, -1, 0, 1]].values
Y_train_pl = pd.get_dummies(data["price_level"])[[-2, -1, 0, 1]].values
Y_train_pce = pd.get_dummies(data["price_cost_effective"])[[-2, -1, 0, 1]].values
Y_train_pd = pd.get_dummies(data["price_discount"])[[-2, -1, 0, 1]].values
Y_train_ed = pd.get_dummies(data["environment_decoration"])[[-2, -1, 0, 1]].values
Y_train_en = pd.get_dummies(data["environment_noise"])[[-2, -1, 0, 1]].values
Y_train_es = pd.get_dummies(data["environment_space"])[[-2, -1, 0, 1]].values
Y_train_ec = pd.get_dummies(data["environment_cleaness"])[[-2, -1, 0, 1]].values
Y_train_dp = pd.get_dummies(data["dish_portion"])[[-2, -1, 0, 1]].values
Y_train_dt = pd.get_dummies(data["dish_taste"])[[-2, -1, 0, 1]].values
Y_train_dl = pd.get_dummies(data["dish_look"])[[-2, -1, 0, 1]].values
Y_train_dr = pd.get_dummies(data["dish_recommendation"])[[-2, -1, 0, 1]].values
Y_train_ooe = pd.get_dummies(data["others_overall_experience"])[[-2, -1, 0, 1]].values
Y_train_owta = pd.get_dummies(data["others_willing_to_consume_again"])[[-2, -1, 0, 1]].values

X_validation = validation["content"].values
Y_validation_ltc = pd.get_dummies(validation["location_traffic_convenience"])[[-2, -1, 0, 1]].values
Y_validation_ldfbd = pd.get_dummies(validation["location_distance_from_business_district"])[[-2, -1, 0, 1]].values
Y_validation_letf = pd.get_dummies(validation["location_easy_to_find"])[[-2, -1, 0, 1]].values
Y_validation_swt = pd.get_dummies(validation["service_wait_time"])[[-2, -1, 0, 1]].values
Y_validation_swa = pd.get_dummies(validation["service_waiters_attitude"])[[-2, -1, 0, 1]].values
Y_validation_spc = pd.get_dummies(validation["service_parking_convenience"])[[-2, -1, 0, 1]].values
Y_validation_ssp = pd.get_dummies(validation["service_serving_speed"])[[-2, -1, 0, 1]].values
Y_validation_pl = pd.get_dummies(validation["price_level"])[[-2, -1, 0, 1]].values
Y_validation_pce = pd.get_dummies(validation["price_cost_effective"])[[-2, -1, 0, 1]].values
Y_validation_pd = pd.get_dummies(validation["price_discount"])[[-2, -1, 0, 1]].values
Y_validation_ed = pd.get_dummies(validation["environment_decoration"])[[-2, -1, 0, 1]].values
Y_validation_en = pd.get_dummies(validation["environment_noise"])[[-2, -1, 0, 1]].values
Y_validation_es = pd.get_dummies(validation["environment_space"])[[-2, -1, 0, 1]].values
Y_validation_ec = pd.get_dummies(validation["environment_cleaness"])[[-2, -1, 0, 1]].values
Y_validation_dp = pd.get_dummies(validation["dish_portion"])[[-2, -1, 0, 1]].values
Y_validation_dt = pd.get_dummies(validation["dish_taste"])[[-2, -1, 0, 1]].values
Y_validation_dl = pd.get_dummies(validation["dish_look"])[[-2, -1, 0, 1]].values
Y_validation_dr = pd.get_dummies(validation["dish_recommendation"])[[-2, -1, 0, 1]].values
Y_validation_ooe = pd.get_dummies(validation["others_overall_experience"])[[-2, -1, 0, 1]].values
Y_validation_owta = pd.get_dummies(validation["others_willing_to_consume_again"])[[-2, -1, 0, 1]].values

list_tokenized_train = tokenizer.texts_to_sequences(X_train)
input_train = sequence.pad_sequences(list_tokenized_train, maxlen=maxlen)

list_tokenized_validation = tokenizer.texts_to_sequences(X_validation)
input_validation = sequence.pad_sequences(list_tokenized_validation, maxlen=maxlen)

print("model1")
model1 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
file_path = model_dir + "model_ltc_{epoch:02d}.hdf5"
checkpoint = ModelCheckpoint(file_path, verbose=2, save_weights_only=True)
metrics = Metrics()
callbacks_list = [checkpoint, metrics]
history = model1.fit(input_train, Y_train_ltc, batch_size=batch_size, epochs=epochs,
                     validation_data=(input_validation, Y_validation_ltc), callbacks=callbacks_list, verbose=2)
del model1
del history
gc.collect()
K.clear_session()

print("model2")
model2 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
file_path = model_dir + "model_ldfbd_{epoch:02d}.hdf5"
checkpoint = ModelCheckpoint(file_path, verbose=2, save_weights_only=True)
metrics = Metrics()
callbacks_list = [checkpoint, metrics]
history = model2.fit(input_train, Y_train_ldfbd, batch_size=batch_size, epochs=epochs,
                     validation_data=(input_validation, Y_validation_ldfbd), callbacks=callbacks_list, verbose=2)
del model2
del history
gc.collect()
K.clear_session()

print("model3")
model3 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
file_path = model_dir + "model_letf_{epoch:02d}.hdf5"
checkpoint = ModelCheckpoint(file_path, verbose=2, save_weights_only=True)
metrics = Metrics()
callbacks_list = [checkpoint, metrics]
history = model3.fit(input_train, Y_train_letf, batch_size=batch_size, epochs=epochs,
                     validation_data=(input_validation, Y_validation_letf), callbacks=callbacks_list, verbose=2)
del model3
del history
gc.collect()
K.clear_session()
... (the same pattern repeats for the remaining 17 aspect models)

GitHub: pengshuang/AI-Comp