A Simple Text Classification Task with the Huggingface Trainer


Overview

A previous post, NLP实战高手课学习笔记(14):文本分类实践1--预处理与模型定义, walked through a simple text classification example from the NLP practical course. That example preprocessed the dataset with torchtext and used a plain LSTM as the model, neither of which reflects current mainstream practice. This post shows how to use the Huggingface Transformers library to quickly implement a modern text classification workflow.

Task Overview

Sentiment analysis is the process of automatically labeling data by its sentiment, such as positive, negative, or neutral. It allows companies to analyze data at scale, surface insights, and automate processes. A well-known sentiment analysis benchmark, the IMDB movie review dataset, contains a large number of user reviews, each expressing a strongly positive or negative sentiment. Here is a sample:

It hurt to watch this movie, it really did... I wanted to like it, even going in. 
Shot obviously for very little cash, I looked past and told myself to appreciate the inspiration. 
Unfortunately, although I did appreciate the film on that level, the acting and editing was terrible, and the last 25-30 minutes were severe thumb-twiddling territory. 
A 95 minute film should not drag. The ratings for this one are good so far, but I fear that the friends and family might have had a say in that one. What was with those transitions? 
Dear Mr. Editor, did you just purchase your first copy of Adobe Premiere and make it your main goal to use all the goofy transitions that come with that silly program? 
Anyway... some better actors, a little more passion, and some more appealing editing and this makes a decent movie.

The passage above is labeled as negative sentiment; the model must perform binary classification on long passages like this one. The previous post introduced Huggingface's Pipeline module, so let's try it here first.

Using a Pipeline

We use the pipeline API from the Transformers library for rapid prototyping:

from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)

Here we pass in two test examples as a list and get the following output:

[{'label': 'POSITIVE', 'score': 0.9998}, {'label': 'NEGATIVE', 'score': 0.9991}]

As expected, the pipeline returns the correct labels, although these two test examples are admittedly quite easy.
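By default, the sentiment-analysis pipeline downloads a DistilBERT model fine-tuned on SST-2 (distilbert-base-uncased-finetuned-sst-2-english in recent Transformers versions). To pin a specific checkpoint rather than rely on the default, you can pass the model name explicitly; a minimal sketch:

from transformers import pipeline

# Pin the checkpoint instead of relying on the pipeline's default model.
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(sentiment_pipeline(["I love you", "I hate you"]))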

Training a Model with the Trainer

Having warmed up on sentiment analysis with the pipeline approach, let's get to the main topic: training our own model with the Trainer API.

Loading the Dataset

First, we load the IMDB movie review data. Since the full dataset is large, we take only 3,000 examples for training and 300 for testing.

from datasets import load_dataset

imdb = load_dataset("imdb")
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(3000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(300))
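To sanity-check what was loaded, we can inspect a single record; each IMDB example is a dict with a text field and a label field (0 = negative, 1 = positive). A quick look, not part of the original code:

# Each example is a dict with "text" and "label" keys.
print(small_train_dataset[0]["label"])       # 0 (negative) or 1 (positive)
print(small_train_dataset[0]["text"][:200])  # first 200 characters of the review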

Data Preprocessing

With the data in hand, we tokenize and encode it:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
 
tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)

Here, each input text is tokenized by the tokenizer and converted into a sequence of token IDs.
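To see what the tokenizer actually produces, we can run it on a short string; the output contains input_ids and an attention_mask (an illustrative snippet, not from the original post):

# The tokenizer maps raw text to token IDs plus an attention mask.
encoded = tokenizer("I love this movie!", truncation=True)
print(encoded["input_ids"])       # token ID sequence, wrapped in [CLS] ... [SEP]
print(encoded["attention_mask"])  # all 1s here, since nothing is padded yet
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))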

To speed up training, we use a data_collator that converts training samples into PyTorch tensors and pads each batch to the length of its longest sequence:

from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
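We can see the dynamic padding in action by collating two sequences of different lengths by hand (an illustrative sketch; the sentences are made up):

# The shorter sequence is padded to match the longer one, and the
# result comes back as ready-to-use PyTorch tensors.
features = [
    tokenizer("a short review"),
    tokenizer("a somewhat longer review with quite a few more tokens"),
]
batch = data_collator(features)
print(batch["input_ids"].shape)    # both rows share the same length
print(batch["attention_mask"][0])  # trailing 0s mark the padding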

Loading the Model

Next, we initialize the model, using DistilBERT as an example:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Here we set num_labels=2 to make this explicitly a binary classification task.
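Optionally, human-readable label names can be attached at load time via id2label and label2id, so that downstream inference reports NEGATIVE/POSITIVE instead of the generic LABEL_0/LABEL_1. A variant of the call above, not used in the original post:

# Same model, but with explicit label names stored in its config.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)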

Defining the Evaluation Function

Next, we define an evaluation function to measure the model's performance:

import numpy as np
from datasets import load_metric
 
def compute_metrics(eval_pred):
    # note: in newer releases these metrics have moved to the separate
    # `evaluate` library (evaluate.load("accuracy")); load_metric works
    # with older versions of `datasets`
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}

Here we track prediction accuracy and the F1 score for the binary classification task.
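Before training, compute_metrics can be sanity-checked with made-up logits (the numbers below are purely illustrative):

# Both examples are predicted as class 1, but only the first label is 1,
# so accuracy is 0.5 and F1 is 2/3.
dummy_logits = np.array([[0.1, 0.9], [0.2, 0.8]])
dummy_labels = np.array([1, 0])
print(compute_metrics((dummy_logits, dummy_labels)))
# {'accuracy': 0.5, 'f1': 0.6666...}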

Initializing the Trainer

Next, we initialize a Trainer to get ready for training. First, we define the training arguments:

from transformers import TrainingArguments, Trainer
 
repo_name = "./finetuning-sentiment-model-3000-samples"
 
training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    push_to_hub=False,
)

Here we store the model checkpoints in a local directory, set the learning rate to 2e-5, use a batch size of 16 for both training and evaluation, train for 2 epochs in total, and apply a weight decay of 0.01.
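One optional tweak not used in the original run: adding evaluation_strategy="epoch" makes the Trainer run evaluation (and therefore compute_metrics) on the test split at the end of every epoch, rather than only when called manually:

# Optional: also evaluate on the test split at the end of each epoch.
training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    push_to_hub=False,
)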

With these arguments set, we instantiate a Trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Training

Finally, training takes a single line of code:

trainer.train()

When training finishes, the output looks like this:

TrainOutput(global_step=126, training_loss=0.39784825037396143, metrics={'train_runtime': 7407.6124, 'train_samples_per_second': 0.27, 'train_steps_per_second': 0.017, 'total_flos': 263009880425280.0, 'train_loss': 0.39784825037396143, 'epoch': 2.0})
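Since evaluation during training was not enabled in the TrainingArguments above, we can run it manually afterwards; this calls compute_metrics on the 300 test examples (the exact numbers depend on your run):

# Evaluate the fine-tuned model on the held-out test split.
metrics = trainer.evaluate()
print(metrics)  # includes eval_loss, eval_accuracy, and eval_f1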
